What is it?
Attempt to empirically show the difference in search time between Ordered and Unordered Binary Tree.
We're going to populate Ordered and Unordered Binary Trees with nodes that use words (strings of text) for keys, see Node.java. In case of Ordered tree, String.compareTo() is used to order keys based on lexicographical order. While Unordered tree is populated by randomly picking a path (left/right) from root to the leaf level. See Ordered.java and Unordered.java
The expected result for searching Ordered Binary Tree is O(clog(n)), and O(cn) for Unordered (pre-order tree traversal). Where n is the number of nodes (unique terms) in the Binary Tree and c is some constant representing the beefines of my computer.
- Build Ordered/Unordered Binary Tree with sub-set (size n) of terms from the corpus.
- Compute the average search time by searching for every term in the tree and then taking the average.
- Perform the experiment as the size of sub-set expands from 1,...,entire corpus.
A short story by Ellena Ashley, The Dragon Rock is used for the first experiment (dragon-rock.txt). Dragon Rock contains 636 unique terms out of 1429 total, which means total of 636 nodes in a Binary Tree, and depth of 7 when tree is balanced.
The difference between Ordered and Unordered search is visible, however, due to much noise from Java's VM and unbalace in trees, it is hard to approximate the function that best fits the results. We need more data.
Since the number of unique terms grows approximately linearly with the total number of terms
Let's go ahead and use all of the terms.
Looks better but still too noisy for approximation. How about we copy/paste the story 9 times.
Looks like we now have enough data to start approximating O(n) equation, however the Binary Tree stops growing after all unique terms are exhausted (after about 2000 terms).
Clearly, we need a better corpus.
The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (holmes.txt) contains 9580 unique words out of 120376 total!
Search in Unordered Binary Tree looks linear with respect to n while Ordered is way at the bottom accross entire graph.
We can now approximate Unordered Binary Tree search. Linear function should look something like t = c*n where we know n (number of nodes) and t (milliseconds) average time it takes to find a node. Averaging c by looking at few experiment data points we find that on this machine, c ~= 0.000016.
Similarly when we zoom-in on Ordered Binary Tree data, it begins to resemble a logarithmic curve: t = c*log(n) where c ~= 0.000055.
We have empirically shown the difference in search times between Unordered (pre-order tree traversal) and Ordered Binary Trees.