Back to my homepage

MTree Tester Applet

This example needs the Java plugin! You can get it here. If everything works fine, a new window with some circles in it should have popped up. (I have only tested it with Java Plugin 1.3 and IE 6, so please report if you don't see the window).

Oops! Something bad happened and I can't display this applet

This applet shows how an M-Tree (metric tree) can be used to make search operations on data sets in metric spaces pretty fast. It uses my MTree implementation in Java based on the C++ library by Patella, Ciaccia, and Zezula (see the project page at university of Bologna) and the underlying paper on Citeseer, and the Java port of the GiST library by the University of South Africa

Before you read on: turn of "display tree structure" before you add 1000s of points or perform a search operation! Unfortunately the VM used for browsers runs in client mode, which means faster startup but slower operation. Turning the VM to server mode can speed the demo up 80-100%.

What is it about?

In the demo you can add points to the index and perform a "range query" on it (if you click "search mode" a green circle appears around the cursor, and all points within the circle are marked green).The range query looks for the objects that are near the mouse cursor. The range query returns the answer to the question "give me all objects with distance <= 100 to the point the cursor points to".

There are a lot of index structures that support this type of query for multidimensional spaces even more efficiently. The M-Tree, however, is special in that way that only a distance function, which resturns the distance between two objects, must be known. This distance function d() must have the following properties:

A trivial distance function, which was applied here, is the euclidian distance between two points in a 2-dimensional (or multidimensional) vector space: d(a,b) = sqrt((b1-a1)^2 + (b2-a2)^2), where sqrt is the square root and ^2 is the power of two.
But the MTree index structure can also be applied to other data structures, such as strings, with the distance function being the edit distance, or Levenshtein distance, which also follows the triangle equation. I plan to add a demo on that pretty soon.

How the tree looks like

I want to be very brief here since there are already some quite accessible papers on the MTree.

The Demo

The default node size of the demo tree is 5 points. (When you clear the tree, a new tree with 8 points per node is created).

When starting the demo, some points are added to the index automatically, leaving you in a state right before a split. This means: In the beginning there is only one node. When you add the sixth point, this node is split: Two routing objects appear. If a new point is added (by pressing the left mouse button) it will be added to the nearest node (if within a covering radius) or the node whose covering radius has to be extended the least.

You can add random points by clicking on one of the "100/1000/10000" buttons. Make sure you turn of the display of the tree structure before (also before turning on the search), because it will be slow the whole thing down a lot and will sometime account for 99.9% of the time.

You can watch the system drawing the tree if you press the right mouse button.

What do the circles mean? Imagine you look on a tree from below (you're lying under the earth...). The thicker and the darker the lines are, and the larger the red boxes are, the nearer the routing objects are. Routing objects may be placed on top of each other, in which case you see a lighter circle with a darker border (lower objects are painted later). Each routing object contains all of its childs within the bounding circle. The "print tree" button prints the tree into a console window. Since a Java plugin applet doesn't have a console, this won't do anything.

When the tree works best (and when not)

The tree works best the more subtrees can be pruned. This will definitely be the case if you have different distinct clusters of objects. The worst case is random points as shown here.
But the MTree is the only known index structure for metric spaces. Its main problem is the overlap between different routing objects in the same level. An extension of the MTree exists, the SlimTree, which aims for reducing this amount of overlap. For n-dimensional spaces, there are better indexing methods, such as the R* tree.

Comparing this Implementation to other ones

The only Java implementation of an MTree that I know of was done in the XXL library project of the University of Marburg. This implementation, however, stresses an "elegant" design over speed. I found myself waiting for 24 hours to cluster 80.000 strings with the dbScan algorithm, which was unacceptable. My choice was to move to C++ (which would have caused me a lot of headache) or translate the C++ code to Java (which I did). I haven't made exact comparisons yet, but from what I have seen I conclude that the implementation shown here is about 10-15 times faster than Marburg's.

Indexing speed: The speed of the index is largely dependent on the speed of the distance function. On my 1000 MHz laptop and a 1.3.1 server VM I was able to add 10.000 points of the above example in about 2.5 seconds, after the JIT compiler was through. The levenshtein application, however, was a lot slower: It needed about 1.7 ms to add one string to the index with a nodesize of 5 (this number grows logarithmically as the tree grows). The time goes up very fast if the node size is increased (about 17 ms with node size 40)

Query speed: For queries with 2-dimensional objects, the query speed is very fast, as you can see with the above example (in the millisecond range even with a lot of objects). A similarity search in a collection of 300.000 strings and the configuration above took between 100 and 700 ms, which is still by far slower than I would have expected. I see room for improvement only in the indexing part.

I will have to compile numbers more thoroughly sometimes to provide more exact results.

Note that these figures can be a little worse in the applet because the client VM is used. Also, the display accounts for much of the delay, since it is recomputed at every mouse move (in search mode) or mouse click.

Weaknesses of the current Java implementation


Last Updated: 25-Jun-2002