Home

Supercluster.KDTree

This is a KD-Tree written completely in C#. KDTrees and similar data structures can provide amazing performance benefits for nearest neighbor and spatial searches when used correctly. This page is to help you figure out if you should be using a KDTree.

When to use a KDTree

KD-Trees are a type of data structure that can greatly improve the performance of nearest neighbor searches. By "greatly improve" I mean way faster than a linear search (iterating through every point, one by one). This is only true if you have a sufficient amount of data.

The amount of data you need for a KDTree to be useful depends on the dimensionality of your data of your data. The wikipedia article on KDTrees states "As a general rule, if the dimensionality is k, the number of points in the data, N, should be N >> 2^k." (The double ">" means much greater than.)

If you violate this rule then KDTree search performance degrades into linear search performance. In practice, a malformed KDTree may perform worse than a linear search as modern processors have various tricks to speed up iteration through arrays such as automatic vectorization, branch prediction and loop exchanging.

How many dimensions can I use?

That depends on the number of sample points you have. The number of points you need to achieve O(log n) search performance roughly doubles with each dimension. As a general rule, if your dimensionality is less than 10, and you have more than 10,000 points you will probably be fine. If your dimensionality is greater than 20, you will need roughly 32 million points to get normal O(log n) like performance. Double 32 million for every dimension over 20.

It may seem surprising how fast the KDTree because useless under high dimensions. This is true of almost all spatial data structures. This phenomenon is known as the curse of dimensionality

"But my data has like, 5000 dimensions! Help!"

If your data has many dimensions, and you need to perform nearest neighbor searches quickly, then you should probably look into to a data structure which allows approximate nearest neighbor searches. These methods usually produce the true nearest neighbor roughly more than 70% of the time and are much faster than a linear search. As a last piece of advice, a KDTree is a serial data structure (i.e., it is not parallelized). Linear searches can easily be made parallel for a quick performance boost.

"No problem, I have an endless supply of data."

If you are going to use this library on huge amounts of data, then make sure to enable the following garbage collection options your programs app.config file:

   <runtime>
     <gcServer enabled="true"/>
     <gcAllowVeryLargeObjects enabled="true" />
   </runtime>

Documentation and Tutorials:

MSDN Style Documentation: http://mathferret1013.github.io/Supercluster.KDTree
Nuget Package: Install-Package Supercluster.KDTree

Provide feedback

Saved searches