Below is a list of nearly all spaces (a space is a combination of the data and the distance function). The mnemonic name of a space is passed to the Python bindings as well as to the benchmarking utility `experiment`. When initializing the space in the Python bindings, please use the type `FLOAT` for all spaces, except the space `leven`: see the description here. A more detailed description is given in the manual.
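As a minimal sketch (assuming the standard `nmslib` Python bindings; the method name `hnsw` is chosen purely for illustration), the space name and the distance value type are passed to `nmslib.init`:

```python
import nmslib

# Most spaces use a floating-point distance value (DistType.FLOAT).
# The space 'leven' is the exception: it needs DistType.INT
# (see the Levenshtein example at the end of this section).
index = nmslib.init(method='hnsw', space='l2', dtype=nmslib.DistType.FLOAT)
```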
In some rare cases, spaces have parameters. Currently, this is done only for the Lp spaces and the Renyi divergences. Specifying parameters is done differently in the Python bindings and the command-line utility `experiment`. In the Python bindings, one needs to specify a dictionary of space parameters; see this notebook for an example. In the command line, parameter specifiers go after the colon sign. For example, `lp:p=3` denotes the L3 space and `lp:p=2` is a synonym for the Euclidean, i.e., L2, space. Should we ever have more than one parameter, their specifications will be separated by commas.
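For instance, a sketch of passing the same `p=3` parameter through the Python bindings (the `space_params` dictionary is the mechanism mentioned above; the surrounding call assumes the standard `nmslib` API):

```python
import nmslib

# Python bindings: space parameters go into a dictionary.
# This corresponds to the command-line specifier lp:p=3.
index = nmslib.init(method='hnsw', space='lp', space_params={'p': 3})
```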
There can be more than one version of a distance function; the versions have different space/performance trade-offs. In particular, for distances that require computation of logarithms, we can achieve an order-of-magnitude improvement (e.g., for GNU C++ and Clang) by pre-computing logarithms at index time. This comes at the price of extra storage. In the case of the Jensen-Shannon distance functions, we can precompute some of the logarithms and accurately approximate those we cannot precompute. Fast versions of the inner-product distances, however, do not precompute anything. They employ fast SIMD instructions and a special data layout.
Straightforward slow implementations of the distance functions may have the substring `slow` in their names, while faster versions contain the substring `fast`. Fast functions that involve approximate computations additionally contain the substring `approx`.
For non-symmetric distance functions, a space may have two variants: one for left queries (the data object is the first, i.e., left, argument of the distance function, while the query object is the second argument) and another for right queries (the data object is the second argument and the query object is the first). In the latter case, the name of the space ends with `rq`.
For Python bindings, all dense-vector spaces require float32 numpy-array input (two-dimensional). See an example here. One exception is the squared Euclidean space for SIFT vectors, which requires input as uint8 integer numpy arrays. An example can be found here.
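A minimal dense-vector sketch (assuming the standard `nmslib` Python bindings; the random data is only illustrative):

```python
import numpy as np
import nmslib

data = np.random.rand(1000, 128).astype(np.float32)  # two-dimensional float32 array

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'post': 2})

# Query with a single float32 vector; returns (ids, distances).
ids, distances = index.knnQuery(data[0], k=10)
```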
For sparse spaces that include the Lp-spaces, the sparse cosine similarity, and the maximum-inner product space, the input data is a sparse scipy matrix. An example can be found here.
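A minimal sparse-vector sketch (assuming the standard `nmslib` Python bindings and a SciPy CSR matrix; the random data is only illustrative):

```python
import numpy as np
import nmslib
from scipy.sparse import random as sparse_random

# 1000 sparse vectors in a 5000-dimensional space, roughly 1% non-zeros.
data = sparse_random(1000, 5000, density=0.01, format='csr', dtype=np.float32)

index = nmslib.init(method='hnsw', space='cosinesimil_sparse',
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(data)
index.createIndex()

# Queries are rows of a sparse matrix; returns a list of (ids, distances) pairs.
results = index.knnQueryBatch(data[:5], k=10, num_threads=2)
```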
In all other cases, including the sparse Jaccard space, the bit Hamming and bit Jaccard spaces, as well as the Levenshtein spaces, the input data comes as an array of strings (one string per data point). All vector data is represented by space-separated numbers; for binary vectors, these are ones and zeros separated by spaces. A sparse Jaccard example can be found here. For the Levenshtein spaces, strings are expected to be ASCII (this is currently a limitation). You can pass a UTF8-encoded string, but the computed distance will sometimes be larger than the actual distance.
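A string-input sketch for the sparse Jaccard space (assuming the standard `nmslib` Python bindings; the toy data, with each object given as space-separated element IDs, is only illustrative):

```python
import nmslib

# One string per data point; each string lists the IDs of the set elements.
data = ['1 5 7 12', '2 5 9', '1 2 7 12 30']

index = nmslib.init(method='hnsw', space='jaccard_sparse',
                    data_type=nmslib.DataType.OBJECT_AS_STRING)
index.addDataPointBatch(data)
index.createIndex()

ids, distances = index.knnQuery('1 5 12', k=2)
```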
For dense vector spaces, the data can be either single-precision or double-precision floating-point numbers. However, double precision has not been useful so far and we do not recommend using it. In the case of SIFT vectors, though, vectors are stored more compactly: as 8-bit integers (Uint8).
For sparse vector spaces, we keep a float/double vector value and a 32-bit dimension number/ID. However, for fast variants of sparse spaces we use only 32-bit single-precision floats. Fast sparse-vector spaces are segmented into chunks, each containing 65536 IDs. Thus, if vectors have significant gaps between dimensions, such storage can be suboptimal.
Bit variants of the spaces (bit-Hamming and bit-Jaccard) are stored compactly as bit-vectors (each vector is an array of 32-bit integers).
Lp spaces are a classic family of spaces, which includes the Euclidean (L2) distance. They are metrics if p is at least one. The value of p can be fractional; however, computing fractional powers is slower with most math libraries. Our implementation is more efficient when fractions are of the form n/2^k, where k is a small integer.
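For reference, the usual textbook definition of the Lp distance between vectors x and y (not restated from the library sources) is:

```latex
d_p(x, y) = \Bigl( \sum_{i} |x_i - y_i|^p \Bigr)^{1/p}
```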
Space code | Description and Notes |
---|---|
`lp` | Generic Lp space with a parameter p |
`l1` | L1 |
`l2` | Euclidean space |
`linf` | L∞ |
`l2sqr_sift` | Euclidean distance for SIFT vectors (Uint8 storage) |
`lp_sparse` | sparse Lp space |
`l1_sparse` | sparse L1 |
`l2_sparse` | sparse Euclidean space |
`linf_sparse` | sparse L∞ |
Like the Lp spaces, inner-product based spaces exist in sparse and dense variants. Note that the cosine distance is defined as one minus the value of the cosine similarity; the cosine similarity is a normalized inner/scalar product. In contrast, the angular distance is a monotonically transformed cosine similarity, and the resulting distance is a true metric. The fast and slow variants of the sparse inner-product based distances differ in that the faster versions rely on SIMD and a special data layout.
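To make these relationships explicit (a sketch using conventional definitions; in particular, the arc-cosine form of the angular distance is an assumption rather than a statement from the library sources):

```latex
\mathrm{sim}(x, y)     = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}, \qquad
d_{\mathrm{cos}}(x, y) = 1 - \mathrm{sim}(x, y), \qquad
d_{\mathrm{ang}}(x, y) = \arccos\bigl(\mathrm{sim}(x, y)\bigr)
```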
Space code | Description and Notes |
---|---|
`cosinesimil` | dense cosine distance |
`negdotprod` | dense negative inner-product (for maximum inner-product search) |
`angulardist` | dense angular distance |
`cosinesimil_sparse`, `cosinesimil_sparse_fast` | sparse cosine distance |
`negdotprod_sparse`, `negdotprod_sparse_fast` | sparse negative inner-product |
`angulardist_sparse`, `angulardist_sparse_fast` | sparse angular distance |
The divergences that we use are defined only for dense vector spaces. The faster and approximate versions of these distances normally rely on precomputation of logarithms, which doubles the storage requirements. We implement distances for the regular and generalized KL-divergence, the Jensen-Shannon (JS) divergence, the family of Renyi divergences, and the Itakura-Saito distance. We also explicitly implement the JS-metric (the square root of the JS-divergence), which is a true metric distance.
For the meaning of the infixes `fast`, `slow`, `approx`, and `rq`, see the information above.
Space code(s) | Description and Notes |
---|---|
`kldivfast`, `kldivfastrq` | Regular KL-divergence |
`kldivgenslow`, `kldivgenfast`, `kldivgenfastrq` | Generalized KL-divergence |
`itakurasaitoslow`, `itakurasaitofast`, `itakurasaitofastrq` | Itakura-Saito distance |
`jsdivslow`, `jsdivfast`, `jsdivfastapprox` | JS-divergence |
`jsmetrslow`, `jsmetrfast`, `jsmetrfastapprox` | JS-metric |
`renyidiv_slow`, `renyidiv_fast` | Renyi divergence: parameter name `alpha` |
The Jaccard distance, the Levenshtein distance, and the Hamming distance are all true metrics. However, the normalized Levenshtein distance is mildly non-metric.
Space code | Description and Notes |
---|---|
`leven` | Levenshtein distance: requires an integer distance value |
`normleven` | the Levenshtein distance normalized by string lengths: requires a float distance value |
`jaccard_sparse` | sparse Jaccard distance |
`bit_jaccard` | bit Jaccard distance |
`bit_hamming` | bit Hamming distance |
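Finally, a Levenshtein sketch that ties together the earlier notes: `leven` takes string data and requires an integer distance value (assuming the standard `nmslib` Python bindings; the toy strings are only illustrative):

```python
import nmslib

data = ['hello', 'help', 'helmet', 'world']

index = nmslib.init(method='hnsw', space='leven',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING)
index.addDataPointBatch(data)
index.createIndex()

# Distances are integer edit distances.
ids, distances = index.knnQuery('helo', k=2)
```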