Below is a list of nearly all spaces (a space is a combination of the data and the distance function). The mnemonic name of a space is passed to the Python bindings as well as to the benchmarking utility `experiment`. When initializing the space in the Python bindings, please use the type `FLOAT` for all spaces, except the space `leven`: see the description here. A more detailed description is given in the manual.
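As a minimal sketch (assuming the standard `nmslib` Python bindings; the method name `hnsw` is chosen purely for illustration), the space name and the distance value type are passed to `nmslib.init`:

```python
import nmslib

# Most spaces use a floating-point distance value (DistType.FLOAT).
# The space 'leven' is the exception: it needs DistType.INT
# (see the Levenshtein example at the end of this section).
index = nmslib.init(method='hnsw', space='l2', dtype=nmslib.DistType.FLOAT)
```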
In some rare cases, spaces have parameters. Currently, this is done only for the Lp spaces and the Renyi divergences. Specifying parameters is done differently in the Python bindings and the command-line utility `experiment`. In the Python bindings, one needs to specify a dictionary of space parameters; see this notebook for an example. In the command line, parameter specifiers go after the colon sign. For example, `lp:p=3` denotes the L3 space and `lp:p=2` is a synonym for the Euclidean, i.e., L2, space. Should we ever have more than one parameter, their specifications will be separated by commas.
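For instance, a sketch of passing the same `p=3` parameter through the Python bindings (the `space_params` dictionary is the mechanism mentioned above; the surrounding call assumes the standard `nmslib` API):

```python
import nmslib

# Python bindings: space parameters go into a dictionary.
# This corresponds to the command-line specifier lp:p=3.
index = nmslib.init(method='hnsw', space='lp', space_params={'p': 3})
```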
There can be more than one version of a distance function; the versions have different space/performance trade-offs. In particular, for distances that require computation of logarithms, we can achieve an order-of-magnitude improvement (e.g., for GNU C++ and Clang) by pre-computing logarithms at index time. This comes at the price of extra storage. In the case of the Jensen-Shannon distance functions, we can precompute some of the logarithms and accurately approximate those we cannot precompute. Fast versions of the inner-product distances, however, do not precompute anything. They employ fast SIMD instructions and a special data layout.
Straightforward slow implementations of the distance functions may have the substring `slow` in their names, while faster versions contain the substring `fast`. Fast functions that involve approximate computations additionally contain the substring `approx`.
For non-symmetric distance functions, a space may have two variants: one for left queries (the data object is the first, i.e., left, argument of the distance function, while the query object is the second argument) and another for right queries (the data object is the second argument and the query object is the first). In the latter case, the name of the space ends with `rq`.
For Python bindings, all dense-vector spaces require float32 numpy-array input (two-dimensional). See an example here. One exception is the squared Euclidean space for SIFT vectors, which requires input as uint8 integer numpy arrays. An example can be found here.
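A minimal dense-vector sketch (assuming the standard `nmslib` Python bindings; the random data is only illustrative):

```python
import numpy as np
import nmslib

data = np.random.rand(1000, 128).astype(np.float32)  # two-dimensional float32 array

index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
index.createIndex({'post': 2})

# Query with a single float32 vector; returns (ids, distances).
ids, distances = index.knnQuery(data[0], k=10)
```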
For sparse spaces that include the Lp-spaces, the sparse cosine similarity, and the maximum-inner product space, the input data is a sparse scipy matrix. An example can be found here.
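A minimal sparse-vector sketch (assuming the standard `nmslib` Python bindings and a SciPy CSR matrix; the random data is only illustrative):

```python
import numpy as np
import nmslib
from scipy.sparse import random as sparse_random

# 1000 sparse vectors in a 5000-dimensional space, roughly 1% non-zeros.
data = sparse_random(1000, 5000, density=0.01, format='csr', dtype=np.float32)

index = nmslib.init(method='hnsw', space='cosinesimil_sparse',
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(data)
index.createIndex()

# Queries are rows of a sparse matrix; returns a list of (ids, distances) pairs.
results = index.knnQueryBatch(data[:5], k=10, num_threads=2)
```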
In all other cases, including the sparse Jaccard space, the bit Hamming and bit Jaccard spaces, as well as the Levenshtein spaces, the input data comes as an array of strings (one string per data point). All vector data is represented by space-separated numbers; for binary vectors, these are ones and zeros separated by spaces. A sparse Jaccard example can be found here. For the Levenshtein spaces, strings are expected to be ASCII (this is currently a limitation). You can pass a UTF8-encoded string, but the computed distance will sometimes be larger than the actual distance.
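A string-input sketch for the sparse Jaccard space (assuming the standard `nmslib` Python bindings; the toy data, with each object given as space-separated element IDs, is only illustrative):

```python
import nmslib

# One string per data point; each string lists the IDs of the set elements.
data = ['1 5 7 12', '2 5 9', '1 2 7 12 30']

index = nmslib.init(method='hnsw', space='jaccard_sparse',
                    data_type=nmslib.DataType.OBJECT_AS_STRING)
index.addDataPointBatch(data)
index.createIndex()

ids, distances = index.knnQuery('1 5 12', k=2)
```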
For dense vector spaces, the data can be either single-precision or double-precision floating-point numbers. However, double precision has not been useful so far and we do not recommend using it. In the case of SIFT vectors, though, vectors are stored more compactly: as 8-bit integers (Uint8).
For sparse vector spaces, we keep a float/double vector value and a 32-bit dimension number/ID. However, for fast variants of sparse spaces we use only 32-bit single-precision floats. Fast sparse-vector spaces are segmented into chunks, each containing 65536 IDs. Thus, if vectors have significant gaps between dimensions, such storage can be suboptimal.
Bit variants of the spaces (bit-Hamming and bit-Jaccard) are stored compactly as bit-vectors (each vector is an array of 32-bit integers).
Lp spaces are a classic family of spaces, which includes the Euclidean (L2) distance. They are metrics if p is at least one. The value of p can be fractional; however, computing fractional powers is slower with most math libraries. Our implementation is more efficient when fractions are of the form n/2^k, where k is a small integer.
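For reference, the usual textbook definition of the Lp distance between vectors x and y (not restated from the library sources) is:

```latex
d_p(x, y) = \Bigl( \sum_{i} |x_i - y_i|^p \Bigr)^{1/p}
```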
Space code | Description and Notes |
---|---|
`lp` | Generic Lp space with a parameter p |
`l1` | L1 |
`l2` | Euclidean space |
`linf` | L∞ |
`l2sqr_sift` | Euclidean distance for SIFT vectors (Uint8 storage) |
`lp_sparse` | sparse Lp space |
`l1_sparse` | sparse L1 |
`l2_sparse` | sparse Euclidean space |
`linf_sparse` | sparse L∞ |
Like the Lp spaces, inner-product based spaces exist in sparse and dense variants. Note that the cosine distance is defined as one minus the value of the cosine similarity; the cosine similarity is a normalized inner/scalar product. In contrast, the angular distance is a monotonically transformed cosine similarity, and the resulting distance is a true metric. The fast and slow variants of the sparse inner-product based distances differ in that the faster versions rely on SIMD and a special data layout.
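To make these relationships explicit (a sketch using conventional definitions; in particular, the arc-cosine form of the angular distance is an assumption rather than a statement from the library sources):

```latex
\mathrm{sim}(x, y)     = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}, \qquad
d_{\mathrm{cos}}(x, y) = 1 - \mathrm{sim}(x, y), \qquad
d_{\mathrm{ang}}(x, y) = \arccos\bigl(\mathrm{sim}(x, y)\bigr)
```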
Space code | Description and Notes |
---|---|
`cosinesimil` | dense cosine distance |
`negdotprod` | dense negative inner-product (for maximum inner-product search) |
`angulardist` | dense angular distance |
`cosinesimil_sparse`, `cosinesimil_sparse_fast` | sparse cosine distance |
`negdotprod_sparse`, `negdotprod_sparse_fast` | sparse negative inner-product |
`angulardist_sparse`, `angulardist_sparse_fast` | sparse angular distance |
The divergences that we use are defined only for dense vector spaces. The faster and approximate versions of these distances normally rely on precomputation of logarithms, which doubles the storage requirements. We implement distances for the regular and generalized KL-divergence, the Jensen-Shannon (JS) divergence, the family of Renyi divergences, and the Itakura-Saito distance. We also explicitly implement the JS-metric (the square root of the JS-divergence), which is a true metric distance.
For the meaning of the infixes `fast`, `slow`, `approx`, and `rq`, see the information above.
Space code(s) | Description and Notes |
---|---|
`kldivfast`, `kldivfastrq` | Regular KL-divergence |
`kldivgenslow`, `kldivgenfast`, `kldivgenfastrq` | Generalized KL-divergence |
`itakurasaitoslow`, `itakurasaitofast`, `itakurasaitofastrq` | Itakura-Saito distance |
`jsdivslow`, `jsdivfast`, `jsdivfastapprox` | JS-divergence |
`jsmetrslow`, `jsmetrfast`, `jsmetrfastapprox` | JS-metric |
`renyidiv_slow`, `renyidiv_fast` | Renyi divergence: parameter name `alpha` |
The Jaccard distance, the Levenshtein distance, and the Hamming distance are all true metrics. However, the normalized Levenshtein distance is mildly non-metric.
Space code | Description and Notes |
---|---|
`leven` | Levenshtein distance: requires an integer distance value |
`normleven` | the Levenshtein distance normalized by string lengths: requires a float distance value |
`jaccard_sparse` | sparse Jaccard distance |
`bit_jaccard` | bit Jaccard distance |
`bit_hamming` | bit Hamming distance |
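Finally, a Levenshtein sketch that ties together the earlier notes: `leven` takes string data and requires an integer distance value (assuming the standard `nmslib` Python bindings; the toy strings are only illustrative):

```python
import nmslib

data = ['hello', 'help', 'helmet', 'world']

index = nmslib.init(method='hnsw', space='leven',
                    dtype=nmslib.DistType.INT,
                    data_type=nmslib.DataType.OBJECT_AS_STRING)
index.addDataPointBatch(data)
index.createIndex()

# Distances are integer edit distances.
ids, distances = index.knnQuery('helo', k=2)
```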