Showing 10 changed files with 296 additions and 155 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
data/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
|
||
|
||
|
||
data/: | ||
mkdir -p $@ | ||
|
||
data/sift: data/ | ||
curl -o data/sift.tar.gz ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz | ||
tar -xvzf data/sift.tar.gz -C data/ | ||
rm data/sift.tar.gz | ||
|
||
data/gist: data/ | ||
curl -o data/gist.tar.gz ftp://ftp.irisa.fr/local/texmex/corpus/gist.tar.gz | ||
tar -xvzf data/gist.tar.gz -C data/ | ||
rm data/gist.tar.gz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,35 +1,25 @@ | ||
``` | ||
python3 bench/bench.py \ | ||
-n "sift1m" \ | ||
-i sift/sift_base.fvecs \ | ||
-q sift/sift_query.fvecs \ | ||
--sample 10000 --qsample 100 \ | ||
-k 10 | ||
``` | ||
|
||
``` | ||
python3 bench.py \ | ||
-n "sift1m" \ | ||
-i ../../sift/sift_base.fvecs \ | ||
-q ../../sift/sift_query.fvecs \ | ||
--qsample 100 \ | ||
-k 20 | ||
``` | ||
``` | ||
python3 bench.py \ | ||
-n "sift1m" \ | ||
-i ../../sift/sift_base.fvecs \ | ||
-q ../../sift/sift_query.fvecs \ | ||
--qsample 100 \ | ||
-x faiss,vec-scalar.4096,vec-static,vec-vec0.4096.16,vec-vec0.8192.1024,usearch,duckdb,hnswlib,numpy \ | ||
-k 20 | ||
``` | ||
|
||
|
||
|
||
``` | ||
python bench.py -n gist -i ../../gist/gist_base.fvecs -q ../../gist/gist_query.fvecs --qsample 100 -k 20 --sample 500000 -x faiss,vec-static,vec-scalar.8192,vec-scalar.16384,vec-scalar.32768,vec-vec0.16384.64,vec-vec0.16384.128,vec-vec0.16384.256,vec-vec0.16384.512,vec-vec0.16384.1024,vec-vec0.16384.2048 | ||
``` | ||
|
||
|
||
python bench.py -n gist -i ../../gist/gist_base.fvecs -q ../../gist/gist_query.fvecs --qsample 100 -k 20 --sample 500000 -x faiss,vec-static,sentence-transformers,numpy | ||
# `sqlite-vec` In-memory benchmark comparisons | ||
|
||
This repo contains a benchmark that compares KNN queries of `sqlite-vec` to other in-process vector search tools using **brute force linear scans only**. These include: | ||
|
||
|
||
- [Faiss IndexFlatL2](https://faiss.ai/) | ||
- [usearch with `exact=True`](https://github.com/unum-cloud/usearch) | ||
- [libsql vector search with `vector_distance_cos`](https://turso.tech/vector) | ||
- [numpy](https://numpy.org/), using [this approach](https://github.com/EthanRosenthal/nn-vs-ann) | ||
- [duckdb with `list_cosine_similarity`](https://duckdb.org/docs/sql/functions/nested.html#list_cosine_similaritylist1-list2) | ||
- [`sentence_transformers.util.semantic_search`](https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.semantic_search) | ||
- [hnswlib BFIndex](https://github.com/nmslib/hnswlib/blob/c1b9b79af3d10c6ee7b5d0afa1ce851ae975254c/TESTING_RECALL.md?plain=1#L8) | ||
|
||
|
||
Again **ONLY BRUTE FORCE LINEAR SCANS ARE TESTED**. This benchmark does **not** test approximate nearest neighbors (ANN) implementations. It is deliberately narrow: it tests only KNN searches using brute force. | ||
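To make "brute force linear scan" concrete, here is a minimal sketch in plain Python. It is not taken from `bench.py`; the function names and the toy data are hypothetical, and L2 distance is assumed only for illustration (some of the tools listed above use cosine distance instead). Every base vector is compared against the query and the `k` closest are kept, with no index structure involved.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(base, query, k):
    """Scan every base vector and return the k closest as (distance, index) pairs."""
    scored = ((l2_distance(vec, query), idx) for idx, vec in enumerate(base))
    return heapq.nsmallest(k, scored)


# Hypothetical toy data: four 2-dimensional base vectors and one query.
base = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
neighbors = brute_force_knn(base, [0.9, 0.1], k=2)
neighbor_ids = [idx for _, idx in neighbors]
```

The cost of each query grows linearly with the number of base vectors, which is the behavior this benchmark measures across tools.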
|
||
A few other caveats: | ||
|
||
- Only brute-force linear scans, no ANN | ||
- Only CPU is used. The only tool that does offer GPU is Faiss anyway. | ||
- Only in-memory datasets are used. Many of these tools do support serializing and reading from disk (including `sqlite-vec`) and possibly `mmap`'ing, but this only tests in-memory datasets, mostly because of numpy. | ||
- Queries are made one after the other, **not batched.** Some tools offer APIs to query multiple inputs at the same time, but this benchmark runs queries sequentially. This was done to emulate "server request"-style queries, where multiple users send queries at different times, making batching more difficult. Note that `sqlite-vec` does **not** support batched queries yet. | ||
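The "one after the other" caveat can be sketched as a simple timing loop. This is an illustrative assumption about the shape of such a harness, not the actual code in `bench.py`; `run_sequential` and `toy_search` are hypothetical names, and the search function is a stand-in for whichever tool is under test.

```python
import time


def run_sequential(search, queries):
    """Run each query on its own, recording the result and wall-clock time per query."""
    results = []
    timings = []
    for query in queries:
        start = time.perf_counter()
        results.append(search(query))
        timings.append(time.perf_counter() - start)
    return results, timings


# Hypothetical stand-in for a tool's KNN call: nearest value in a 1-D dataset.
data = [0.0, 1.0, 2.0, 3.0, 4.0]


def toy_search(query):
    return min(range(len(data)), key=lambda i: abs(data[i] - query))


results, timings = run_sequential(toy_search, [0.2, 2.9, 3.6])
mean_seconds = sum(timings) / len(timings)
```

Timing each call separately gives a per-query latency figure, which matches the "server request" framing better than the total time of a single batched call would.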
|
||
|
||
These tests are run in Python. Vectors are provided as an in-memory numpy array, and each test converts that numpy array into whatever makes sense for the given tool. For example, `sqlite-vec` tests will read those vectors into a SQLite table. DuckDB will read them into an Array array then create a DuckDB table from that. |
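As a rough illustration of the "read those vectors into a SQLite table" step, the sketch below packs each vector as a little-endian float32 BLOB and stores it using only Python's standard `sqlite3` module. The table name, column names, and helper functions are hypothetical, plain lists stand in for the numpy array, and the `sqlite-vec` extension itself is not loaded here, so this shows only the storage round trip and not how `bench.py` actually does it.

```python
import sqlite3
import struct


def to_float32_blob(vector):
    """Pack a sequence of floats as contiguous little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_blob(blob):
    """Unpack little-endian float32 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Hypothetical stand-in for the in-memory array of base vectors.
vectors = [[0.0, 0.5, 1.0], [2.0, 2.5, 3.0]]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE vectors(id INTEGER PRIMARY KEY, embedding BLOB)")
db.executemany(
    "INSERT INTO vectors(id, embedding) VALUES (?, ?)",
    [(i, to_float32_blob(v)) for i, v in enumerate(vectors)],
)

rows = db.execute("SELECT id, embedding FROM vectors ORDER BY id").fetchall()
restored = [from_float32_blob(blob) for _, blob in rows]
```

The conversion happens once per tool before any queries run, so each tool is queried against its own native representation of the same data.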
This file was deleted.