From f7730ac12e4530d104a29e458b89de5e08b6a86b Mon Sep 17 00:00:00 2001 From: Alex Garcia Date: Fri, 15 Nov 2024 10:40:44 -0800 Subject: [PATCH 1/7] experimental metadata docs --- site/metadata-beta.md | 299 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 299 insertions(+) create mode 100644 site/metadata-beta.md diff --git a/site/metadata-beta.md b/site/metadata-beta.md new file mode 100644 index 0000000..4ebed37 --- /dev/null +++ b/site/metadata-beta.md @@ -0,0 +1,299 @@ +# Experimental Metadata Filtering Builds + +The `sqlite-vec` project has a series of pull requests +([#122](https://github.com/asg017/sqlite-vec/pull/122), +[#123](https://github.com/asg017/sqlite-vec/pull/123), and +[#124](https://github.com/asg017/sqlite-vec/pull/124)) that will add proper +metadata column support to `vec0` virtual tables. + +But they aren't merged yet! So I've packaged pre-compiled extensions with these +features baked in, so that others can try it for themselves. Once those pull +requests are merged, this page will be removed. 
+ +As a quick sample, this is what metadata columns will look like: + +```sql +create virtual table vec_movies using vec0( + -- aliased primary key + movie_id integer primary key, + + -- vector column + synopsis_embedding float[1024], + + -- partition key (internally shards vectors) + user_id integer partition key, + + -- metadata columns (indexed alongside vectors) + genre text, + num_reviews int, + mean_rating float, + + -- auxiliary columns (not indexed) + +synopsis text +); + +select + movie_id, + title, + genre, + num_reviews, + mean_rating, + distance +from vec_movies +where synopsis_embedding match '[...]' + and genre = 'scifi' + and num_reviews between 100 and 500 + and mean_rating > 3.5 + and k = 5; +/* +┌──────────┬─────────────────────┬─────────┬─────────────┬──────────────────┬──────────┐ +│ movie_id │ title │ genre │ num_reviews │ mean_rating │ distance │ +├──────────┼─────────────────────┼─────────┼─────────────┼──────────────────┼──────────┤ +│ 13 │ 'The Matrix' │ 'scifi' │ 423 │ 4.5 │ 2.5 │ +│ 18 │ 'Inception' │ 'scifi' │ 201 │ 5.0 │ 2.5 │ +│ 21 │ 'Gravity' │ 'scifi' │ 342 │ 4.0 │ 5.5 │ +│ 22 │ 'Dune' │ 'scifi' │ 451 │ 4.40000009536743 │ 6.5 │ +│ 8 │ 'Blade Runner 2049' │ 'scifi' │ 301 │ 5.0 │ 7.5 │ +└──────────┴─────────────────────┴─────────┴─────────────┴──────────────────┴──────────┘ */ +``` + +## Install + +To try it out yourself, download one of the following ZIP files that contain +pre-compiled SQLite extensions. You can manually load them into your +Python/JavaScript/Ruby/etc. projects to try things out. 
+ +| Platform | Link | +| ------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| MacOS ARM | [`sqlite-vec-macos-aarch64-extension.zip`](https://fly.storage.tigris.dev/sqlite-vec-public-static/metadata-filtering-beta/v1-052ba4b/sqlite-vec-macos-aarch64-extension.zip) | +| MacOS x86_64 | [`sqlite-vec-macos-x86_64-extension.zip`](https://fly.storage.tigris.dev/sqlite-vec-public-static/metadata-filtering-beta/v1-052ba4b/sqlite-vec-macos-x86_64-extension.zip) | +| Linux ARM | [`sqlite-vec-linux-aarch64-extension.zip`](https://fly.storage.tigris.dev/sqlite-vec-public-static/metadata-filtering-beta/v1-052ba4b/sqlite-vec-linux-aarch64-extension.zip) | +| Linux x86_64 | [`sqlite-vec-linux-x86_64-extension.zip`](https://fly.storage.tigris.dev/sqlite-vec-public-static/metadata-filtering-beta/v1-052ba4b/sqlite-vec-linux-x86_64-extension.zip) | +| Windows x86_64 | [`sqlite-vec-windows-x86_64-extension.zip`](https://fly.storage.tigris.dev/sqlite-vec-public-static/metadata-filtering-beta/v1-052ba4b/sqlite-vec-windows-x86_64-extension.zip) | +| Cosmopolitan (`sqlite3` CLI with `sqlite-vec` baked in) | [`sqlite-vec-cosmopolitan.zip`](https://fly.storage.tigris.dev/sqlite-vec-public-static/metadata-filtering-beta/v1-052ba4b/sqlite-vec-cosmopolitan.zip) | + +To check which experimental version you are on, run `SELECT vec_version()`. The +most recent version is `v-metadata-experiment.01`. + +The rest of this document describes how to use the new metadata, +auxiliary, and partition key columns in these experimental builds. + +## Experimental Status + +This work isn't complete yet, so there are some subtle bugs and TODOs: + +- You cannot `UPDATE` a `PARTITION KEY` value yet. +- KNN queries with a `WHERE` constraint on a `TEXT` metadata column that's + longer than `12` characters will fail. 
+- `NULL` values are not allowed on metadata columns. +- `PARTITION KEY` columns currently support only the `=` operator, but `!=`, `<=`, `>=`, `<`, and `>` will be supported. + +These will be fixed before the official release. + +## Metadata in `vec0` Virtual Tables + +There are three ways to store non-vector columns in `vec0` virtual tables: +metadata columns, partition keys, and auxiliary columns. Each option has its +own benefits and limitations. + +```sql +create virtual table vec_chunks using vec0( + document_id integer primary key, + contents_embedding float[768], + + -- partition key column, denoted by 'partition key' + user_id integer partition key, + + -- metadata column, appears as normal column definition + label text, + + -- auxiliary column, denoted by '+' + +contents text +); +``` + +A quick summary of each option: + +| Column Type | Description | Benefits | Limitations | +| ----------------- | ----------------------------------------------------------------------- | ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- | +| Metadata columns | Stores boolean, integer, floating point, or text data alongside vectors | Can be included in the `WHERE` clause of a KNN query | Slower full scan, slightly inefficient with long strings (`> 12` characters) | +| Auxiliary columns | Stores any kind of data in a separate internal table | Eliminates need for an external `JOIN` | Cannot appear in the `WHERE` clause of a KNN query | +| Partition Key | Internally shards vector index on a given key | Makes selective queries much faster | Can cause oversharding and slow KNN if not used carefully. Should be +100's of vectors per unique partition key value | + +### Metadata Columns + +Metadata columns are extra "regular" columns that you can include in a `vec0` +table definition. 
These columns will be indexed along with declared vector +columns, and allow you to include extra `WHERE` constraints during KNN queries. + +```sql +create virtual table vec_movies using vec0( + movie_id integer primary key, + synopsis_embedding float[1024], + genre text, + num_reviews int, + mean_rating float, + contains_violence boolean +); +``` + +In the `vec0` constructor, the `genre`, `num_reviews`, `mean_rating`, and +`contains_violence` columns are metadata columns, with their specified type. + +A sample KNN query on this table could look like: + +```sql +select * +from vec_movies +where synopsis_embedding match '[...]' + and k = 5 + and genre = 'scifi' + and num_reviews between 100 and 500 + and mean_rating > 3.5 + and contains_violence = false; +``` + +The first two conditions in the `WHERE` clause (`synopsis_embedding match` and +`k = 5`) denote that the query is a KNN query. The other conditions are metadata +constraints that `sqlite-vec` will recognize and apply during the KNN +calculation. In other words, for the above query, a maximum of 5 rows would be +returned, all of which satisfy every `WHERE` constraint on their +metadata column values. + +#### Metadata Column Declaration + +Metadata columns are declared in the `vec0` constructor just like regular column definitions, with the column name first, then the column type. + +Only the following column types are supported in metadata columns. All these +columns are strictly typed. + +- `TEXT` for text and strings +- `INTEGER` for 8-byte integers +- `FLOAT` for 8-byte floating-point numbers +- `BOOLEAN` for 1-bit `0` or `1` + +Other column types may be supported in the future. Column type names are case +insensitive. + +Additional column constraints like `UNIQUE` or `NOT NULL` are not supported. + +A maximum of 16 metadata columns can be declared in a `vec0` virtual table. 
+ + +#### Supported operations + +Metadata column `WHERE` conditions in a KNN query will only work with the +following operators: + +- `=` Equal to +- `!=` Not equal to +- `>` Greater than +- `>=` Greater than or equal to +- `<` Less than +- `<=` Less than or equal to + +Using any other operator like `IS NULL`, `LIKE`, `GLOB`, `REGEXP`, or any scalar +function will result in an error or incorrect results. + +Boolean columns only support `=` and `!=` operators. + +### Partition Key Columns + +Partition key columns allow one to internally shard a vector index based on a given key. Any `=` constraint in a `WHERE` clause on a partition key column will be used to pre-filter vectors during a KNN search. + +For example, say you're performing vector search on a large dataset of documents. However, each document belongs to a user, and users can only search their own documents. It would be wasteful to perform a brute-force search over all documents if you only care about 1 user at a time. So, you can partition the vector index based on user ID like so: + +```sql +create virtual table vec_documents using vec0( + document_id integer primary key, + user_id integer partition key, + contents_embedding float[1024] +); +``` + +Then during a KNN query, you can constrain results to a specific user in the `WHERE` clause like so: + +```sql +select + document_id, + user_id, + distance +from vec_documents +where contents_embedding match :query + and k = 20 + and user_id = 123; +``` + +`sqlite-vec` will recognize the `user_id = 123` constraint and pre-filter vectors during a KNN search. Vectors with the same partition key values are stored together, so this is a fast operation. + +Another example: say you're performing vector search on a large dataset of news headlines from the past 100 years. However, in your application, most users only want to search a subset of articles based on when they were written, like "in the past ten years" or "during the Obama administration." 
You can partition based on published date like so: + +```sql +create virtual table vec_articles using vec0( + article_id integer primary key, + published_date text partition key, + headline_embedding float[1024] +); +``` + +And a KNN query: + +```sql +select + article_id, + published_date, + distance +from vec_articles +where headline_embedding match :query + and published_date between '2009-01-20' and '2017-01-20'; -- Obama administration +``` + +But be careful! Over-using partition key columns can lead to over-sharding and slower KNN queries. As a rule of thumb, make sure that every unique partition key value has ~100's of vectors associated with it. In the above examples, make sure that every user has on the order of dozens or hundreds of documents, or that every day has dozens or hundreds of articles. If they don't and you're noticing slow queries, try a broader partition key value, like `organization_id` or `published_month`. + +A maximum of 4 partition key columns can be declared in a `vec0` virtual table, but use caution if you find yourself using more than 1. Vectors are sharded along each unique combination, so over-sharding is more common with more partition key columns. + +### Auxiliary Columns + +Auxiliary columns store additional unindexed data separate from the internal vector index. They are meant for larger metadata that will never appear in a `WHERE` clause of a KNN query, eliminating the need for a separate `JOIN`. + +Auxiliary columns are denoted by a `+` prefix in their column definition, like so: + +```sql +create virtual table vec_chunks using vec0( + contents_embedding float[1024], + +contents text +); + +select + rowid, + contents, + distance +from vec_chunks +where contents_embedding match :query + and k = 10; +``` + +Here we store the text contents of each chunk in the `contents` auxiliary column. 
When we perform a KNN query, we can reference the `contents` column in the `SELECT` clause to get the raw text contents of the most relevant chunks. + +A similar approach can be used for image embeddings: + +```sql +create virtual table vec_image_chunks using vec0( + image_embedding float[1024], + +image blob +); + +select + rowid, + image, + distance +from vec_image_chunks +where image_embedding match :query + and k = 10; +``` + +Here the `image` auxiliary column can store the raw image file in a large `BLOB` column. It can appear in the `SELECT` clause of the KNN query, to get the most relevant raw images. + +In general, auxiliary columns are good for large text, blobs, URLs, or other datatypes that won't be a part of a `WHERE` clause of a KNN query. If your column will often appear in a `SELECT` clause but not the `WHERE` clause, then auxiliary columns are a good fit. + +A maximum of 16 auxiliary columns can be declared in a `vec0` virtual table. From 04c6da4c628071dc9322d1a9786dc793837f3479 Mon Sep 17 00:00:00 2001 From: Alex Garcia Date: Fri, 15 Nov 2024 10:43:44 -0800 Subject: [PATCH 2/7] update linux arm builds --- .github/workflows/release.yaml | 16 ++++++++++++++++ .github/workflows/test.yaml | 12 ------------ 2 files changed, 16 insertions(+), 12 deletions(-) diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml index e6057d0..a977b38 100644 --- a/.github/workflows/release.yaml +++ b/.github/workflows/release.yaml @@ -55,6 +55,18 @@ jobs: with: name: sqlite-vec-windows-x86_64-extension path: dist/* + build-linux-aarch64-extension: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - run: sudo apt-get install gcc-arm-linux-gnueabihf + - run: ./scripts/vendor.sh + - run: make sqlite-vec.h + - run: make CC=arm-linux-gnueabihf-gcc loadable static + - uses: actions/upload-artifact@v4 + with: + name: sqlite-vec-linux-aarch64-extension + path: dist/* build-cosmopolitan: runs-on: macos-latest permissions: @@ -190,6 +202,10 @@ 
jobs: with: name: sqlite-vec-linux-x86_64-extension path: dist/linux-x86_64 + - uses: actions/download-artifact@v4 + with: + name: sqlite-vec-linux-aarch64-extension + path: dist/linux-aarch64 - uses: actions/download-artifact@v4 with: name: sqlite-vec-macos-x86_64-extension diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml index a601521..abb8490 100644 --- a/.github/workflows/test.yaml +++ b/.github/workflows/test.yaml @@ -141,12 +141,7 @@ jobs: build-linux-aarch64-extension: runs-on: ubuntu-latest steps: - - uses: green-coding-solutions/eco-ci-energy-estimation@v4 - with: - task: start-measurement - uses: actions/checkout@v4 - with: - version: "latest" - run: sudo apt-get install gcc-arm-linux-gnueabihf - run: ./scripts/vendor.sh - run: make sqlite-vec.h @@ -155,13 +150,6 @@ jobs: with: name: sqlite-vec-linux-aarch64-extension path: dist/* - - uses: green-coding-solutions/eco-ci-energy-estimation@v4 - with: - task: get-measurement - label: "all" - - uses: green-coding-solutions/eco-ci-energy-estimation@v4 - with: - task: display-results build-wasm32-emscripten: runs-on: ubuntu-latest steps: From e412860897c5a03171733b2cab2e8e6fcc315c6d Mon Sep 17 00:00:00 2001 From: Alex Garcia Date: Fri, 15 Nov 2024 10:51:26 -0800 Subject: [PATCH 3/7] v0.1.4-alpha.3 --- VERSION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/VERSION b/VERSION index 8af76f5..8bffce1 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.1.4-alpha.2 \ No newline at end of file +0.1.4-alpha.3 \ No newline at end of file From 67f8ff8517815df78da322bd09b4b375226c2aed Mon Sep 17 00:00:00 2001 From: Alex Garcia Date: Fri, 15 Nov 2024 11:02:10 -0800 Subject: [PATCH 4/7] v0.1.4 --- VERSION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/VERSION b/VERSION index 8bffce1..446ba66 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.1.4-alpha.3 \ No newline at end of file +0.1.4 \ No newline at end of file From 9780f6d445f63389f319fe3defa5ec0611952397 Mon Sep 
17 00:00:00 2001 From: Alex Garcia Date: Fri, 15 Nov 2024 11:17:55 -0800 Subject: [PATCH 5/7] bump dist to fix linux arm builds --- .github/workflows/release.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml index a977b38..dd88811 100644 --- a/.github/workflows/release.yaml +++ b/.github/workflows/release.yaml @@ -255,7 +255,7 @@ jobs: name: sqlite-vec-iossimulator-x86_64-extension path: dist/iossimulator-x86_64 - run: | - curl -L https://github.com/asg017/sqlite-dist/releases/download/v0.0.1-alpha.16/sqlite-dist-x86_64-unknown-linux-gnu.tar.xz \ + curl -L https://github.com/asg017/sqlite-dist/releases/download/v0.0.1-alpha.17/sqlite-dist-x86_64-unknown-linux-gnu.tar.xz \ | tar xfJ - --strip-components 1 - run: make sqlite-vec.h - run: ./sqlite-dist ./sqlite-dist.toml --input dist/ --output distx/ --version $(cat VERSION) From 5183ab4b345f39a526620812c19340d673a43696 Mon Sep 17 00:00:00 2001 From: Alex Garcia Date: Fri, 15 Nov 2024 11:18:19 -0800 Subject: [PATCH 6/7] v0.1.5-alpha.1 --- VERSION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/VERSION b/VERSION index 446ba66..8a0b646 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.1.4 \ No newline at end of file +0.1.5-alpha.1 \ No newline at end of file From ee3654701f7b8efe4802ff1caed24514f43443dd Mon Sep 17 00:00:00 2001 From: Alex Garcia Date: Fri, 15 Nov 2024 11:22:50 -0800 Subject: [PATCH 7/7] v0.1.5 --- VERSION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/VERSION b/VERSION index 8a0b646..def9a01 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.1.5-alpha.1 \ No newline at end of file +0.1.5 \ No newline at end of file