Investigate search with dense and sparse embedding #110

Open
svenseeberg opened this issue Dec 22, 2024 · 7 comments
Labels
component:chat Chat Back End enhancement New feature or request prio:medium

Comments

@svenseeberg
Member

svenseeberg commented Dec 22, 2024

Investigate whether OpenSearch is an option for combined sparse and dense vector search. txtai is another option.

Alternatively, we can use PostgreSQL with pgvector. A very simple SQL setup can look like this:

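CREATE DATABASE document_embeddings;
\c document_embeddings
-- The vector type and the hnsw index method come from the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;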
CREATE TABLE document_chunks (id SERIAL PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL, url VARCHAR(512), embedding vector(384), sparse_embedding vector(1024));
CREATE TABLE documents (id SERIAL PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL, url VARCHAR(512));
CREATE INDEX ON document_chunks USING hnsw (embedding vector_l2_ops);
CREATE INDEX ON document_chunks USING hnsw (sparse_embedding vector_l2_ops);
CREATE INDEX idx_documents_url_btree ON documents (url);
CREATE INDEX idx_chunks_url_btree ON document_chunks (url);
CREATE USER document_embeddings WITH ENCRYPTED PASSWORD 'CHANGEME';
GRANT ALL PRIVILEGES ON DATABASE document_embeddings TO document_embeddings;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO document_embeddings;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO document_embeddings;

And a retrieval can then be done with the following query:

SELECT *, ({a} * distance + (1 - {a}) * sparse_distance) AS total_distance
FROM (
    SELECT d.url, d.title, d.content,
           MIN(c.embedding <-> '{embedding}') AS distance,
           MIN(c.sparse_embedding <-> '{sparse_vector}') AS sparse_distance
    FROM document_chunks c
    LEFT JOIN documents d ON c.url = d.url
    GROUP BY d.url, d.title, d.content
) AS per_document
ORDER BY total_distance ASC
LIMIT 10;
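For intuition, the scoring done by the query above can be reproduced in plain Python. This is only a sketch under stated assumptions: the per-chunk distances are taken as already computed, and the names `Chunk`, `rank_documents` and `alpha` (standing in for `{a}`) are hypothetical and not part of any existing code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str                # document the chunk belongs to
    dense_distance: float   # stands in for c.embedding <-> query embedding
    sparse_distance: float  # stands in for c.sparse_embedding <-> query sparse vector


def rank_documents(chunks, alpha, limit=10):
    """Mirror the SQL: per-document MIN of each distance, then a weighted sum."""
    best_dense = defaultdict(lambda: float("inf"))
    best_sparse = defaultdict(lambda: float("inf"))
    for chunk in chunks:
        best_dense[chunk.url] = min(best_dense[chunk.url], chunk.dense_distance)
        best_sparse[chunk.url] = min(best_sparse[chunk.url], chunk.sparse_distance)
    totals = {
        url: alpha * best_dense[url] + (1 - alpha) * best_sparse[url]
        for url in best_dense
    }
    # Smallest combined distance first, like ORDER BY total_distance ASC LIMIT n.
    return sorted(totals.items(), key=lambda item: item[1])[:limit]
```

As in the SQL, the two minima may come from different chunks of the same document, so a document can rank well even if no single chunk is close on both measures.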
@svenseeberg svenseeberg added the component:chat Chat Back End label Dec 22, 2024
@svenseeberg svenseeberg added the enhancement New feature or request label Dec 22, 2024
@svenseeberg svenseeberg changed the title from "Investigate OpenSearch with dense and sparse embedding" to "Investigate search with dense and sparse embedding" Dec 22, 2024
@svenseeberg
Member Author

OpenSearch supports multiple embedding models: https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/

@svenseeberg
Member Author

OpenSearch Salt state: https://git.tuerantuer.org/DF/salt/pulls/290

@svenseeberg
Member Author

svenseeberg commented Jan 13, 2025

Basic Settings

Request

curl https://localhost:9200/_cluster/settings -X PUT -ku admin:$SECRET -H "Content-Type: application/json" -d '{
  "persistent": {
    "plugins.ml_commons.only_run_on_ml_node": "false",
    "plugins.ml_commons.model_access_control_enabled": "true",
    "plugins.ml_commons.native_memory_threshold": "99"
  }
}'

Response

{"acknowledged":true,"persistent":{"plugins":{"ml_commons":{"only_run_on_ml_node":"false","model_access_control_enabled":"true","native_memory_threshold":"99"}}},"transient":{}}

Create model group

curl https://localhost:9200/_plugins/_ml/model_groups/_register -X POST -ku admin:$SECRET -H "Content-Type: application/json" -d '{
  "name": "integreat-chat-2025-01-13",
  "description": "paraphrase-multilingual-MiniLM-L12-v2 and opensearch-neural-sparse-encoding-doc-v2-distill"
}'

Response

{"model_group_id":"6_EiYJQB4cWsKJW0CM0k","status":"CREATED"}

Register & Deploy Dense Embedding Model

Request

curl https://localhost:9200/_plugins/_ml/models/_register -X POST -ku admin:$SECRET -H "Content-Type: application/json" -d '{
  "name": "huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
  "version": "1.0.1",
  "model_group_id": "6_EiYJQB4cWsKJW0CM0k",
  "model_format": "TORCH_SCRIPT"
}'

Response

{"task_id":"7fE7YJQB4cWsKJW0VM0g","status":"CREATED"}

Request (get Task status)

curl https://localhost:9200/_plugins/_ml/tasks/7fE7YJQB4cWsKJW0VM0g -ku admin:$SECRET

Response

{"model_id":"7vE7YJQB4cWsKJW0Vs3_","task_type":"REGISTER_MODEL","function_name":"TEXT_EMBEDDING","state":"COMPLETED","worker_node":["1MZgmfuYQReNXa9WWuoh8w"],"create_time":1736781288394,"last_update_time":1736781304060,"is_async":true}

Request (deploy model)

curl https://localhost:9200/_plugins/_ml/models/7vE7YJQB4cWsKJW0Vs3_/_deploy -X POST -ku admin:$SECRET

Response

{"task_id":"9fHiYJQB4cWsKJW0is0L","task_type":"DEPLOY_MODEL","status":"CREATED"}
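Registering and deploying are asynchronous: each call only returns a task ID, and the task has to be polled until it finishes. A minimal sketch of such a polling loop follows. `fetch_task` is a hypothetical callable that performs the `GET /_plugins/_ml/tasks/<task_id>` request shown above and returns the parsed JSON. Only the `COMPLETED` state appears in this thread; treating any state containing `FAILED` or `ERROR` as terminal is an assumption.

```python
import time


def wait_for_task(fetch_task, task_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll a task until its state is COMPLETED and return the task document.

    fetch_task is a hypothetical callable: task_id -> dict (parsed task JSON).
    """
    waited = 0.0
    while True:
        task = fetch_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task
        # Assumption: any state mentioning FAILED or ERROR will not recover.
        if state and ("FAILED" in state or "ERROR" in state):
            raise RuntimeError(f"task {task_id} ended in state {state}: {task.get('error')}")
        if waited >= timeout:
            raise TimeoutError(f"task {task_id} still in state {state} after {timeout}s")
        sleep(interval)
        waited += interval
```

For a `REGISTER_MODEL` task, the returned document carries the `model_id` that the subsequent `_deploy` call needs.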

Register & Deploy Sparse Embedding Model

Request

curl https://localhost:9200/_plugins/_ml/models/_register -X POST -ku admin:$SECRET -H "Content-Type: application/json" -d '{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill",
  "version": "1.0.0",
  "model_group_id": "6_EiYJQB4cWsKJW0CM0k",
  "model_format": "TORCH_SCRIPT"
}'

Response

{"task_id":"7_FbYJQB4cWsKJW0Qc36","status":"CREATED"}

Request (get task status)

curl https://localhost:9200/_plugins/_ml/tasks/7_FbYJQB4cWsKJW0Qc36 -ku admin:$SECRET

Response

{"model_id":"8PFbYJQB4cWsKJW0Rc1N","task_type":"REGISTER_MODEL","function_name":"SPARSE_ENCODING","state":"COMPLETED","worker_node":["1MZgmfuYQReNXa9WWuoh8w"],"create_time":1736783380985,"last_update_time":1736783401883,"is_async":true}

Request (deploy model)

curl https://localhost:9200/_plugins/_ml/models/8PFbYJQB4cWsKJW0Rc1N/_deploy -X POST -ku admin:$SECRET

Response

{"task_id":"9vHjYJQB4cWsKJW0Xc3e","task_type":"DEPLOY_MODEL","status":"CREATED"}

Test

curl https://localhost:9200/_plugins/_ml/_predict/text_embedding/7vE7YJQB4cWsKJW0Vs3_ -X POST -ku admin:$SECRET -H "Content-Type: application/json" -d '{
  "text_docs":[ "today is sunny"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}'

Response

{"inference_results":[{"output":[{"name":"sentence_embedding","data_type":"FLOAT32","shape":[384],"data":[0.22390819,0.515767,0.09902048,0.09121265,0.32544675,0.03454467,0.6585519,0.016655957,-0.3504938,0.10551805,0.027138082,-0.20051809,-0.08055397,0.35112453,-0.012349027,0.31457418,0.31982544,-0.16804285,-0.080698006,-0.22839361,-0.30455396,0.09744328,-0.5864513,-0.19503896,-0.120758764,-0.18773496,-0.22601064,-0.07164346,-0.10879912,0.16709262,-0.11501334,0.15688893,0.119460255,0.27520448,-0.10502234,0.121769704,0.22505397,-0.40116164,0.00848148,0.18667941,-0.013696276,-0.52726024,-0.09182314,0.37179697,-0.039704066,-0.019781658,-0.021398619,0.08480593,-0.016815694,0.2195061,0.25920892,-0.29929826,-0.47066906,-0.30982977,0.18248145,0.38245273,-0.2591124,-0.051396165,0.11273831,0.13533634,0.26303342,-0.24583699,-0.31428668,-0.30836853,-0.22909619,-0.31774697,-0.4039534,-0.7413893,-0.3228089,-0.58639437,-0.19537221,0.35631323,-0.11397008,-0.29998046,-0.3370001,-0.21904165,0.09691813,-0.022503292,-0.11394748,-0.48575506,0.21871702,-0.47531763,-0.040260583,-0.16060327,0.3851172,0.33334723,-0.110262536,-2.870808E-4,0.2270989,-0.00445262,0.018725878,-0.0276115,-0.39532563,-0.05454905,0.18757717,-0.01399631,-0.17815189,0.1292559,0.030375598,1.0798494,0.016931212,-0.26404852,-0.038519755,-0.15743043,0.04174735,0.05036193,-0.19863842,0.45015693,-0.1706505,-0.49011454,0.09451703,0.0049501266,0.124470554,-0.029390939,-0.27820858,-0.2664856,0.09396374,-0.0025631823,-0.2115423,0.07835799,0.13618712,0.088956825,0.18772452,-0.069695376,0.052557755,-0.32685843,0.43653098,0.13146146,0.08075389,0.69389266,0.26525652,0.124006905,0.12739356,-0.108366095,-0.23168743,-0.4893107,0.00462123,-0.028336106,0.16456765,-0.29361942,-0.05880952,0.5048212,-0.22076581,-0.19807555,0.58607787,-0.2506689,-0.29485473,0.5081113,0.0045077815,-0.18812597,-0.40864238,0.13593519,-0.21333827,0.13404264,0.10695613,0.123311125,0.42673185,0.20131981,0.49397847,-0.3975908,0.44541478,0.10301908,0.0095197875,0.008936979,-0.43452987,-0.0398502,0.023495262,0.4323562,-0.13917892,0.21397902,0.02237685,0.02506446,0.11135397,0.47648242,0.04631178,0.21962188,0.22125395,0.16389164,-0.17749937,-0.15562147,0.10902631,0.14783536,-0.056456592,-0.053764064,0.35493946,0.1444771,-0.20078687,0.11667514,0.24902378,-0.22929461,-0.376446,-0.691508,-0.25572574,-0.123766266,-0.18672617,0.21955581,-0.061512846,0.06605246,0.7560418,0.48069715,0.19175966,-0.3966469,0.72840977,0.39168215,0.7269787,0.42715928,-0.06667984,0.13541268,-0.10306603,0.40818474,0.38862133,-0.29606968,-0.07492902,0.3295484,-0.372991,0.67934704,-0.38757625,-0.091552764,-0.3955181,0.25908208,0.315939,-0.21820517,-0.0355003,-0.29630104,0.086314045,0.3341446,-0.76088554,0.24999672,0.05439572,0.3689758,0.22232993,0.1743294,-0.7385643,0.103461094,0.13335179,-0.19128257,-0.47008264,-0.10770052,-0.21769132,0.10357955,-0.15617424,0.13861407,-0.13526891,0.3617426,0.22107899,-0.19064347,-0.05481999,0.4817867,-0.1142092,0.10458332,-0.15367113,-0.02457881,-0.15232392,-0.22534658,-0.3172125,-0.15715475,-0.27059197,-0.14088397,-0.33354947,0.18303652,-0.036391575,-0.74038357,-0.22064261,0.2398618,0.10131067,-0.43316448,0.085274875,0.105638616,-0.047255825,0.3308979,-0.027554588,0.22489123,-0.6834402,0.0697661,-0.007207572,-0.32284874,-0.20076239,0.25151315,0.66055757,0.098818086,-0.29000664,-0.23169462,-0.13474114,0.07290258,-0.07718397,-0.25846946,0.2855838,0.48900378,0.20554863,-0.12921195,0.2327072,0.71542984,0.19717611,-0.20956303,-0.36730692,-0.07922702,0.11221393,-0.10331569,0.20768988,0.32281175,0.27367318,-0.024014568,0.36041388,0.2277113,-0.36075315,-0.34618685,0.117212474,-0.15960751,-0.31749764,-0.5478583,-0.11032575,-0.09749179,0.06152777,0.059870426,0.041650664,0.37712124,-0.697628,0.106907345,-0.011202368,-0.2880945,-0.07412298,-0.043658722,-0.9908965,0.057711214,-0.039407995,0.12346491,0.6785474,-0.14456493,0.066613264,-0.41864872,-0.2714641,-0.21479945,-0.085886836,-0.31146118,0.81733584,0.14995189,-0.018384822,-0.24425353,0.08844334,0.23142354,0.21220885,0.27304825,0.06194194,-0.19048585,0.20674665,0.14248197,-0.2595313,-0.086612426,0.16595031,0.21512508,0.13536172,0.20607084,-0.45519856,-0.73534226,-0.18825561,0.27604502,-0.04403585,0.1744489,-0.14719532,0.15016043,0.11614513,0.31122336,0.0155470595,0.3421605,0.21801989,-0.48239088,0.35722795,-0.1947298,0.021791764,-0.12963447,-0.32616618,0.0053886715,0.47204447,0.13764492,-0.3612307,-0.3439444,0.06707043,-0.14751774,0.44010255,-0.21332991,0.016615273,0.27413967,-0.43965015,0.5342018]}]}]}
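A response like this can be sanity-checked programmatically rather than by eye. The following sketch pulls the vector out of the predict response and verifies that its length matches the declared `shape`. The function name `extract_embedding` is hypothetical; the JSON keys are the ones visible in the response above.

```python
import json
import math


def extract_embedding(response_text, name="sentence_embedding"):
    """Return the vector called `name` from a text_embedding predict response."""
    response = json.loads(response_text)
    for result in response["inference_results"]:
        for output in result["output"]:
            if output["name"] == name:
                data = output["data"]
                expected = math.prod(output["shape"])
                if len(data) != expected:
                    raise ValueError(f"expected {expected} values, got {len(data)}")
                return data
    raise KeyError(name)
```

For the response above, the returned list should have 384 entries, which is also the dimension an index storing these vectors would need.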

@svenseeberg
Member Author

Next: create index with ingestion pipeline: https://opensearch.org/docs/latest/search-plugins/semantic-search/
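As a rough sketch of what that next step could look like, the two request bodies described in the linked semantic search documentation can be built as follows. The field names `content` and `content_embedding`, the pipeline name, and the helper names are all hypothetical; the dimension of 384 is taken from the `shape` in the test response above.

```python
import json


def ingest_pipeline_body(model_id, text_field="content", vector_field="content_embedding"):
    """Body for PUT /_ingest/pipeline/<name>: embed text_field into vector_field."""
    return {
        "description": "Embed chunk text at ingest time",
        "processors": [
            {"text_embedding": {"model_id": model_id, "field_map": {text_field: vector_field}}}
        ],
    }


def index_body(pipeline_name, dimension, text_field="content", vector_field="content_embedding"):
    """Body for PUT /<index>: a k-NN enabled index that runs the pipeline by default."""
    return {
        "settings": {"index.knn": True, "default_pipeline": pipeline_name},
        "mappings": {
            "properties": {
                text_field: {"type": "text"},
                vector_field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


if __name__ == "__main__":
    # Model ID of the dense model registered above; the pipeline name is made up.
    print(json.dumps(ingest_pipeline_body("7vE7YJQB4cWsKJW0Vs3_"), indent=2))
    print(json.dumps(index_body("chunk-embedding-pipeline", 384), indent=2))
```

Both bodies would be sent with the same `curl ... -X PUT` pattern used for the cluster settings earlier in this thread.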

@freinold

May I also suggest intfloat/multilingual-e5-large as a candidate for the dense embedding model?

Apparently registering a custom model is quite easy when it already has an ONNX quantisation:
https://opensearch.org/docs/latest/ml-commons-plugin/custom-local-models/

@svenseeberg
Member Author

svenseeberg commented Jan 14, 2025

@freinold thanks. As intfloat/multilingual-e5-large supports more languages, this is probably (definitely) the model we should use.

I also just realized that we don't really need a sparse embedding, as OpenSearch/Elasticsearch do keyword-based searches by default. No need to add another layer for that (https://opensearch.org/docs/latest/search-plugins/hybrid-search/).
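A hedged sketch of what such a hybrid setup could look like, following the structure in the linked hybrid search documentation: a search pipeline that normalizes and combines scores, plus a `hybrid` query with one keyword and one neural sub-query. The field names, the weights and the helper names are assumptions for illustration, and `model_id` would be whatever ID the chosen dense model gets on registration.

```python
def search_pipeline_body(keyword_weight=0.3, dense_weight=0.7):
    """Body for PUT /_search/pipeline/<name>: normalize and combine sub-query scores."""
    if abs(keyword_weight + dense_weight - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1.0")
    return {
        "description": "Post processor for hybrid search",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        # Weights follow the order of the sub-queries below.
                        "parameters": {"weights": [keyword_weight, dense_weight]},
                    },
                }
            }
        ],
    }


def hybrid_query_body(query_text, model_id, text_field="content",
                      vector_field="content_embedding", k=10):
    """Body for GET /<index>/_search?search_pipeline=<name>."""
    return {
        "query": {
            "hybrid": {
                "queries": [
                    # Keyword sub-query, scored by the default similarity.
                    {"match": {text_field: {"query": query_text}}},
                    # Dense sub-query, embedded by the deployed model at search time.
                    {"neural": {vector_field: {"query_text": query_text,
                                               "model_id": model_id, "k": k}}},
                ]
            }
        }
    }
```

The weights play the same role as `{a}` in the PostgreSQL query from the first comment, except that here they are applied to normalized scores rather than raw distances.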

@freinold

Perfect, I also just found out that OS/ES even use the BM25 algorithm I recommended by default:
https://opensearch.org/docs/latest/search-plugins/keyword-search/
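For readers unfamiliar with BM25, here is a small self-contained sketch of the classic Okapi formulation, operating on pre-tokenized documents. It is meant for illustration only: the function name is hypothetical, and the exact variant that Lucene (and therefore OpenSearch) implements may differ in constant factors. The defaults `k1=1.2` and `b=0.75` are the commonly used ones.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with classic Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents containing each term at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents need more hits for the same score.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Rare terms get a higher `idf`, repeated terms saturate through `k1`, and `b` controls how strongly document length is penalized.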
