This workshop is designed to help you get started with vector search using Couchbase and Node.js. We will be using the Couchbase Node.js SDK and Couchbase Capella managed database service.
The entire workshop will be run from inside a GitHub Codespace, which is a cloud-based development environment that is pre-configured with all the necessary tools and services. You don't need to install anything on your local machine.
Important
Key information needed for running this workshop in GitHub Codespaces can be found here.
- A GitHub account
- A Couchbase Capella account
- Create a Capella Account
- Create a Couchbase Cluster
- Create a Bucket
- Transform Data
- Index Data
- Search Data
- Running in GitHub Codespaces
Want to give this a go but haven't had the chance to join an in-person or online Couchbase workshop yet? Follow along with this step-by-step guided video walkthrough of this workshop!
vector.search.workshop.walkthrough.mp4
Couchbase Capella is a fully managed database service that provides a seamless experience for developers to build modern applications. You can sign up for a free account at https://cloud.couchbase.com/signup.
Once you have created an account, you can create a new Couchbase cluster by following the steps below:
- Click on the "Create Cluster" button on the Capella dashboard.
- Choose a cloud provider, name and region for your cluster and click on the "Create Cluster" button.
After creating a cluster, you can create a new bucket by following the steps below:
- Click on the "+ Create" button from inside the cluster dashboard.
- Define the options for your bucket and click on the "Create" button.
Before we can index and search data, we need to transform it into a format that can be used by the vector search engine. We will be using Couchbase Vector Search for this workshop.
There are two options in this workshop to generate vector embeddings from data:
- Use the
/embed
endpoint provided in this repository to transform the data. You need an OpenAI API key to use this option. - Import directly the data with already generated embeddings into the Couchbase bucket. You can use the data provided in the
./data/individual_items_with_embedding
directory.
This workshop gives you the flexibility to choose between generating embeddings locally or using the OpenAI API.
- If you have pre-generated embeddings (provided in the repository), you can use the
useLocalEmbedding
flag to avoid using the OpenAI API. - If you want to generate embeddings dynamically from the text, you need to provide your OpenAI API key and set the
useLocalEmbedding
flag tofalse
.
In the .env
file, set the USE_LOCAL_EMBEDDING
flag to control the mode:
USE_LOCAL_EMBEDDING=true
true
: Use pre-generated embeddings (no OpenAI API key required).false
: Use OpenAI API to generate embeddings (OpenAI API key required).
Make sure to set the OPENAI_API_KEY
in the .env
file if you set USE_LOCAL_EMBEDDING
to false
.
OPENAI_API_KEY=your_openai_api_key
Follow the instructions below for the option you choose.
Provided in this repository is an Express.js application that will expose a /embed
endpoint to transform the data.
The Codespace environment already has all the dependencies installed. You can start the Express.js application by running the following command:
node server.js
The repository also has a sample set of data in the ./data/individual_items
directory. You can transform this data by making a POST
request to the /embed
endpoint providing the paths to the data files as an array in the request body.
curl -X POST http://localhost:3000/embed -H "Content-Type: application/json" -d '["./data/data1.json", "./data/data2.json"]'
The data has now been converted into vector embeddings and stored in the Couchbase bucket that you created earlier.
If you choose to import the data directly, you can use the data provided in the ./data/individual_items_with_embedding
directory. The data is already in the format required to enable vector search on it.
Once you have opened this repository in a GitHub Codespace, you can import the data with the generated embeddings using the Couchbase shell from the command line.
First, edit the ./config_file/config
file with your Couchbase Capella information.
You can find a pre-filled config file in the Couchbase Capella dashboard under the "Connect" tab.
Once you click on the "Connect" tab, you will see a section called "Couchbase Shell" among the options on the left-hand menu. You can choose the access credentials for the shell and copy the config file content provided and paste it in the ./config_file/config file.
Make sure to add the default-bucket
field to the config file with the name of the bucket you created earlier.
default-bucket = "your-bucket-name"
You can find an example config file in the ./config_file
directory for reference.
Change into the directory where the data files with embeddings are:
cd data/individual_items_with_embedding
Open up Couchbase shell passing in an argument with the location of the config file defining your Couchbase information:
cbsh --config-dir ../../config-file
Once in the shell, run the nodes
command to just perform a sanity check that you are connected to the correct cluster.
> nodes
This should output something similar to the following:
╭───┬───────────┬────────────────┬─────────┬──────────────────────────┬───────────────────────┬───────────────────────────┬──────────────┬─────────────┬─────────╮
│ # │ cluster │ hostname │ status │ services │ version │ os │ memory_total │ memory_free │ capella │
├───┼───────────┼────────────────┼─────────┼──────────────────────────┼───────────────────────┼───────────────────────────┼──────────────┼─────────────┼─────────┤
│ 0 │ dev.local │ 127.0.0.1:8091 │ healthy │ search,indexing,kv,query │ 8.0.0-1246-enterprise │ x86_64-apple-darwin19.6.0 │ 34359738368 │ 12026126336 │ false │
╰───┴───────────┴────────────────┴─────────┴──────────────────────────┴───────────────────────┴───────────────────────────┴──────────────┴─────────────┴─────────╯
Now, import the data into the bucket you created earlier:
ls *_with_embedding.json | each { |it| open $it.name | wrap content | insert id $in.content._default.name } | doc upsert
While, in this workshop we are focused on creating a vector search index, you can also create a primary search index to enable full-text search on the data:
query "create primary index on name_of_your_bucket._default._default"
Make sure to replace the name_of_your_bucket
with the name of your bucket you created.
Once this is done, you can perform a sanity check to ensure the documents were inserted by running a query to select just one:
query "select * from name_of_your_bucket._default._default limit 1"
Replace the name_of_your_bucket
with the name of your bucket you created.
Once the vector embeddings have been stored in the Couchbase bucket, we can create a vector search index to enable similarity search.
You will use Couchbase Shell to perform this action as well.
Run the following command from inside the shell:
vector create-index --bucket name_of_your_bucket --similarity-metric dot_product vector-search-index embedding 1536
Replace the name_of_your_bucket
with the name of your bucket you created.
You can perform a santity check to ensure the index was created by querying for all the indexes and you should see the vector_search_index
in the list:
query indexes
Now that the data has been indexed, you can perform similarity searches using the vector search index.
You can use the /search
endpoint provided in this repository to search for similar items based on a query item. The endpoint will return the top 5 most similar items.
The Codespace environment already has all the dependencies installed. You can start the Express.js application by running the following command:
node server.js
Once the server is running, you can either search using the provided query with the embedding already generated or you can provide your own query item.
You can search for similar items based on the provided query item by making a POST
request to the /search
endpoint.
Here is an example cURL
command to search for similar items based on the provided query item:
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{"q": "", "useLocalEmbedding": true}'
As you can see, we use the useLocalEmbedding
flag to indicate that we want to use the provided query item and we keep the q
field empty.
If you want to search for similar items based on your own query item, you can provide the query item in the request body.
The query will be automatically converted into a vector embedding using the OpenAI API. You need to provide your OpenAI API key in the .env file
before starting the Express.js application.
Here is an example cURL command to search for similar items based on your own query item:
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{"q": "your_query_item"}'
When working in a GitHub Codespaces environment, there are some differences to be aware of, especially for the Couchbase Shell commands.
- The
cbsh
binary is not available in yourPATH
by default in Codespaces. Instead, you can find it in the following directory within your workspace:
couchbase-shell/target/debug/cbsh
To use cbsh
in your Codespace, provide the full path wwhen running commands. For example:
./couchbase-shell/target/debug/cbsh --config-dir /path/to/config-file
You can also create an alias for the cbsh
binary to make it easier to use:
alias cbsh="./couchbase-shell/target/debug/cbsh"
This allows you to run cbsh
commands without specifying the full path.
Other than that, Codespaces comes pre-configured with all the dependencies necessary to run this workshop.