Using Chroma for Embeddings Search

This notebook takes you through a simple flow to download some data, embed it, and then index and search it using the Chroma vector database. This is a common requirement for customers who want to store and search our embeddings alongside their own data in a secure environment, to support production use cases such as chatbots, topic modelling and more.

What is a Vector Database

A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.
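To make "searching embedding vectors" concrete, here is a minimal brute-force sketch in plain Python. It is not how Chroma or any other vector database is implemented (they use indexes to avoid comparing against every vector); the toy vectors and the names `cosine_similarity`, `top_k` and `store` are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
store = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "tax law": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], store, k=2)
print(nearest)
```

A vector database does the same job at scale, adding persistence, indexing and filtering on top of the similarity search.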

Why use a Vector Database

Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example) and run them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security concerns hold them back from going into production. We see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.

Demo Flow

The demo flow is:

  • Setup: Import packages and set any required variables
  • Load data: Load a dataset and embed it using OpenAI embeddings
  • Chroma:
    • Setup: Here we'll set up the Python client for Chroma. For more details, see the Chroma documentation
    • Index Data: We'll create collections with vectors for titles and content
    • Search Data: We'll run a few searches to confirm it works

Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases making use of our embeddings.

Setup

Import the required libraries and set the embedding model that we'd like to use.
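A minimal setup sketch might look like the following. The model name `text-embedding-3-small` and the `OPENAI_API_KEY` environment variable are assumptions, since the text does not say which model or configuration the notebook uses; `get_openai_api_key` is a hypothetical helper.

```python
import os
from typing import Optional

# Assumed embedding model; swap in whichever model produced your vectors.
EMBEDDING_MODEL = "text-embedding-3-small"


def get_openai_api_key(env_var: str = "OPENAI_API_KEY") -> Optional[str]:
    """Return the API key from the environment, or None if it is not set."""
    return os.environ.get(env_var)


api_key = get_openai_api_key()
if api_key is None:
    # Without a key, Chroma cannot embed query text through OpenAI later on.
    print("OPENAI_API_KEY is not set; set it before running the search steps.")
```

Reading the key from the environment keeps it out of the notebook itself.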

Load data

In this section we'll load embedded data that we prepared prior to this session.
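As a rough sketch of this step, the snippet below parses a CSV in which each embedding was saved as a stringified list. The column names (`id`, `title`, `title_vector`, `content_vector`) and the inline sample are assumptions for illustration, and it uses the standard library rather than a dataframe so that it runs without extra dependencies.

```python
import csv
import io
from ast import literal_eval

# Assumed names of the columns that hold stringified embedding vectors.
VECTOR_COLUMNS = ("title_vector", "content_vector")


def load_embedded_rows(csv_file):
    """Read rows from an open CSV file and turn vector strings into lists of floats."""
    rows = []
    for row in csv.DictReader(csv_file):
        for column in VECTOR_COLUMNS:
            row[column] = literal_eval(row[column])
        rows.append(row)
    return rows


# Tiny stand-in for the prepared dataset; real vectors are much longer.
SAMPLE_CSV = '''id,title,title_vector,content_vector
1,April,"[0.1, 0.2]","[0.3, 0.4]"
2,August,"[0.5, 0.6]","[0.7, 0.8]"
'''

rows = load_embedded_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["title"], rows[0]["title_vector"])
```

With a real file you would pass `open(path, newline="")` instead of the `io.StringIO` sample.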

Chroma

We'll index these embedded documents in a vector database and search them. Here we use Chroma, an easy-to-use, open-source vector database that can run in memory or be self-hosted, and is designed for working with embeddings together with LLMs.

In this section, we will:

  • Instantiate the Chroma client
  • Create collections for each class of embedding
  • Query each collection

Instantiate the Chroma client

Create the Chroma client. By default, Chroma is ephemeral and runs in memory.
However, you can easily set up a persistent configuration which writes to disk.
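The two configurations can be sketched as below, assuming a recent `chromadb` release that provides `EphemeralClient` and `PersistentClient`. The helper `make_chroma_client` and its `persist_dir` parameter are hypothetical names introduced for illustration.

```python
from typing import Optional


def make_chroma_client(persist_dir: Optional[str] = None):
    """Return an in-memory client, or one that writes to persist_dir if given."""
    # Imported lazily so this helper can be defined before Chroma is installed.
    import chromadb

    if persist_dir is None:
        # Data lives only for the lifetime of the process.
        return chromadb.EphemeralClient()
    # Data is written to disk under persist_dir and reloaded on restart.
    return chromadb.PersistentClient(path=persist_dir)
```

Calling `make_chroma_client()` gives the ephemeral default; `make_chroma_client("./chroma_data")` would keep collections between runs (the directory name is only an example).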

Create collections

Chroma collections allow you to store and filter with arbitrary metadata, making it easy to query subsets of the embedded data.

Chroma is already integrated with OpenAI's embedding functions. The best way to use them is to pass an embedding function when constructing a collection.
Alternatively, you can 'bring your own embeddings'. More information can be found in the Chroma documentation.
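A sketch of creating one collection per class of embedding is shown below. It assumes `chromadb`'s `OpenAIEmbeddingFunction` and `get_or_create_collection`; the collection names `titles` and `content`, and both helper functions, are hypothetical.

```python
def make_openai_embedding_function(api_key, model_name="text-embedding-3-small"):
    """Build Chroma's OpenAI embedding function (model name is an assumption)."""
    # Imported lazily so this module loads even if Chroma is not installed.
    from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

    return OpenAIEmbeddingFunction(api_key=api_key, model_name=model_name)


def create_collections(client, embedding_function, names=("titles", "content")):
    """Create (or reuse) one collection per name, all sharing the embedding function."""
    return {
        name: client.get_or_create_collection(
            name=name, embedding_function=embedding_function
        )
        for name in names
    }
```

Attaching the embedding function at construction means later text queries against the collection are embedded with the same model.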

Populate the collections

Chroma collections allow you to populate, and filter on, whatever metadata you like. Chroma can also store the text alongside the vectors, and return everything in a single query call, when this is more convenient.
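For illustration only, storing text and metadata alongside the vectors and then filtering on that metadata could look like this. The row keys (`id`, `title`, `text`, `content_vector`) and both function names are assumptions; `add` and `query` with `metadatas`, `documents` and `where` are Chroma collection calls.

```python
def add_with_metadata(collection, rows):
    """Store vectors together with their text and a title metadata field."""
    collection.add(
        ids=[str(row["id"]) for row in rows],  # Chroma ids are strings
        embeddings=[row["content_vector"] for row in rows],
        metadatas=[{"title": row["title"]} for row in rows],
        documents=[row["text"] for row in rows],
    )


def query_by_title(collection, query_embedding, title, max_results=5):
    """Search only among items whose title metadata equals the given title."""
    return collection.query(
        query_embeddings=[query_embedding],
        n_results=max_results,
        where={"title": title},
    )
```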

For this use-case, we'll just store the embeddings and IDs, and use these to index the original dataframe.
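Storing only IDs and embeddings might be sketched as follows. The helper `add_embeddings` and its `batch_size` default are assumptions; batching is optional and simply keeps each `add` call small.

```python
def add_embeddings(collection, ids, embeddings, batch_size=1000):
    """Add precomputed embeddings to a collection in batches; return the batch count."""
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")
    batches = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            ids=[str(i) for i in ids[start:end]],  # Chroma ids are strings
            embeddings=embeddings[start:end],
        )
        batches += 1
    return batches
```

Because the IDs are kept as strings matching the source data, a search result can be mapped back to the original record by ID.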

Search the collections

Chroma handles embedding queries for you if an embedding function is set on the collection, as it is in this notebook.
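A search helper along these lines is one way to run a text query and join the hits back to the source records. The function `search` and the `rows_by_id` mapping (ID string to original row) are hypothetical; `query_texts`, `n_results` and `include` are parameters of Chroma's `collection.query`.

```python
def search(collection, query, rows_by_id, max_results=3):
    """Query a collection with text and return id, title and distance per hit."""
    results = collection.query(
        query_texts=[query],  # embedded by the collection's embedding function
        n_results=max_results,
        include=["distances"],
    )
    # Results are nested per query; we sent one query, so take the first list.
    ids = results["ids"][0]
    distances = results["distances"][0]
    return [
        {"id": doc_id, "title": rows_by_id[doc_id]["title"], "distance": distance}
        for doc_id, distance in zip(ids, distances)
    ]
```

If the collection was populated with precomputed vectors, the collection's embedding function should use the same model that produced them, otherwise query and stored vectors are not comparable.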

Now that you've got a basic embeddings search running, you can hop over to the Chroma docs to learn more about how to add filters to your query, update/delete data in your collections, and deploy Chroma.
