Prompt Chain

Build Semantic Search with MongoDB Atlas Vector Search

Jupyter notebook tutorial demonstrating semantic search with OpenAI embeddings and MongoDB Atlas vector search, including environment setup, embedding

Works with OpenAI, MongoDB

Spark score: 91 out of 100
Updated 3 months ago
Version 1.0.0

Why it matters

Leverage OpenAI embeddings and MongoDB Atlas Vector Search to build a powerful semantic search application. This asset guides you through setting up your environment, generating embeddings, and querying your data for intent-based search results.

Outcomes

What it gets done

01 Set up a MongoDB Atlas cluster and an OpenAI API key.

02 Generate vector embeddings for your data using OpenAI.

03 Create and configure a vector search index in MongoDB Atlas.

04 Perform semantic searches on your data using vector similarity.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-semanticsearchusingmongodbatlasvectorsearch | bash

Steps

Steps in the chain

01
Step 1: Set up the environment

Create a MongoDB Atlas cluster (version 6.0.11 or higher) and obtain an OpenAI API key. Set up a database user and network connection rules for your MongoDB cluster. Load the sample_mflix dataset using the Atlas UI, which contains a movies collection with fields like title, plot, genres, cast, and directors.

02
Step 2: Set up the embeddings generation function

Create a function to generate embeddings using the OpenAI embeddings API endpoint.

03
Step 3: Create and store embeddings

Execute an operation to create vector embeddings for the 'plot' field in each movie document and store them in the database. This adds an 'embedding' field to each document in the movies collection. Restrict to 500 documents for efficiency, or use the pre-populated sample_mflix.embedded_movies collection.

04
Step 4: Create a vector search index

Create an Atlas Vector Search index on the collection to enable approximate KNN search. Use the JSON index definition with 1536 dimensions (matching OpenAI text-embedding-ada-002), dotProduct similarity, and the knnVector type. Create the index via the Atlas UI or programmatically using the pymongo driver.

05
Step 5: Query your data

Execute semantic search queries to find movies with semantically similar plots to your query string, rather than keyword-based results.

Overview

What it does

A step-by-step notebook tutorial for building a semantic search application using MongoDB Atlas vector search and OpenAI embeddings, demonstrated on a movie plot dataset.

How it connects

Use this when you need to implement semantic search functionality that retrieves results based on meaning and intent rather than exact keyword matches, particularly when working with MongoDB Atlas and OpenAI's embedding models.

Source README

This notebook demonstrates how to build a semantic search application using OpenAI and MongoDB Atlas vector search.

Step 1: Set up the environment

There are two prerequisites:

  1. MongoDB Atlas cluster: To create a forever-free MongoDB Atlas cluster, first create a MongoDB Atlas account if you don't already have one: visit the MongoDB Atlas website and click “Register.” Then go to the MongoDB Atlas dashboard and set up your cluster. To use the $vectorSearch operator in an aggregation pipeline, the cluster must run version 6.0.11 or higher. This tutorial can be built using a free cluster. While setting up your deployment, you’ll be prompted to create a database user and rules for your network connection. Save your username and password somewhere safe, and make sure the IP address rules are in place so that you can connect to the cluster. If you need more help getting started, check out our tutorial on MongoDB Atlas.

  2. OpenAI API key: To create your OpenAI key, you'll need to create an account. Once you have one, visit the OpenAI platform, click your profile icon in the top right of the screen to open the dropdown menu, and select “View API keys”.

Note: After executing the step above you will be prompted to enter the credentials.

For this tutorial, we will be using the MongoDB sample dataset. Load the sample dataset using the Atlas UI. We'll be using the “sample_mflix” database, which contains a “movies” collection where each document contains fields like title, plot, genres, cast, directors, etc.
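The setup described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's own cell: the helper names read_secret and get_movies_collection and the environment variable name MONGODB_ATLAS_CLUSTER_URI are assumptions made for illustration. It assumes the pymongo package is installed and that the Atlas connection string comes from your cluster's "Connect" dialog.

```python
import getpass
import os


def read_secret(env_name: str, prompt: str) -> str:
    """Return a secret from an environment variable, prompting if it is unset."""
    value = os.environ.get(env_name)
    if value:
        return value
    # getpass hides the typed value, so the secret does not end up in notebook output.
    return getpass.getpass(prompt)


def get_movies_collection(mongodb_uri: str):
    """Connect to the cluster and return the sample_mflix.movies collection."""
    # Imported lazily so read_secret() works even without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(mongodb_uri)
    return client["sample_mflix"]["movies"]
```

A typical call would be get_movies_collection(read_secret("MONGODB_ATLAS_CLUSTER_URI", "Atlas connection string: ")). Reading secrets from the environment first keeps credentials out of the notebook file itself.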

Step 2: Set up the embeddings generation function
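
One possible shape for such a function is sketched below, assuming the openai Python package (version 1.x) and an OPENAI_API_KEY environment variable. The function name generate_embedding, the optional client parameter, and the default model are illustrative choices rather than the notebook's exact code. The default model produces 1536-dimensional vectors, which matches the index definition used later.

```python
def generate_embedding(text: str, client=None,
                       model: str = "text-embedding-3-small") -> list[float]:
    """Return the embedding vector for `text` from the OpenAI embeddings endpoint."""
    if client is None:
        # Imported lazily; OpenAI() reads OPENAI_API_KEY from the environment.
        from openai import OpenAI

        client = OpenAI()
    response = client.embeddings.create(model=model, input=text)
    # One input string yields one item in response.data.
    return response.data[0].embedding
```

Accepting a client argument is a design choice that lets you reuse a single client across many calls and substitute a stub when testing without network access.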

Step 3: Create and store embeddings

Each document in the sample dataset sample_mflix.movies corresponds to a movie. We will execute an operation that creates a vector embedding for the data in the "plot" field and stores it in the database. Creating vector embeddings with the OpenAI embeddings endpoint is necessary for performing a similarity search based on intent.

After this operation runs, each document in the "movies" collection will contain an additional "embedding" field, as defined by the EMBEDDING_FIELD_NAME variable, alongside the existing fields such as title, plot, genres, cast, and directors.

Note: We are restricting this to just 500 documents in the interest of time. Running it over the entire dataset of 23,000+ documents in the sample_mflix database will take a little while. Alternatively, you can use the sample_mflix.embedded_movies collection, which includes a pre-populated plot_embedding field containing embeddings created with OpenAI's text-embedding-3-small embedding model; you can use that field directly with Atlas Vector Search.
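
The store step might look something like the sketch below. It is an illustration under stated assumptions rather than the notebook's own cell: embed_plots is a hypothetical helper, collection is assumed to be a pymongo Collection for sample_mflix.movies, and embed is any callable that maps a plot string to a list of floats.

```python
EMBEDDING_FIELD_NAME = "embedding"


def embed_plots(collection, embed, limit: int = 500) -> int:
    """Add an embedding of the "plot" field to up to `limit` documents."""
    updated = 0
    # Only documents that actually have a plot can be embedded.
    cursor = collection.find({"plot": {"$exists": True}}).limit(limit)
    for doc in cursor:
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {EMBEDDING_FIELD_NAME: embed(doc["plot"])}},
        )
        updated += 1
    return updated
```

This version issues one update per document for clarity. For larger runs, batching the writes (for example with pymongo's bulk_write) would reduce round trips.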

Step 4: Create a vector search index

We will create an Atlas Vector Search index on this collection, which allows us to perform the approximate KNN search that powers the semantic search.
We will cover two ways to create this index: using the Atlas UI and using the MongoDB Python driver.

(Optional) Documentation: Create a Vector Search Index

Now head over to the Atlas UI and create an Atlas Vector Search index using the steps described here. The 'dimensions' field, with value 1536, corresponds to OpenAI's text-embedding-ada-002.

Use the definition given below in the JSON editor on the Atlas UI.

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "dotProduct",
        "type": "knnVector"
      }
    }
  }
}
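
As a brief aside on the "dotProduct" setting: for unit-length vectors, the dot product equals cosine similarity. The sketch below illustrates this with toy two-dimensional vectors. It assumes the stored embeddings are normalized to length 1, which OpenAI documents for its embedding models; the helper names dot and cosine are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Two unit-length toy vectors standing in for 1536-dimensional embeddings.
a = [0.6, 0.8]
b = [1.0, 0.0]

# For unit-length inputs both measures agree up to floating-point error.
print(math.isclose(dot(a, b), cosine(a, b)))
```

If your vectors are not normalized, the two measures rank results differently, so the similarity setting should match how the embeddings were produced.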

(Optional) Alternatively, we can use the pymongo driver to create the vector search index programmatically.
The notebook's Python cell for this step creates the index (this only works with the most recent version of the Python driver for MongoDB and an Atlas cluster running MongoDB server version 7.0 or higher).
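
A sketch of the programmatic route follows. It reuses the JSON definition above and passes it to pymongo's Collection.create_search_index, which accepts a dictionary with "definition" and an optional "name". The index name "vector_index" and the helper names build_index_model and create_vector_index are assumptions for illustration.

```python
EMBEDDING_FIELD_NAME = "embedding"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"  # hypothetical index name


def build_index_model(field_name: str = EMBEDDING_FIELD_NAME,
                      index_name: str = ATLAS_VECTOR_SEARCH_INDEX_NAME) -> dict:
    """Build the search index model matching the JSON definition above."""
    return {
        "name": index_name,
        "definition": {
            "mappings": {
                "dynamic": True,
                "fields": {
                    field_name: {
                        "dimensions": 1536,
                        "similarity": "dotProduct",
                        "type": "knnVector",
                    }
                },
            }
        },
    }


def create_vector_index(collection, model=None):
    """Create the index on a pymongo Collection and return the driver's result."""
    return collection.create_search_index(model or build_index_model())
```

Index creation in Atlas is asynchronous, so the index may take a short while to become queryable after this call returns.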

Step 5: Query your data

The query returns movies whose plots are semantically similar to the text in the query string, rather than movies that merely match its keywords.
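
A query along these lines could be sketched as below, using the $vectorSearch aggregation stage mentioned in Step 1. The helper names, the index name "vector_index", and the numCandidates and limit values are illustrative assumptions. query_vector is assumed to be the embedding of the query string, produced by the same model used for the stored documents, and the index must be one that $vectorSearch can use on your cluster version.

```python
def build_vector_search_pipeline(query_vector, index_name: str = "vector_index",
                                 path: str = "embedding",
                                 num_candidates: int = 50, limit: int = 5) -> list:
    """Build an aggregation pipeline that runs an approximate KNN search."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                # Candidates considered during the approximate search;
                # should be at least as large as `limit`.
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep the output small and expose the similarity score.
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def semantic_search(collection, query_vector, **kwargs) -> list:
    """Run the pipeline on a pymongo Collection and return matching documents."""
    return list(collection.aggregate(build_vector_search_pipeline(query_vector, **kwargs)))
```

Raising numCandidates generally improves recall at the cost of latency, so it is worth tuning against your own data.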

(Optional) Documentation: Run Vector Search Queries
