Generate Embeddings from Large Datasets
A Jupyter notebook workflow that generates vector embeddings from a text dataset and saves them for reuse, demonstrated on 1,000 Amazon fine-food reviews.
Why it matters
Process large text datasets to generate vector embeddings for downstream machine learning tasks. This asset is ideal for preparing unstructured data for analysis or model training.
Outcomes
What it gets done
Load and preprocess text data from a specified dataset.
Generate vector embeddings for combined review summaries and text.
Save generated embeddings for efficient future reuse.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-getembeddingsfromdataset | bash
Steps
Steps in the chain
The dataset used in this example is fine-food reviews from Amazon. It contains 568,454 food reviews that Amazon users left up to October 2012. We will use a subset consisting of the 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text). We will combine the review summary and review text into a single combined text. The model encodes this combined text and outputs a single vector embedding. To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a dependency of transformers), torchvision, and scipy.
Get embeddings from the combined review text and save them for future reuse.
Overview
Get embeddings from dataset
What it does
Jupyter notebook demonstrating embedding extraction from datasets using Amazon fine-food reviews
How it connects
When you need to generate and save embeddings from text datasets for reuse in ML tasks
Source README
Get embeddings from dataset
This notebook gives an example of how to get embeddings from a large dataset.
1. Load the dataset
The dataset used in this example is fine-food reviews from Amazon. It contains 568,454 food reviews that Amazon users left up to October 2012. We will use a subset consisting of the 1,000 most recent reviews for illustration purposes. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).
We will combine the review summary and review text into a single combined text. The model encodes this combined text and outputs a single vector embedding.
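The loading and combining step might look roughly like the sketch below. It is not the notebook's own code: the helper name `prepare_reviews`, the `Time` column used to pick the most recent reviews, and the "Title: ...; Content: ..." template are assumptions made for illustration. Only the Summary and Text columns come from the description above.

```python
import pandas as pd


def prepare_reviews(df: pd.DataFrame, n_recent: int = 1000) -> pd.DataFrame:
    """Keep the n_recent newest reviews and add a `combined` text column."""
    # Drop rows that lack either of the two fields we want to combine.
    df = df.dropna(subset=["Summary", "Text"])
    # Assumption: `Time` is a sortable timestamp column, so after an
    # ascending sort the most recent reviews are the last rows.
    df = df.sort_values("Time").tail(n_recent).copy()
    # Assumption: this template is one possible way to join title and body.
    df["combined"] = (
        "Title: " + df["Summary"].str.strip()
        + "; Content: " + df["Text"].str.strip()
    )
    return df


# Hypothetical usage; the file name is a placeholder, not from the source:
# reviews = prepare_reviews(pd.read_csv("fine_food_reviews.csv"))
```

Keeping the selection and the text template in one small function makes it easy to change the subset size or the template before any embeddings are requested.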
To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a dependency of transformers), torchvision, and scipy.
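One way to check that the packages listed above are present before running the notebook is sketched here. The helper name `missing_packages` is hypothetical. The only non-obvious detail is that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# Map each pip package name to the module name used in import statements.
REQUIRED = {
    "pandas": "pandas",
    "openai": "openai",
    "transformers": "transformers",
    "plotly": "plotly",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "torch": "torch",
    "torchvision": "torchvision",
    "scipy": "scipy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```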
2. Get embeddings and save them for future reuse
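A minimal sketch of this step follows, assuming the combined text lives in a `combined` column of a pandas DataFrame. The helper names, the CSV output format, and the default model name `text-embedding-3-small` are assumptions; the source does not say which model or file format the notebook uses. The embedding function is passed in as a parameter so the save and reload logic can be tried offline with a stand-in.

```python
import json

import pandas as pd


def make_openai_embedder(model: str = "text-embedding-3-small"):
    """Return a function that maps one text to one embedding vector.

    Assumption: the model name is illustrative; the source names none.
    """
    from openai import OpenAI  # imported lazily so the rest runs offline

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    def embed(text: str) -> list:
        response = client.embeddings.create(input=[text], model=model)
        return response.data[0].embedding

    return embed


def embed_and_save(df, embed_fn, path, text_col="combined"):
    """Embed every row's text, write the result to CSV, return the new frame."""
    out = df.copy()
    out["embedding"] = out[text_col].apply(embed_fn)
    # Serialize each vector as JSON text so it survives the CSV round trip.
    to_write = out.copy()
    to_write["embedding"] = to_write["embedding"].apply(json.dumps)
    to_write.to_csv(path, index=False)
    return out


def load_embeddings(path):
    """Reload a saved file and parse the embedding column back into lists."""
    df = pd.read_csv(path)
    df["embedding"] = df["embedding"].apply(json.loads)
    return df
```

With an API key set, a call such as `embed_and_save(reviews, make_openai_embedder(), "reviews_with_embeddings.csv")` would make one request per review (the file name is a placeholder). Later sessions can then call `load_embeddings` on the saved file instead of paying for the same embeddings again.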