Prompt Chain

Generate Embeddings from Large Datasets

A Jupyter notebook workflow that retrieves vector embeddings from datasets, demonstrated with 1,000 Amazon fine-food reviews for text encoding and reuse.

Works with openai, pandas, transformers, torch, scikit-learn

Spark score: 77 out of 100
Updated 3 months ago
Version 1.0.0

Why it matters

Process large text datasets to generate vector embeddings for downstream machine learning tasks. This asset is ideal for preparing unstructured data for analysis or model training.

Outcomes

What it gets done

01

Load and preprocess text data from a specified dataset.

02

Generate vector embeddings for combined review summaries and text.

03

Save generated embeddings for efficient future reuse.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-getembeddingsfromdataset | bash

Steps

Steps in the chain

01
Load the dataset

The dataset used in this example is fine-food reviews from Amazon. It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

We will combine the review summary and review text into a single combined text. The model will encode this combined text and output a single vector embedding.

To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.
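The loading and combining step described above might look roughly like the sketch below. It is not the notebook's own code: the `Time` column used to pick the most recent reviews, the `prepare_reviews` helper, and the "Title: ...; Content: ..." layout of the combined text are assumptions made for illustration, and a tiny in-memory frame stands in for the real CSV file.

```python
import pandas as pd


def prepare_reviews(df: pd.DataFrame, top_n: int = 1000) -> pd.DataFrame:
    """Keep the top_n most recent reviews and add a `combined` text column."""
    # Assumption: the dataset has a sortable `Time` column (larger = more recent).
    cols = ["Time", "ProductId", "UserId", "Score", "Summary", "Text"]
    df = df[cols].dropna()
    df = df.sort_values("Time").tail(top_n).copy()
    # Assumption: this title/content layout is one reasonable way to merge the fields.
    df["combined"] = (
        "Title: " + df["Summary"].str.strip()
        + "; Content: " + df["Text"].str.strip()
    )
    return df


# Small stand-in for the review data; in practice this would come from pd.read_csv.
sample = pd.DataFrame(
    {
        "Time": [1300000000, 1350000000, 1340000000],
        "ProductId": ["P1", "P2", "P3"],
        "UserId": ["U1", "U2", "U3"],
        "Score": [5, 1, 4],
        "Summary": [" Great ", "Stale", "Tasty"],
        "Text": ["Loved it", "Arrived stale ", "Would buy again"],
    }
)

prepared = prepare_reviews(sample, top_n=2)
print(prepared[["ProductId", "combined"]])
```

With the real dataset, `top_n=1000` would select the subset described above.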

02
Get embeddings and save them for future reuse

Get embeddings from the combined review text and save them for future reuse.
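A minimal sketch of this step follows, assuming the combined text lives in a `combined` column. The helper names (`add_embeddings`, `embed_with_openai`, `fake_embed`), the batch size, the model name `text-embedding-3-small`, and the output file name are all illustrative assumptions rather than details taken from the notebook. The embedding function is passed in as a parameter so the example runs offline with a stand-in; `embed_with_openai` shows how the OpenAI Python client could be plugged in instead.

```python
import os
import tempfile
from typing import Callable, List, Sequence

import pandas as pd

EmbedFn = Callable[[Sequence[str]], List[List[float]]]


def embed_with_openai(texts: Sequence[str],
                      model: str = "text-embedding-3-small") -> List[List[float]]:
    """Embed texts with the OpenAI API (model name is an assumption)."""
    from openai import OpenAI  # imported lazily so the rest runs without the package

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=list(texts))
    return [item.embedding for item in response.data]


def add_embeddings(df: pd.DataFrame, embed_fn: EmbedFn,
                   batch_size: int = 100) -> pd.DataFrame:
    """Embed the `combined` column in batches and attach an `embedding` column."""
    texts = df["combined"].tolist()
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    out = df.copy()
    out["embedding"] = pd.Series(vectors, index=out.index)
    return out


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    """Offline stand-in that returns a 2-dimensional dummy vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


reviews = pd.DataFrame(
    {"combined": ["Title: Great; Content: Loved it",
                  "Title: Stale; Content: Arrived stale"]}
)
embedded = add_embeddings(reviews, fake_embed, batch_size=1)

# Hypothetical output location; any writable path works.
output_path = os.path.join(tempfile.mkdtemp(), "reviews_with_embeddings.csv")
embedded.to_csv(output_path, index=False)
```

To call the real API, pass `embed_with_openai` in place of `fake_embed`. Saving to CSV stores each vector as its string representation, so it has to be parsed when the file is read back.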

Overview

Get embeddings from dataset

What it does

Jupyter notebook demonstrating embedding extraction from datasets using Amazon fine-food reviews

How it connects

When you need to generate and save embeddings from text datasets for reuse in ML tasks

Source README

Get embeddings from dataset

This notebook gives an example of how to get embeddings from a large dataset.

1. Load the dataset

The dataset used in this example is fine-food reviews from Amazon. It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding.

To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.

2. Get embeddings and save them for future reuse
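Reusing saved embeddings usually means reading the file back and turning each stored vector into a numeric array. The sketch below assumes the embeddings were written to CSV as the string form of a Python list in an `embedding` column. Both the column name and the inline sample data are illustrative, not taken from the notebook.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# Inline stand-in for a previously saved CSV file.
csv_text = (
    "combined,embedding\n"
    '"Title: Great; Content: Loved it","[0.1, 0.2, 0.3]"\n'
    '"Title: Stale; Content: Arrived stale","[0.3, 0.2, 0.1]"\n'
)

df = pd.read_csv(io.StringIO(csv_text))

# Each cell is a string such as "[0.1, 0.2, 0.3]"; parse it into a NumPy array.
df["embedding"] = df["embedding"].apply(lambda s: np.array(literal_eval(s)))

# Stack the per-row vectors into one (n_reviews, n_dimensions) matrix.
matrix = np.vstack(df["embedding"].tolist())
print(matrix.shape)
```

The resulting matrix can be passed directly to libraries such as scikit-learn for clustering or classification.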

