What dataset does this clustering prompt work with?

It works with the fine-food-reviews-with-embeddings dataset from a companion notebook. The dataset must have a precomputed embedding column plus text, score, and summary columns for cluster naming.

How does this prompt name the discovered clusters?

It samples 5 reviews from each cluster, formats them into a prompt, and asks GPT-4 (with temperature=0) to identify a one-line theme describing what the reviews have in common.

What visualization does this produce?

It creates a 2D t-SNE visualization of the clusters, plotting each cluster in a distinct color with its centroid marked to show how distinct the clusters are.

How many clusters does this create by default?

The default is 4 clusters, set via the `n_clusters` parameter in scikit-learn's KMeans, though this can be adjusted to fit your use case.

What libraries does this prompt use?

It uses scikit-learn (KMeans and TSNE), matplotlib for visualization, pandas/numpy for data handling, and the OpenAI API (gpt-4) for cluster naming.

Prompt Chain

Cluster and Name Data Groups

Name: Clustering
Availability: OnlineOnly
Author: OpenAI Cookbook

Cluster OpenAI embeddings with k-means, visualize them in 2D, and auto-name clusters with GPT-4.

Copy chain

Works with openai

OpenAI Cookbook

Own this? Claim it

Spark score

out of 100

Updated 23 days ago

Version 1.0.0

Models

gpt 4ogpt 4

Add to Favorites

Why it matters

Discover hidden groupings within your data and automatically generate descriptive names for each cluster. This helps in understanding and categorizing complex datasets.

Outcomes

What it gets done

Perform K-means clustering on datasets.

Extract representative samples from identified clusters.

Use GPT-4 to generate descriptive names for each cluster.

Visualize cluster groupings in a 2D projection.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-clustering | bash

Steps

Steps in the chain

Find the clusters using K-means

Text samples in the clusters & naming the clusters

Overview

Clustering

An OpenAI Cookbook notebook that clusters embeddings with k-means, visualizes the clusters in 2D, and uses GPT-4 to name each cluster from a sample of its contents. Use it when you already have embeddings and want to discover natural groupings in unlabeled data. Cluster count changes what patterns surface - more clusters find finer distinctions.

What it does

This OpenAI Cookbook notebook demonstrates k-means clustering on embeddings (produced by the companion Get_embeddings_from_dataset notebook) to discover hidden groupings in a dataset. It runs the simplest form of k-means, visualizes the resulting clusters in a 2D projection, then uses GPT-4 to name each cluster based on a random sample of 5 reviews drawn from it.

When to use - and when NOT to

Use this notebook when you already have embeddings for a dataset and want to discover natural groupings within it - for example, distinct themes in a set of reviews - rather than working from predefined categories. The number of clusters you choose changes what the clustering surfaces: more clusters focus on more specific patterns, while fewer clusters focus on the largest overall discrepancies in the data, so clusters won't necessarily line up neatly with whatever categories you originally had in mind.

Inputs and outputs

The workflow has two stages: first find the clusters with k-means and visualize them in a 2D projection (the notebook's own example run shows one cluster standing out visually from the rest), then pull a handful of text samples per cluster and pass them to GPT-4 to generate a descriptive name for each cluster based on that sample.

Integrations

This notebook is a direct continuation of Get_embeddings_from_dataset - it consumes the embeddings that notebook produces rather than generating its own. GPT-4 is used purely for the cluster-naming step, not for the clustering itself, which is done with a standard k-means algorithm.

Who it's for

Data analysts and developers who want to explore unlabeled text data for hidden themes or segments, using embeddings plus k-means to find the groups and an LLM to make sense of what each group represents.

Source README

K-means Clustering in Python using OpenAI

We use a simple k-means algorithm to demonstrate how clustering can be done. Clustering can help discover valuable, hidden groupings within the data. The dataset is created in the Get_embeddings_from_dataset Notebook.

1. Find the clusters using K-means

We show the simplest use of K-means. You can pick the number of clusters that fits your use case best.

Visualization of clusters in a 2d projection. In this run, the green cluster (#1) seems quite different from the others. Let's see a few samples from each cluster.

2. Text samples in the clusters & naming the clusters

Let's show random samples from each cluster. We'll use gpt-4 to name the clusters, based on a random sample of 5 reviews from that cluster.

It's important to note that clusters will not necessarily match what you intend to use them for. A larger amount of clusters will focus on more specific patterns, whereas a small number of clusters will usually focus on largest discrepencies in the data.

FAQ

Common questions

Discussion