Cluster and Describe Transactional Data
Multi-step prompt workflow that uses embeddings and K-Means clustering to automatically group unlabeled transaction data, then generates human-readable cluster descriptions.
Why it matters
Use unsupervised learning to cluster unlabeled transactional data by its embeddings, then use an LLM to generate a human-readable description for each cluster so that previously unclassified transactions can be labeled.
Outcomes
What it gets done
Generate embeddings for transactional data.
Apply K-Means clustering to group similar transactions.
Use an LLM to create descriptive labels for the identified clusters.
Visualize the clusters and refine them to improve classification.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-clusteringfortransactionclassification | bash
Steps
Steps in the chain
Prepare the environment and data for clustering analysis. Set up the embeddings created with the approach from the Multiclass classification for transactions Notebook, applied to the full 359 transactions in the dataset.
Reuse the approach from the Clustering Notebook, applying K-Means to the feature embeddings created previously. Then use the Completions endpoint to generate cluster descriptions and judge their effectiveness.
Overview
Clustering for Transaction Classification
What it does
A notebook demonstrating how to apply K-Means clustering to transaction embeddings and use GPT-3 to generate human-readable descriptions for the resulting clusters, enabling the labeling of previously unlabeled data.
How it connects
Use this when you have unlabeled transaction data with features that can be grouped into meaningful categories, and you need to generate interpretable cluster descriptions to create labels for classification tasks.
Source README
Clustering for Transaction Classification
This notebook covers use cases where your data is unlabelled but has features that can be used to cluster it into meaningful categories. The challenge with clustering is describing, in human-readable terms, the features that make each cluster stand out, and that is where we'll use GPT-3 to generate meaningful cluster descriptions for us. We can then use these to apply labels to a previously unlabelled dataset.
To feed the model we use embeddings created with the approach shown in the Multiclass classification for transactions Notebook, applied to the full 359 transactions in the dataset to give us a bigger pool for learning.
Setup
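As a rough illustration of this setup step, the sketch below parses embeddings that were saved as text in a CSV file back into numeric vectors. It is not the notebook's own code: the column names (description, embedding), the sample rows, and the load_embeddings helper are all hypothetical, and the real dataset has 359 rows with much longer vectors.

```python
import ast
import csv
import io

# Hypothetical stand-in for a CSV of transactions with precomputed embeddings.
# Column names and values are invented for illustration only.
SAMPLE_CSV = '''description,embedding
"Legal deposit services","[0.9, 0.1, 0.0]"
"Office chairs","[0.1, 0.8, 0.1]"
"Software licence renewal","[0.0, 0.2, 0.9]"
'''


def load_embeddings(handle):
    """Return (descriptions, vectors) parsed from a CSV file-like object."""
    descriptions, vectors = [], []
    for row in csv.DictReader(handle):
        # The embedding is stored as text such as "[0.9, 0.1, 0.0]".
        vector = ast.literal_eval(row["embedding"])
        descriptions.append(row["description"])
        vectors.append([float(x) for x in vector])
    # Every embedding must share one dimensionality before clustering.
    sizes = {len(v) for v in vectors}
    if len(sizes) > 1:
        raise ValueError(f"inconsistent embedding sizes: {sorted(sizes)}")
    return descriptions, vectors


descriptions, vectors = load_embeddings(io.StringIO(SAMPLE_CSV))
print(len(vectors), "transactions,", len(vectors[0]), "dimensions")
```

With a real file, the io.StringIO object would be replaced by an open file handle; checking the vector sizes up front catches rows whose embedding failed to generate.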
Clustering
We'll reuse the approach from the Clustering Notebook, applying K-Means to the feature embeddings we created previously. We'll then use the Completions endpoint to generate cluster descriptions for us and judge their effectiveness.
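The two stages described here can be sketched as follows, under stated assumptions. To stay dependency-free, the example uses a small hand-written K-Means (Lloyd's algorithm with a deterministic farthest-first initialisation) rather than a library implementation, which is what you would normally reach for. The toy 2-D points, the transaction texts, and the prompt wording in build_prompt are all invented, and the call to the Completions endpoint is deliberately left out: the function only builds the text that would be sent.

```python
import math


def farthest_first(points, k):
    """Deterministic initialisation: start at the first point, then keep
    adding the point farthest from every centroid chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


def build_prompt(samples):
    """Hypothetical prompt asking a model to name the theme of one cluster."""
    joined = "\n".join(f"- {s}" for s in samples)
    return (
        "What do the following transactions have in common?\n\n"
        f"Transactions:\n{joined}\n\nTheme:"
    )


# Toy stand-ins for embeddings and transaction descriptions.
points = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
          [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
texts = ["Legal deposit services", "Archive shelving", "Book conservation",
         "Laptop purchase", "Software licence", "Server hosting"]

labels, centroids = kmeans(points, k=2)
for cluster in sorted(set(labels)):
    samples = [t for t, lab in zip(texts, labels) if lab == cluster]
    print(f"--- cluster {cluster} ---")
    print(build_prompt(samples))
```

Sending only a handful of sampled transactions per cluster keeps the prompt short; judging the returned descriptions against the cluster members is still a manual step.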
Conclusion
We now have five new clusters that we can use to describe our data. Looking at the visualisation, some of our clusters overlap and we'll need some tuning to get to the right place, but we can already see that GPT-3 has made some effective inferences. In particular, it picked up that items including legal deposits were related to literature archival, which is true even though the model was given no clues about it. Very cool, and with some tuning we can create a base set of clusters that we can then use with a multiclass classifier to generalise to other transactional datasets we might use.
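One way to put a number on the overlap mentioned above is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The source does not say which tuning method it has in mind, so this is only a possible approach; the silhouette_score helper below is a hand-written stdlib sketch and the points are invented.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.
    Near 1: clusters are well separated. Near 0 or below: clusters overlap."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two tight groups far apart, labelled once sensibly and once badly.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good_score = silhouette_score(points, [0, 0, 1, 1])
bad_score = silhouette_score(points, [0, 1, 0, 1])
print("sensible labelling:", round(good_score, 3))
print("poor labelling:    ", round(bad_score, 3))
```

Computing this score for several candidate cluster counts and comparing the results is one possible starting point for the tuning the conclusion calls for.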