Using GPT-4o mini to tag & caption images
This notebook explores how to leverage the vision capabilities of the GPT-4* models (for example `gpt-4o`, `gpt-4o-mini` or `gpt-4-turbo`) to tag & caption images.
Get this prompt chain
Using GPT-4o mini to tag & caption images
This notebook explores how to leverage the vision capabilities of the GPT-4* models (for example gpt-4o, gpt-4o-mini or gpt-4-turbo) to tag & caption images.
We can leverage the multimodal capabilities of these models to provide input images along with additional context on what they represent, and prompt the model to output tags or image descriptions. The image descriptions can then be further refined with a language model (in this notebook, we'll use gpt-4o-mini) to generate captions.
Generating text content from images can be useful for multiple use cases, especially use cases involving search.
We will illustrate a search use case in this notebook by using generated keywords and product captions to search for products - both from a text input and an image input.
As an example, we will use a dataset of Amazon furniture items, tag them with relevant keywords and generate short, descriptive captions.
Setup
Tag images
In this section, we'll use GPT-4o mini to generate relevant tags for our products.
We'll use a simple zero-shot approach to extract keywords, and deduplicate those keywords using embeddings to avoid having multiple keywords that are too similar.
We will use a combination of an image and the product title to avoid extracting keywords for other items that are depicted in the image - sometimes there are multiple items used in the scene and we want to focus on just the one we want to tag.
Extract keywords
Testing with a few examples
Looking up existing keywords
Using embeddings to avoid duplicates (synonyms) and/or match pre-defined keywords
Testing with example keywords
Generate captions
In this section, we'll use GPT-4o mini to generate an image description and then use a few-shot examples approach with GPT-4-turbo to generate captions from the images.
If few-shot examples are not enough for your use case, consider fine-tuning a model to get the generated captions to match the style & tone you are targeting.
Describing images with GPT-4o mini
Testing on a few examples
Turning descriptions into captions
Using a few-shot examples approach to turn a long description into a short image caption
Testing on a few examples
Image search
In this section, we will use generated keywords and captions to search items that match a given input, either text or image.
We will leverage our embeddings model to generate embeddings for the keywords and captions and compare them to either input text or the generated caption from an input image.
Preparing the dataset
Processing all 312 lines of the dataset will take a while.
To test out the idea, we will only run it on the first 50 lines: this takes ~20 mins.
Feel free to skip this step and load the already processed dataset (see below).
Embedding captions and keywords
We can now use the generated captions and keywords to match relevant content to an input text query or caption.
To do this, we will embed a combination of keywords + captions.
Note: creating the embeddings will take ~3 mins to run. Feel free to load the pre-processed dataset (see below).
Search from input text
We can compare the input text from a user directly to the embeddings we just created.
Search from image
If the input is an image, we can find similar images by first turning images into captions, and embedding those captions to compare them to the already created embeddings.
Wrapping up
In this notebook, we explored how to leverage the multimodal capabilities of gpt-4o-mini to tag and caption images. By providing images along with contextual information to the model, we were able to generate tags and descriptions that can be further refined to create captions. This process has practical applications in various scenarios, particularly in enhancing search functionalities.
The search use case illustrated can be directly applied to applications such as recommendation systems, but the techniques covered in this notebook can be extended beyond items search and used in multiple use cases, for example RAG applications leveraging unstructured image data.
As a next step, you could explore using a combination of rule-based filtering with keywords and embeddings search with captions to retrieve more relevant results.