Using GPT-4o mini to tag & caption images

This notebook explores how to leverage the vision capabilities of the GPT-4* models (for example `gpt-4o`, `gpt-4o-mini` or `gpt-4-turbo`) to tag & caption images.

Get this prompt chain

Using GPT-4o mini to tag & caption images

This notebook explores how to leverage the vision capabilities of the GPT-4* models (for example gpt-4o, gpt-4o-mini or gpt-4-turbo) to tag & caption images.

We can leverage the multimodal capabilities of these models to provide input images along with additional context on what they represent, and prompt the model to output tags or image descriptions. The image descriptions can then be further refined with a language model (in this notebook, we'll use gpt-4o-mini) to generate captions.

Generating text content from images can be useful for multiple use cases, especially use cases involving search.
We will illustrate a search use case in this notebook by using generated keywords and product captions to search for products - both from a text input and an image input.

As an example, we will use a dataset of Amazon furniture items, tag them with relevant keywords and generate short, descriptive captions.

Setup

Tag images

In this section, we'll use GPT-4o mini to generate relevant tags for our products.

We'll use a simple zero-shot approach to extract keywords, and deduplicate those keywords using embeddings to avoid having multiple keywords that are too similar.

We will use a combination of an image and the product title to avoid extracting keywords for other items that are depicted in the image - sometimes there are multiple items used in the scene and we want to focus on just the one we want to tag.

Extract keywords

Testing with a few examples

Looking up existing keywords

Using embeddings to avoid duplicates (synonyms) and/or match pre-defined keywords

Testing with example keywords

Generate captions

In this section, we'll use GPT-4o mini to generate an image description and then use a few-shot examples approach with GPT-4-turbo to generate captions from the images.

If few-shot examples are not enough for your use case, consider fine-tuning a model to get the generated captions to match the style & tone you are targeting.

Describing images with GPT-4o mini

Testing on a few examples

Turning descriptions into captions

Using a few-shot examples approach to turn a long description into a short image caption

Testing on a few examples

Image search

In this section, we will use generated keywords and captions to search items that match a given input, either text or image.

We will leverage our embeddings model to generate embeddings for the keywords and captions and compare them to either input text or the generated caption from an input image.

Preparing the dataset

Processing all 312 lines of the dataset will take a while.
To test out the idea, we will only run it on the first 50 lines: this takes ~20 mins.
Feel free to skip this step and load the already processed dataset (see below).

Embedding captions and keywords

We can now use the generated captions and keywords to match relevant content to an input text query or caption.
To do this, we will embed a combination of keywords + captions.
Note: creating the embeddings will take ~3 mins to run. Feel free to load the pre-processed dataset (see below).

Search from input text

We can compare the input text from a user directly to the embeddings we just created.

Search from image

If the input is an image, we can find similar images by first turning images into captions, and embedding those captions to compare them to the already created embeddings.

Wrapping up

In this notebook, we explored how to leverage the multimodal capabilities of gpt-4o-mini to tag and caption images. By providing images along with contextual information to the model, we were able to generate tags and descriptions that can be further refined to create captions. This process has practical applications in various scenarios, particularly in enhancing search functionalities.

The search use case illustrated can be directly applied to applications such as recommendation systems, but the techniques covered in this notebook can be extended beyond items search and used in multiple use cases, for example RAG applications leveraging unstructured image data.

As a next step, you could explore using a combination of rule-based filtering with keywords and embeddings search with captions to retrieve more relevant results.

Comments (0)

Sign In Sign in to leave a comment.

Spark Drops

Weekly picks: best new AI tools, agents & prompts

Venture Crew
Terms of Service

© 2026, Venture Crew