What AI models does this prompt chain use?

It uses GPT-4o mini for tagging and captioning images, with gpt-4o or gpt-4-turbo as alternatives. For embedding keywords and captions, it uses OpenAI's text-embedding-3-large model.

How does the tagging process work?

GPT-4o mini receives a product image and title, then returns relevant keywords as a Python array covering item type, material, style, and color. New keywords are deduplicated against existing ones using embeddings and a 0.6 cosine similarity threshold, so near-duplicates like 'wooden' collapse into canonical terms like 'wood'.

How long does it take to process the example dataset?

Processing the full 312-row dataset takes a while; the source runs the tagging and captioning pipeline on just the first 50 rows (about 20 minutes), with embedding taking an additional 3 minutes. Pre-processed CSVs are available to skip these steps.

What are the inputs and outputs of this prompt chain?

Input is a table of product images with titles; for search, either a text query or an input image. Output per item includes a deduplicated keyword list, an intermediate image description, and a one-sentence caption; search queries return a ranked list of similar items with cosine-similarity scores.

What should I do if the few-shot captioning examples don't match my desired style?

The source recommends fine-tuning a model rather than continuing to adjust prompt examples if the few-shot approach doesn't produce captions in your target style or tone.

Prompt Chain

Tag and Caption Images for Enhanced Search

Name: Using GPT-4o mini to tag & caption images
Availability: OnlineOnly
Author: OpenAI Cookbook

Tag and caption product images with GPT-4o mini vision, then power text- or image-based product search.

Copy chain

Works with github openai

OpenAI Cookbook

Maintainer?

Spark score

out of 100

Updated 15 days ago

Version 1.0.0

Models

gpt 4ogpt 4

Add to Favorites

Why it matters

Leverage multimodal AI to automatically tag and caption images, significantly improving search capabilities for product catalogs and other image-heavy datasets.

Outcomes

What it gets done

Generate relevant tags for products using image and text context.

Create descriptive captions from image descriptions.

Implement image search using generated tags and captions.

Utilize embeddings for keyword deduplication and similarity matching.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-tagcaptionimageswithgpt4v | bash

Steps

Steps in the chain

Tag images

Extract keywords

Looking up existing keywords

Generate captions

Describing images with GPT-4o mini

Turning descriptions into captions

Preparing the dataset

Embedding captions and keywords

Search from input text

Search from image

Overview

Using GPT-4o mini to tag & caption images

An OpenAI Cookbook notebook that tags and captions product images with GPT-4 vision models, then uses the results to power text- and image-based product search. Use it to generate searchable tags and captions from catalog images. Processing the full dataset takes a while - test on a subset or load the pre-processed data first.

What it does

This notebook uses GPT-4* vision models (gpt-4o, gpt-4o-mini, or gpt-4-turbo) to tag and caption images, using a dataset of Amazon furniture items as the worked example. It extracts keywords from each image with a zero-shot approach, deduplicates similar keywords via embeddings, generates a longer image description with GPT-4o mini, then condenses that description into a short caption using a few-shot prompting approach with GPT-4-turbo (fine-tuning is suggested as an alternative if few-shot examples aren't enough to match a specific style or tone).

When to use - and when NOT to

Use this when you need searchable text (tags and captions) generated from product or catalog images, particularly for search or recommendation use cases. The full 312-line dataset takes a while to process (the notebook suggests testing on the first 50 lines, about 20 minutes, with embedding generation adding another ~3 minutes), so plan for that runtime or load the already-processed dataset the notebook provides instead of reprocessing from scratch.

Inputs and outputs

Keyword extraction combines the image with the product title specifically to avoid tagging other items that happen to appear in the same scene. Once tags and captions exist, embeddings of the combined keywords and captions support two search modes: comparing a user's text query directly against those embeddings, or - for an image-based query - first generating a caption for the input image, then comparing that caption's embedding the same way.

Integrations

The tagging, captioning, and search pieces compose into a single pipeline: image plus title goes in, deduplicated keywords and a short caption come out, and both feed an embeddings-based search index queryable by either text or image. The notebook suggests combining rule-based keyword filtering with embeddings-based caption search as a refinement, and notes the same tag-and-caption technique generalizes beyond product search to other unstructured-image use cases, including RAG applications over image data.

Who it's for

Developers building image-aware search or recommendation systems - especially e-commerce catalogs - who need generated tags and captions to make visual content searchable by text or by example image. Both the keyword-extraction and captioning stages are tested on a handful of examples first before running across the full dataset, which is a reasonable workflow to reuse: validate the prompt on a small sample, confirm the tags and captions look right, then scale up to the full catalog rather than committing to a multi-hour run against unproven prompts.

Source README

Using GPT-4o mini to tag & caption images

This notebook explores how to leverage the vision capabilities of the GPT-4* models (for example gpt-4o, gpt-4o-mini or gpt-4-turbo) to tag & caption images.

We can leverage the multimodal capabilities of these models to provide input images along with additional context on what they represent, and prompt the model to output tags or image descriptions. The image descriptions can then be further refined with a language model (in this notebook, we'll use gpt-4o-mini) to generate captions.

Generating text content from images can be useful for multiple use cases, especially use cases involving search.
We will illustrate a search use case in this notebook by using generated keywords and product captions to search for products - both from a text input and an image input.

As an example, we will use a dataset of Amazon furniture items, tag them with relevant keywords and generate short, descriptive captions.

Setup

Tag images

In this section, we'll use GPT-4o mini to generate relevant tags for our products.

We'll use a simple zero-shot approach to extract keywords, and deduplicate those keywords using embeddings to avoid having multiple keywords that are too similar.

We will use a combination of an image and the product title to avoid extracting keywords for other items that are depicted in the image - sometimes there are multiple items used in the scene and we want to focus on just the one we want to tag.

Extract keywords

Testing with a few examples

Looking up existing keywords

Using embeddings to avoid duplicates (synonyms) and/or match pre-defined keywords

Testing with example keywords

Generate captions

In this section, we'll use GPT-4o mini to generate an image description and then use a few-shot examples approach with GPT-4-turbo to generate captions from the images.

If few-shot examples are not enough for your use case, consider fine-tuning a model to get the generated captions to match the style & tone you are targeting.

Describing images with GPT-4o mini

Testing on a few examples

Turning descriptions into captions

Using a few-shot examples approach to turn a long description into a short image caption

Testing on a few examples

Image search

In this section, we will use generated keywords and captions to search items that match a given input, either text or image.

We will leverage our embeddings model to generate embeddings for the keywords and captions and compare them to either input text or the generated caption from an input image.

Preparing the dataset

Processing all 312 lines of the dataset will take a while.
To test out the idea, we will only run it on the first 50 lines: this takes ~20 mins.
Feel free to skip this step and load the already processed dataset (see below).

Embedding captions and keywords

We can now use the generated captions and keywords to match relevant content to an input text query or caption.
To do this, we will embed a combination of keywords + captions.
Note: creating the embeddings will take ~3 mins to run. Feel free to load the pre-processed dataset (see below).

Search from input text

We can compare the input text from a user directly to the embeddings we just created.

Search from image

If the input is an image, we can find similar images by first turning images into captions, and embedding those captions to compare them to the already created embeddings.

Wrapping up

In this notebook, we explored how to leverage the multimodal capabilities of gpt-4o-mini to tag and caption images. By providing images along with contextual information to the model, we were able to generate tags and descriptions that can be further refined to create captions. This process has practical applications in various scenarios, particularly in enhancing search functionalities.

The search use case illustrated can be directly applied to applications such as recommendation systems, but the techniques covered in this notebook can be extended beyond items search and used in multiple use cases, for example RAG applications leveraging unstructured image data.

As a next step, you could explore using a combination of rule-based filtering with keywords and embeddings search with captions to retrieve more relevant results.

FAQ

Common questions

Discussion

Tag and Caption Images for Enhanced Search

What it gets done

Add it to your toolbox

Steps in the chain

Using GPT-4o mini to tag & caption images

What it does

When to use - and when NOT to

Inputs and outputs

Integrations

Who it's for

Using GPT-4o mini to tag & caption images

Setup

Tag images

Extract keywords

Testing with a few examples

Looking up existing keywords

Testing with example keywords

Generate captions

Describing images with GPT-4o mini

Testing on a few examples

Turning descriptions into captions

Testing on a few examples

Image search

Preparing the dataset

Embedding captions and keywords

Search from input text

Search from image

Wrapping up

Common questions

Questions & comments · 0