How does this handle the fact that GPT-4.1-mini doesn't accept video directly?

It extracts static frames from the video using OpenCV's cv2.VideoCapture, JPEG-encodes and base64-encodes them, then sends a sampled subset (every 25th frame) to the model's vision input alongside text prompts.

What does the second example do with the generated script?

It passes the voiceover script to OpenAI's text-to-speech endpoint using the gpt-4o-mini-tts model with detailed natural-language instructions controlling voice affect, tone, pacing, and emotional register, producing a synthesized audio voiceover in WAV format.

When should I NOT use this pattern?

This pattern is not suited for tasks requiring frame-accurate temporal understanding, such as exact motion timing or frame-by-frame event detection, because sparse frame sampling loses fine-grained temporal detail between sampled frames.

What input does this prompt chain expect?

The input is a local video file, and the output is either a natural-language description of the video's content or a styled narration script converted into synthesized speech audio in WAV format.

Prompt Chain

Narrate Videos with AI-Generated Voiceovers

Name: Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API
Availability: OnlineOnly
Author: OpenAI Cookbook

Describe and narrate a video using GPT-4.1-mini's vision on extracted frames plus GPT-4o TTS for voiceover.

Copy chain

Works with openai

OpenAI Cookbook

Maintainer?

Spark score

out of 100

Updated 19 days ago

Version 1.0.0

Models

gpt 4ogpt 4

Add to Favorites

Why it matters

Transform video content into engaging narratives by leveraging advanced AI to describe visual scenes and generate human-like voiceovers.

Outcomes

What it gets done

Extract key frames from video content.

Generate descriptive summaries of video scenes using GPT-4 vision.

Create scripts for video narration.

Produce AI-generated voiceovers using GPT-4o TTS API.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-gptwithvisionforvideounderstanding | bash

Steps

Steps in the chain

Extract video frames using OpenCV

Get video description with GPT-4.1-mini

Generate voiceover script with GPT-4.1-mini

Generate audio with GPT-4o TTS API

Overview

Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API

An OpenAI Cookbook notebook that describes and narrates video content by extracting frames with OpenCV, prompting GPT-4.1-mini's vision capabilities, and generating voiceover audio with GPT-4o TTS. Use it to get a text description or narrated voiceover of video content without a video-native model. Not every frame needs sending - sampling is enough for the model to understand the scene.

What it does

This notebook shows how to apply GPT-4.1-mini's visual capabilities to a video, even though the model doesn't accept video input directly. Frames are extracted from the video with OpenCV, and because GPT-4.1-mini has a 1M-token context window, a whole set of static frames can be described at once rather than one at a time. It covers two examples: getting a description of a video, and generating a spoken voiceover for it with the GPT-4o TTS API.

When to use - and when NOT to

Use this pattern when you want a text or audio description of video content without a model that natively ingests video - GPT-4.1-mini works from a curated set of extracted frames instead. Not every frame needs to be sent for the model to understand what's happening in the video, so frame sampling matters more than exhaustive frame coverage.

Inputs and outputs

The worked example uses a nature video of bisons and wolves: frames are extracted with OpenCV and displayed to confirm they were read correctly, then passed to GPT-4.1-mini with a prompt asking for a description of the scene. For the voiceover example, the same frames are used to prompt GPT-4.1-mini for a short narration script written in the style of David Attenborough.

Integrations

The generated script is passed to the GPT-4o TTS API along with instructions on how the voice should sound, producing narrated audio for the video. Voice styles and instructions can be experimented with directly at OpenAI.fm before wiring them into the pipeline.

Who it's for

Developers who want to generate descriptions or narrated voiceovers for video content using GPT's vision and text-to-speech capabilities, without a video-native model.

Source README

Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API

This notebook demonstrates how to use GPT's visual capabilities with a video. Although GPT-4.1-mini doesn't take videos as input directly, we can use vision and the 1M token context window to describe the static frames of a whole video at once. We'll walk through two examples:

Using GPT-4.1-mini to get a description of a video
Generating a voiceover for a video with GPT-4o TTS API

1. Using GPT's visual capabilities to get a description of a video

First, we use OpenCV to extract frames from a nature video containing bisons and wolves:

Display frames to make sure we've read them in correctly:

Once we have the video frames, we craft our prompt and send a request to GPT (Note that we don't need to send every frame for GPT to understand what's going on):

2. Generating a voiceover for a video with GPT-4.1 and the GPT-4o TTS API

Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames we prompt GPT to give us a short script:

Now, we can work with the GPT-4o TTS model and provide it a set of instructions on how the voice should sound. You can play around with the voice models and instructers at OpenAI.fm. We can then pass in the script we generated above with GPT-4.1-mini and generate audio of the voiceover:

FAQ

Common questions

Discussion

Narrate Videos with AI-Generated Voiceovers

What it gets done

Add it to your toolbox

Steps in the chain

Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API

What it does

When to use - and when NOT to

Inputs and outputs

Integrations

Who it's for

Processing and narrating a video with GPT-4.1-mini's visual capabilities and GPT-4o TTS API

1. Using GPT's visual capabilities to get a description of a video

2. Generating a voiceover for a video with GPT-4.1 and the GPT-4o TTS API

Common questions

Questions & comments · 0