Transcribe Audio with Realtime Model
Use the OpenAI Realtime model for superior, context-aware audio transcription, ensuring accuracy for downstream tasks and avoiding mismatches with separate
Why it matters
Achieve highly accurate, context-aware audio transcription by leveraging the OpenAI Realtime model itself, bypassing the limitations of separate transcription services for more reliable downstream actions.
Outcomes
What it gets done
Stream microphone audio to an OpenAI Realtime voice agent.
Generate high-quality text transcripts using the same Realtime model.
Ensure transcription accuracy by utilizing full session context.
Avoid inconsistencies between spoken words and agent interpretation.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-realtimeoutofbandtranscription | bash
Steps in the chain
Understand the advantages of using the Realtime model for out-of-band transcription instead of separate ASR models. Key benefits include: reduced mismatch between transcription and generation using the same model, greater steerability with custom instructions, and session context awareness for improved accuracy. Consider trade-offs: Realtime transcription costs approximately $48 per 1M tokens vs $16 for GPT-4o transcription, but provides better quality and consistency.
Ensure your environment meets requirements: Python 3.10 or later, PortAudio (brew install portaudio on macOS), Python dependencies (pip install sounddevice websockets), and OpenAI API Key with Realtime API access. Set your API key as an environment variable: export OPENAI_API_KEY=sk-...
Create two distinct prompts: (1) Voice Agent Prompt (REALTIME_MODEL_PROMPT) for Speech-to-Speech interactions, and (2) Transcription Prompt (REALTIME_MODEL_TRANSCRIPTION_PROMPT) that silently returns a precise, verbatim transcript of the user's most recent speech turn. Iterate on the transcription prompt to tailor it to your specific use case.
Define core configuration including imports, audio and model defaults, and constants for transcription event handling.
Configure the Realtime session (session.update) with audio input/output, server-side VAD, and built-in transcription. Trigger out-of-band transcription via response.create after user input audio is committed (input_audio_buffer.committed) using conversation: 'none' and output_modalities: ['text'] to avoid writing to main conversation state.
Define audio streaming functions: encode_audio (base64 helper), playback_audio (play assistant audio on default output device), send_audio_from_queue (send buffered mic audio to input_audio_buffer), and stream_microphone_audio (capture PCM16 from mic and feed the queue).
Generate two transcripts for each user turn: Realtime model transcript from out-of-band response.create call, and built-in ASR transcript from standard transcription model. Align and display both clearly in terminal output for comparison.
Implement listen_for_events function to drive the session: watch for speech_started/speech_stopped/committed events, send out-of-band transcription request when user turn finishes, calculate token usage and cost for both transcription methods, stream assistant audio to playback queue, and buffer text deltas per response_id.
Execute the code to view Realtime model transcription vs transcription model transcriptions. The script loads configuration and prompts, establishes WebSocket connection, starts concurrent tasks (listen_for_events, stream_microphone_audio, playback_audio), mutes mic when assistant speaks, and prints both transcripts when returned. Run until interrupted.
Overview
Transcribing User Audio with a Separate Realtime Request
What it does
**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio `out-of-band` using the same websocket session connection, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).
Source README
We call this out-of-band transcription using the Realtime model. It’s simply a second response.create request on the same Realtime WebSocket, tagged so it doesn’t write back to the active conversation state. The model runs again with a different set of instructions (a transcription prompt), triggering a new inference pass that’s separate from the assistant’s main speech turn.
It covers how to build a server-to-server client that:
- Streams microphone audio to an OpenAI Realtime voice agent.
- Plays back the agent's spoken replies.
- After each user turn, generates a high-quality text-only transcript using the same Realtime model.
This is achieved via a secondary response.create request:
{
    "type": "response.create",
    "response": {
        "conversation": "none",
        "output_modalities": ["text"],
        "instructions": transcription_instructions
    }
}
This notebook demonstrates using the Realtime model itself for transcription:
- Context-aware transcription: Uses the full session context to improve transcript accuracy.
- Non-intrusive: Runs outside the live conversation, so the transcript is never added back to session state.
- Customizable instructions: Allows tailoring transcription prompts to specific use cases. The Realtime model follows instructions better than the transcription model does.
1. Why use out-of-band transcription?
The Realtime API offers built-in user input transcription, but this relies on a separate ASR model (e.g., gpt-4o-transcribe). Using different models for transcription and response generation can lead to discrepancies. For example:
- User speech transcribed as: "I had otoo accident"
- Realtime response interpreted correctly as: "Got it, you had an auto accident"
Accurate transcriptions can be very important, particularly when:
- Transcripts trigger downstream actions (e.g., tool calls), where errors propagate through the system.
- Transcripts are summarized or passed to other components, risking context pollution.
- Transcripts are displayed to end users, leading to poor user experiences if errors occur.
The potential advantages of using out-of-band transcription include:
- Reduced Mismatch: The same model is used for both transcription and generation, minimizing inconsistencies between what the user says and how the agent responds.
- Greater Steerability: The Realtime model is more steerable, can better follow custom instructions for higher transcription quality, and is not limited by a 1024-token input maximum.
- Session Context Awareness: The model has access to the full session context, so, for example, if you mention your name multiple times, it will transcribe it correctly.
In terms of trade-offs:
Realtime Model (for transcription):
- Audio Input → Text Output: $32.00 per 1M audio tokens + $16.00 per 1M text tokens out
- Cached Session Context: $0.40 per 1M cached context tokens
- Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00
GPT-4o Transcription:
- Audio Input: $6.00 per 1M audio tokens
- Text Input: $2.50 per 1M tokens
- Text Output: $10.00 per 1M tokens
- Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00
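The two totals above follow from simple arithmetic on the list prices. A minimal sketch, using only the prices quoted in this section; the constant and function names are illustrative and not taken from the notebook, and the cached-context and text-input charges are left out, as they are in the totals above:

```python
# Prices in USD per 1M tokens, copied from the figures quoted above.
REALTIME_AUDIO_IN = 32.00
REALTIME_TEXT_OUT = 16.00
TRANSCRIBE_AUDIO_IN = 6.00
TRANSCRIBE_TEXT_OUT = 10.00


def cost_usd(audio_in_tokens, text_out_tokens, audio_in_price, text_out_price):
    """Cost in USD for the given token counts at per-1M-token prices."""
    return (audio_in_tokens * audio_in_price
            + text_out_tokens * text_out_price) / 1_000_000


realtime = cost_usd(1_000_000, 1_000_000, REALTIME_AUDIO_IN, REALTIME_TEXT_OUT)
dedicated = cost_usd(1_000_000, 1_000_000, TRANSCRIBE_AUDIO_IN, TRANSCRIBE_TEXT_OUT)
print(f"Realtime: ${realtime:.2f}  GPT-4o transcription: ${dedicated:.2f}")
```

Plugging in the token counts from your own sessions gives a rough per-session estimate before any caching discount.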
Direct Cost Comparison (see the examples at the end of the cookbook):
- Using full session context: 16-22x (if dedicated transcription costs $0.001/session, Realtime transcription costs about $0.016/session). The cost is higher because every request passes the growing session context. However, that context can help transcription quality.
- Using only the latest user turn: 3-5x (if dedicated transcription costs $0.001/session, Realtime transcription costs about $0.003/session). The cost is lower because you transcribe only the latest user audio turn. However, the model no longer has the session context to draw on for transcription quality.
- Using 1 < N (turns) < full context: between 3x and 20x more expensive, depending on how many turns you decide to keep in context.
Note: These cost estimates are specific to the examples covered in this cookbook. Actual costs may vary depending on factors such as session length, how often context is cached, the ratio of audio to text input, and the details of your particular use case.
Other Considerations:
- Implementing transcription via the Realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.
Note: Out-of-band responses using the Realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.
2. Requirements & Setup
Ensure your environment meets these requirements:
- Python 3.10 or later
- PortAudio (required by sounddevice):
  - macOS: brew install portaudio
- Python Dependencies: pip install sounddevice websockets
- OpenAI API Key (with Realtime API access). Set your key as an environment variable: export OPENAI_API_KEY=sk-...
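Two of the requirements above, the Python version and the API key, can be checked before any audio code runs. A minimal sketch; the function name check_environment is hypothetical and does not come from the notebook, and the PortAudio check is omitted because it depends on the platform:

```python
import os
import sys


def check_environment(env=os.environ, version=sys.version_info):
    """Return the API key, or raise if a requirement above is not met."""
    if tuple(version[:2]) < (3, 10):
        raise RuntimeError("Python 3.10 or later is required")
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running the notebook")
    return api_key
```

Calling check_environment() with no arguments reads the real environment; passing a plain dict and a version tuple makes the check easy to exercise in isolation.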
3. Prompts
We use two distinct prompts:
- Voice Agent Prompt (REALTIME_MODEL_PROMPT): This is an example prompt used with the Realtime model for the speech-to-speech interactions.
- Transcription Prompt (REALTIME_MODEL_TRANSCRIPTION_PROMPT): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate on transcription quality.
For the REALTIME_MODEL_TRANSCRIPTION_PROMPT, you can start from this base prompt, but the goal is for you to iterate on it and tailor it to your use case. Just remember to remove the Policy Number formatting rules, since they might not apply to your use case!
4. Core configuration
We define:
- Imports
- Audio and model defaults
- Constants for transcription event handling
5. Building the Realtime session & the out‑of‑band request
The Realtime session (session.update) configures:
- Audio input/output
- Server‑side VAD
- Built-in transcription (input_audio_transcription_model)
  - We set this so that we can compare it to the Realtime model transcription
The out-of-band transcription is a response.create triggered after the user input audio is committed (input_audio_buffer.committed):
- conversation: "none" - use session state but don't write to the main conversation session state
- output_modalities: ["text"] - get a text transcript only
Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.
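The out-of-band request described in this section can be assembled as a plain dictionary and serialized to JSON. A minimal sketch; the function name build_transcription_request and the example instruction string are assumptions, while the fields themselves (conversation, output_modalities, instructions, and the optional item_reference input) are the ones this document shows:

```python
import json


def build_transcription_request(transcription_instructions, item_ids=None):
    """Build an out-of-band response.create event as a plain dict."""
    response = {
        "conversation": "none",          # do not write to the main conversation
        "output_modalities": ["text"],   # text transcript only
        "instructions": transcription_instructions,
    }
    if item_ids:
        # Restrict the input to specific conversation items (for example the
        # latest user turn) instead of the full session context.
        response["input"] = [
            {"type": "item_reference", "id": item_id} for item_id in item_ids
        ]
    return {"type": "response.create", "response": response}


# Serialized form, ready to send as a text frame on the open WebSocket.
payload = json.dumps(build_transcription_request("Transcribe the last user turn."))
```

Leaving item_ids empty keeps the full session context available to the model; passing the IDs of the last N user items is one way to trade context for cost.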
6. Audio streaming: mic → Realtime → speakers
We now define:
- encode_audio - base64 helper
- playback_audio - play assistant audio on the default output device
- send_audio_from_queue - send buffered mic audio to input_audio_buffer
- stream_microphone_audio - capture PCM16 from the mic and feed the queue
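The base64 helper and the event that carries a microphone chunk can be sketched without any audio hardware. This is an illustration rather than the notebook's code: make_append_event is a hypothetical name, and the four zero bytes stand in for real PCM16 samples from the microphone queue:

```python
import base64
import json


def encode_audio(chunk: bytes) -> str:
    """Base64-encode raw PCM16 bytes for transport in a JSON event."""
    return base64.b64encode(chunk).decode("ascii")


def make_append_event(chunk: bytes) -> str:
    """Wrap one microphone chunk in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": encode_audio(chunk),
    })


# Two 16-bit samples of silence stand in for real microphone data.
event = make_append_event(b"\x00\x00\x00\x00")
```

In the streaming loop, each chunk pulled from the queue would be wrapped this way and sent over the WebSocket.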
7. Extracting and comparing transcripts
The function below enables us to generate two transcripts for each user turn:
- Realtime model transcript: from our out-of-band response.create call.
- Built-in ASR transcript: from the standard transcription model (input_audio_transcription_model).
We align and display both clearly in the terminal:
=== User Turn (Realtime Transcript) ===
...
=== User Turn (Built-in ASR Transcript) ===
...
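Because the two transcripts arrive at different times, the display step has to wait until both are present. One possible way to do that with a shared dictionary is sketched below; keying the dictionary by item ID and the function name record_transcript are assumptions, not details confirmed by the notebook:

```python
def record_transcript(shared_state, item_id, source, text):
    """Store one transcript; return the formatted pair once both exist.

    `source` is "realtime" or "asr". Returns None while the other transcript
    for the same turn is still pending.
    """
    turn = shared_state.setdefault(item_id, {})
    turn[source] = text
    if "realtime" not in turn or "asr" not in turn:
        return None
    del shared_state[item_id]  # the turn is complete, so free its slot
    return (
        "=== User Turn (Realtime Transcript) ===\n"
        f"{turn['realtime']}\n"
        "=== User Turn (Built-in ASR Transcript) ===\n"
        f"{turn['asr']}"
    )


shared_state = {}
first = record_transcript(shared_state, "item_1", "asr", "Hello")
second = record_transcript(shared_state, "item_1", "realtime", "Hello.")
print(second)
```

The first call returns None because only the ASR transcript is known; the second call completes the turn and returns the aligned block.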
8. Listening for Realtime events
listen_for_events drives the session:
- Watches for speech_started / speech_stopped / committed
- Sends the out-of-band transcription request when a user turn finishes (input_audio_buffer.committed) when only_last_user_turn == False
- Sends the out-of-band transcription request when a user turn is added to the conversation (conversation.item.added) when only_last_user_turn == True
- Calculates token usage and cost for both transcription methods
- Streams assistant audio to the playback queue
- Buffers text deltas per response_id
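Two pieces of the event loop above are easy to isolate: choosing which event triggers the out-of-band request, and buffering streamed text per response. A minimal sketch under stated assumptions: the helper and class names are hypothetical, and the response IDs are made up for the example:

```python
from collections import defaultdict


def should_request_transcript(event_type, only_last_user_turn):
    """Pick the trigger event described above for the chosen mode."""
    if only_last_user_turn:
        return event_type == "conversation.item.added"
    return event_type == "input_audio_buffer.committed"


class DeltaBuffer:
    """Accumulate streamed text deltas separately for each response_id."""

    def __init__(self):
        self._parts = defaultdict(list)

    def add(self, response_id, delta):
        self._parts[response_id].append(delta)

    def finish(self, response_id):
        """Return the full text for a response and drop its buffer."""
        return "".join(self._parts.pop(response_id, []))


buffer = DeltaBuffer()
buffer.add("resp_oob", "Hel")       # out-of-band transcript delta
buffer.add("resp_main", "Thanks")   # delta from a different response
buffer.add("resp_oob", "lo.")
transcript = buffer.finish("resp_oob")
```

Keeping one buffer per response_id prevents interleaved deltas from the assistant's main response and the out-of-band transcript from mixing. A real handler would likely also need to confirm that an added conversation item belongs to the user before triggering a request.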
9. Run Script
In this step, we run the code so that we can compare the Realtime model transcript with the transcription model transcript. The code does the following:
- Loads configuration and prompts
- Establishes a WebSocket connection
- Starts concurrent tasks:
  - listen_for_events (handle incoming messages)
  - stream_microphone_audio (send microphone audio)
    - Mutes the mic when the assistant is speaking
  - playback_audio (play assistant responses)
    - Prints the Realtime and transcription model transcripts once both have been returned. It uses shared_state to ensure both are returned before printing.
- Runs the session until you interrupt it
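The concurrency pattern in the list above, several long-running tasks that stop together, can be sketched with asyncio alone. The worker coroutine below is a stand-in for listen_for_events, stream_microphone_audio, and playback_audio; run_session and worker are hypothetical names, and no WebSocket or audio device is involved:

```python
import asyncio


async def run_session(*coroutines):
    """Run all coroutines concurrently and cancel any still running on exit."""
    tasks = [asyncio.create_task(c) for c in coroutines]
    try:
        await asyncio.gather(*tasks)
    finally:
        # Reached on normal completion, on an error, or on cancellation
        # (for example after Ctrl+C); make sure nothing is left running.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def worker(name, ticks, log):
    """Stand-in for one of the three long-running session tasks."""
    for _ in range(ticks):
        await asyncio.sleep(0)  # yield control, as real I/O would
    log.append(name)


log = []
asyncio.run(run_session(
    worker("listen_for_events", 3, log),
    worker("stream_microphone_audio", 1, log),
    worker("playback_audio", 2, log),
))
```

In the real script the three tasks would loop until interrupted, so the cleanup in the finally block is what shuts the session down.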
Output should look like:
[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...
=== User turn (Realtime transcript) ===
Hello.
=== User turn (Transcription model) ===
Hello
=== Assistant response ===
Hello, and thank you for calling. Let's start with your full name, please.
From the above example, we can notice:
- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns. In one of the turns, the transcription model misses "this is important." while the realtime transcription gets it correctly.
- The Realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).
- With context from the entire session, including previous turns where I spelled out my name, the Realtime model accurately transcribes my name when the assistant asked my name again while the transcription model makes errors (e.g., "Minhaj ul Haq").
Example with Cost Calculations
There are significant price differences between the available methods for transcribing user audio. GPT-4o-Transcribe is by far the most cost-effective approach: it charges only for the raw audio input and a small amount of text output, resulting in transcripts that cost just fractions of a cent per turn. In contrast, using the Realtime model for out-of-band transcription is more expensive. If you transcribe only the latest user turn with Realtime, it typically costs about 3-5× more than GPT-4o-Transcribe. If you include the full session context in each transcription request, the cost can increase to about 16-20× higher. This is because each request to the Realtime model processes the entire session context again at higher pricing, and the cost grows as the conversation gets longer.
Cost for Transcribing with Full Session Context
Let's walk through an example that uses full session context for realtime out-of-band transcription:
Transcription Cost Comparison
Costs Summary
- Realtime Out-of-Band (OOB): $0.040974 total (~$0.006829 per turn)
- Dedicated Transcription: $0.002114 total (~$0.000352 per turn)
- OOB is ~19× more expensive using full session context
Considerations
- Caching: Because these conversations are short, you benefit little from caching beyond the initial system prompt.
- Transcription System Prompt: The transcription model uses a minimal system prompt, so input costs would typically be higher.
Recommended Cost-Saving Strategy
- Limit transcription to recent turns: Minimizing audio/text context significantly reduces OOB transcription costs.
Understanding Cache Behavior
- Effective caching requires stable prompt instructions (usually 1,024+ tokens).
- Different instruction prompts between OOB and main assistant sessions result in separate caches.
Cost for Transcribing Only the Latest Turn
You can limit transcription to only the latest user turn by supplying input item_references like this:
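# Fragment from the function that builds the out-of-band request.
# `response` is the payload dict (conversation, output_modalities,
# instructions) built earlier; `item_ids` lists the items to transcribe.
if item_ids:
    response["input"] = [
        {"type": "item_reference", "id": item_id} for item_id in item_ids
    ]

return {
    "type": "response.create",
    "response": response,
}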
Transcribing just the most recent user turn lowers costs by restricting the session context sent to the model. However, this approach has trade-offs: the model won't have access to previous conversation history to help resolve ambiguities or correct errors (for example, accurately recalling a username mentioned earlier). Additionally, because you're always updating which input is referenced, little caching benefit is realized: the cache prefix changes each turn, so you don't accumulate reusable context.
Now, let’s look at a second example that uses only the most recent user audio turn for realtime out-of-band transcription:
Cost Analysis Summary
Realtime Out-of-Band Transcription (OOB)
- Total Cost: $0.013354
- Average per Turn: ~$0.001908
Dedicated Transcription Model
- Total Cost: $0.002630
- Average per Turn: ~$0.000376
Difference in Costs
- Additional cost using OOB: +$0.010724
- Cost Multiplier: OOB is about 5× more expensive than the dedicated transcription model.
This approach costs significantly less than using the full session context. You should evaluate your use case to decide whether regular transcription, out-of-band transcription with full context, or transcribing only the latest turn best fits your needs. You can also choose an intermediate strategy, such as including just the last N turns in the input.
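The multipliers and per-turn averages in the two cost summaries above can be reproduced from the reported totals. A small sketch; the function name compare is illustrative, and the turn counts (6 and 7) are inferred from the reported per-turn averages rather than stated in the source:

```python
def compare(oob_total, dedicated_total, turns):
    """Summarize how much more out-of-band transcription cost in one session."""
    return {
        "oob_per_turn": oob_total / turns,
        "dedicated_per_turn": dedicated_total / turns,
        "extra_cost": oob_total - dedicated_total,
        "multiplier": oob_total / dedicated_total,
    }


# Totals in USD from the two summaries above. The turn counts are an
# assumption inferred from the per-turn averages.
full_context = compare(0.040974, 0.002114, turns=6)
latest_turn = compare(0.013354, 0.002630, turns=7)
print(f"full context: {full_context['multiplier']:.1f}x")
print(f"latest turn:  {latest_turn['multiplier']:.1f}x")
```

Running the same comparison on totals from your own sessions shows where an intermediate strategy, such as keeping the last N turns, would fall between the two.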
Conclusion
Exploring out-of-band transcription could be beneficial for your use case if:
- You're still experiencing unreliable transcriptions, even after optimizing the transcription model prompt.
- You need a more reliable and steerable method for generating transcriptions.
- The current transcripts fail to normalize entities correctly, causing downstream issues.
Keep in mind the trade-offs:
- Cost: Out-of-band (OOB) transcription is more expensive. Be sure that the extra expense makes sense for your typical session lengths and business needs.
- Complexity: Implementing OOB transcription takes extra engineering effort to connect all the pieces correctly. Only choose this approach if its benefits are important for your use case.
If you decide to pursue this method, make sure you:
- Set up the transcription trigger correctly, ensuring it activates after the audio commit.
- Carefully iterate and refine the prompt to align closely with your specific use case and needs.
Documentation:
- https://platform.openai.com/docs/guides/realtime-conversations#create-responses-outside-the-default-conversation
- https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-conversation
- https://platform.openai.com/docs/api-reference/realtime-client-events/response/create#realtime_client_events-response-create-response-output_modalities