Prompt Chain

Transcribe Audio with Realtime Model

Use the OpenAI Realtime model for superior, context-aware audio transcription, ensuring accuracy for downstream tasks and avoiding mismatches with separate

Works with openai

91
Spark score
out of 100
Updated 3 months ago
Version 1.0.0
Models

Why it matters

Achieve highly accurate, context-aware audio transcription by leveraging the OpenAI Realtime model itself, bypassing the limitations of separate transcription services for more reliable downstream actions.

Outcomes

What it gets done

01

Stream microphone audio to an OpenAI Realtime voice agent.

02

Generate high-quality text transcripts using the same Realtime model.

03

Ensure transcription accuracy by utilizing full session context.

04

Avoid inconsistencies between spoken words and agent interpretation.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-realtimeoutofbandtranscription | bash

Steps

Steps in the chain

01
Step 1: Why use out-of-band transcription?

Understand the advantages of using the Realtime model for out-of-band transcription instead of separate ASR models. Key benefits include: reduced mismatch between transcription and generation because the same model does both, greater steerability with custom instructions, and session context awareness for improved accuracy. Consider the trade-offs: Realtime transcription costs approximately $48 for 1M audio tokens in plus 1M text tokens out, versus about $16 for GPT-4o transcription, but it can provide better quality and consistency.

02
Step 2: Requirements & Setup

Ensure your environment meets requirements: Python 3.10 or later, PortAudio (brew install portaudio on macOS), Python dependencies (pip install sounddevice websockets), and OpenAI API Key with Realtime API access. Set your API key as an environment variable: export OPENAI_API_KEY=sk-...

03
Step 3: Define Prompts

Create two distinct prompts: (1) Voice Agent Prompt (REALTIME_MODEL_PROMPT) for Speech-to-Speech interactions, and (2) Transcription Prompt (REALTIME_MODEL_TRANSCRIPTION_PROMPT) that silently returns a precise, verbatim transcript of the user's most recent speech turn. Iterate on the transcription prompt to tailor it to your specific use case.

04
Step 4: Core Configuration

Define core configuration including imports, audio and model defaults, and constants for transcription event handling.

05
Step 5: Build Realtime Session & Out-of-Band Request

Configure the Realtime session (session.update) with audio input/output, server-side VAD, and built-in transcription. Trigger out-of-band transcription via response.create after user input audio is committed (input_audio_buffer.committed) using conversation: 'none' and output_modalities: ['text'] to avoid writing to main conversation state.

06
Step 6: Audio Streaming Setup

Define audio streaming functions: encode_audio (base64 helper), playback_audio (play assistant audio on default output device), send_audio_from_queue (send buffered mic audio to input_audio_buffer), and stream_microphone_audio (capture PCM16 from mic and feed the queue).

07
Step 7: Extract and Compare Transcripts

Generate two transcripts for each user turn: Realtime model transcript from out-of-band response.create call, and built-in ASR transcript from standard transcription model. Align and display both clearly in terminal output for comparison.

08
Step 8: Listen for Realtime Events

Implement listen_for_events function to drive the session: watch for speech_started/speech_stopped/committed events, send out-of-band transcription request when user turn finishes, calculate token usage and cost for both transcription methods, stream assistant audio to playback queue, and buffer text deltas per response_id.

09
Step 9: Run Script

Execute the code to compare the Realtime model's transcripts with the transcription model's. The script loads configuration and prompts, establishes a WebSocket connection, starts concurrent tasks (listen_for_events, stream_microphone_audio, playback_audio), mutes the mic while the assistant speaks, and prints both transcripts once they are returned. Run until interrupted.

Overview

Transcribing User Audio with a Separate Realtime Request

What it does

**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio `out-of-band` using the same websocket session connection, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).

Source README

We call this out-of-band transcription using the Realtime model. It’s simply a second response.create request on the same Realtime WebSocket, tagged so it doesn’t write back to the active conversation state. The model runs again with a different set of instructions (a transcription prompt), triggering a new inference pass that’s separate from the assistant’s main speech turn.

This notebook covers how to build a server-to-server client that:

  • Streams microphone audio to an OpenAI Realtime voice agent.
  • Plays back the agent's spoken replies.
  • After each user turn, generates a high-quality text-only transcript using the same Realtime model.

This is achieved via a secondary response.create request:

{
    "type": "response.create",
    "response": {
        "conversation": "none",
        "output_modalities": ["text"],
        "instructions": transcription_instructions
    }
}

Using the Realtime model itself for transcription offers:

  • Context-aware transcription: Uses the full session context to improve transcript accuracy.
  • Non-intrusive: Runs outside the live conversation, so the transcript is never added back to session state.
  • Customizable instructions: Allows tailoring transcription prompts to specific use cases. The Realtime model follows instructions better than the transcription model.

1. Why use out-of-band transcription?

The Realtime API offers built-in user input transcription, but this relies on a separate ASR model (e.g., gpt-4o-transcribe). Using different models for transcription and response generation can lead to discrepancies. For example:

  • User speech transcribed as: I had otoo accident
  • Realtime response interpreted correctly as: Got it, you had an auto accident

Accurate transcriptions can be very important, particularly when:

  • Transcripts trigger downstream actions (e.g., tool calls), where errors propagate through the system.
  • Transcripts are summarized or passed to other components, risking context pollution.
  • Transcripts are displayed to end users, leading to poor user experiences if errors occur.

The potential advantages of using out-of-band transcription include:

  • Reduced Mismatch: The same model is used for both transcription and generation, minimizing inconsistencies between what the user says and how the agent responds.
  • Greater Steerability: The Realtime model is more steerable, can better follow custom instructions for higher transcription quality, and is not limited by a 1024-token input maximum.
  • Session Context Awareness: The model has access to the full session context, so, for example, if you mention your name multiple times, it will transcribe it correctly.

In terms of trade-offs:

  • Realtime Model (for transcription):

    • Audio Input → Text Output: $32.00 per 1M audio tokens + $16.00 per 1M text tokens out.

    • Cached Session Context: $0.40 per 1M cached context tokens.

    • Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00

  • GPT-4o Transcription:

    • Audio Input: $6.00 per 1M audio tokens

    • Text Input: $2.50 per 1M tokens.

    • Text Output: $10.00 per 1M tokens

    • Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00

  • Direct Cost Comparison (see the examples at the end of the cookbook):

    • Using full session context: 16-22x (if dedicated transcription costs $0.001/session, Realtime transcription costs about $0.016/session)

      • The cost is higher because every request passes the growing session context. However, that context can help transcription quality.
    • Using only the latest user turn: 3-5x (if dedicated transcription costs $0.001/session, Realtime transcription costs about $0.003/session)

      • The cost is lower because only the latest user audio turn is transcribed. However, the session context is no longer available to improve transcription quality.
    • Using the last N turns (1 < N < full context): between 3x and 20x more expensive, depending on how many turns you keep in context

    • Note: These cost estimates are specific to the examples covered in this cookbook. Actual costs may vary depending on factors such as session length, how often context is cached, the ratio of audio to text input, and the details of your particular use case.

  • Other Considerations:

    • Implementing transcription via the Realtime model might be slightly more complex compared to using the built-in GPT-4o transcription option through the Realtime API.
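The per-token totals in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the rates are copied from this section, while the function and dictionary names are illustrative and not part of the original notebook.

```python
# Prices in USD per 1M tokens, copied from the trade-off list above.
REALTIME_RATES = {"audio_in": 32.00, "text_out": 16.00, "cached_in": 0.40}
GPT4O_TRANSCRIBE_RATES = {"audio_in": 6.00, "text_in": 2.50, "text_out": 10.00}


def cost_usd(tokens: dict[str, int], rates: dict[str, float]) -> float:
    """Sum each token count multiplied by its per-1M-token rate."""
    return sum(count * rates[kind] / 1_000_000 for kind, count in tokens.items())


# The scenario used in the list: 1M audio tokens in and 1M text tokens out.
one_million_each = {"audio_in": 1_000_000, "text_out": 1_000_000}
realtime_total = cost_usd(one_million_each, REALTIME_RATES)
transcribe_total = cost_usd(one_million_each, GPT4O_TRANSCRIBE_RATES)

print(f"Realtime: ${realtime_total:.2f}, GPT-4o transcription: ${transcribe_total:.2f}")
```

Real sessions mix cached and uncached context, so a per-session estimate would pass the token counts reported in each response's usage rather than round numbers.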

Note: Out-of-band responses using the Realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.

2. Requirements & Setup

Ensure your environment meets these requirements:

  1. Python 3.10 or later

  2. PortAudio (required by sounddevice):

    • macOS:
      brew install portaudio
      
  3. Python Dependencies:

    pip install sounddevice websockets
    
  4. OpenAI API Key (with Realtime API access):
    Set your key as an environment variable:

    export OPENAI_API_KEY=sk-...
    

3. Prompts

We use two distinct prompts:

  1. Voice Agent Prompt (REALTIME_MODEL_PROMPT): An example prompt used with the Realtime model for the speech-to-speech interactions.
  2. Transcription Prompt (REALTIME_MODEL_TRANSCRIPTION_PROMPT): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate on transcription quality.

For the REALTIME_MODEL_TRANSCRIPTION_PROMPT, you can start from this base prompt, but the goal is to iterate on it until it fits your use case. Remember to remove the Policy Number formatting rules, since they might not apply to your use case.

4. Core configuration

We define:

  • Imports
  • Audio and model defaults
  • Constants for transcription event handling
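As a rough illustration of what such a configuration could look like, the sketch below groups audio defaults and event constants. The class name, field names, and the 100 ms chunk size are assumptions for illustration; the 24 kHz mono PCM16 format is assumed here as the Realtime audio format, and the two event names are the ones this document refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioConfig:
    """Hypothetical audio defaults for mic capture and playback."""

    sample_rate_hz: int = 24_000  # assumed PCM16 sample rate
    channels: int = 1             # mono
    bytes_per_sample: int = 2     # PCM16 means 16-bit samples
    chunk_ms: int = 100           # illustrative chunk duration

    @property
    def chunk_bytes(self) -> int:
        """Size in bytes of one audio chunk at the configured duration."""
        frames = self.sample_rate_hz * self.chunk_ms // 1000
        return frames * self.channels * self.bytes_per_sample


# Event types named in this document that mark the end of a user turn.
COMMITTED_EVENT = "input_audio_buffer.committed"
ITEM_ADDED_EVENT = "conversation.item.added"

config = AudioConfig()
print(config.chunk_bytes)
```

Keeping these values in one frozen object makes it harder for the capture and playback code to drift apart on sample rate or chunk size.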

5. Building the Realtime session & the out‑of‑band request

The Realtime session (session.update) configures:

  • Audio input/output
  • Server‑side VAD
  • Built‑in transcription (input_audio_transcription_model)
    • We set this so that we can compare it to the Realtime model transcription

The out‑of‑band transcription is a response.create triggered after the user input audio is committed (input_audio_buffer.committed).

Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024 token maximum for prompts.

6. Audio streaming: mic → Realtime → speakers

We now define:

  • encode_audio - base64 helper
  • playback_audio - play assistant audio on the default output device
  • send_audio_from_queue - send buffered mic audio to input_audio_buffer
  • stream_microphone_audio - capture PCM16 from the mic and feed the queue
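The first helper in the list is small enough to sketch directly. The version below is an assumption about its shape, not the notebook's exact code: encode_audio base64-encodes raw PCM16 bytes, and append_event (a hypothetical name) wraps the result in an input_audio_buffer.append message.

```python
import base64
import json


def encode_audio(pcm16_bytes: bytes) -> str:
    """Base64-encode raw PCM16 bytes so they can travel inside JSON."""
    return base64.b64encode(pcm16_bytes).decode("ascii")


def append_event(pcm16_bytes: bytes) -> str:
    """Wrap one audio chunk in an input_audio_buffer.append message."""
    return json.dumps(
        {"type": "input_audio_buffer.append", "audio": encode_audio(pcm16_bytes)}
    )


# Two silent 16-bit samples as a stand-in for microphone data.
chunk = b"\x00\x00\x00\x00"
print(append_event(chunk))
```

In the full client, send_audio_from_queue would pull chunks like this from the queue that stream_microphone_audio fills and send one message per chunk.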

7. Extracting and comparing transcripts

For each user turn, we generate two transcripts:

  • Realtime model transcript: from our out-of-band response.create call.
  • Built-in ASR transcript: from the standard transcription model (input_audio_transcription_model).

We align and display both clearly in the terminal:

=== User Turn (Realtime Transcript) ===
...

=== User Turn (Built-in ASR Transcript) ===
...
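A formatter that produces this layout could look like the following sketch. The function name is hypothetical; the two headers are copied from the layout above, and the sample strings reuse the example from section 1.

```python
def format_turn(realtime_text: str, asr_text: str) -> str:
    """Render both transcripts for one user turn in the layout shown above."""
    return (
        "=== User Turn (Realtime Transcript) ===\n"
        f"{realtime_text.strip()}\n"
        "\n"
        "=== User Turn (Built-in ASR Transcript) ===\n"
        f"{asr_text.strip()}\n"
    )


print(format_turn("I had an auto accident.", "I had otoo accident"))
```

Printing the two transcripts as one block keeps them together in the terminal even when other log lines arrive between events.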

8. Listening for Realtime events

listen_for_events drives the session:

  • Watches for speech_started / speech_stopped / committed
  • Sends the out‑of‑band transcription request when a user turn finishes (input_audio_buffer.committed) if only_last_user_turn == False
  • Sends the out‑of‑band transcription request when a user turn is added to the conversation (conversation.item.added) if only_last_user_turn == True
  • Calculates token usage and cost for both transcription methods
  • Streams assistant audio to the playback queue
  • Buffers text deltas per response_id

9. Run Script

In this step, we run the code so that we can compare the Realtime model's transcripts with the transcription model's. The code does the following:

  • Loads configuration and prompts
  • Establishes a WebSocket connection
  • Starts concurrent tasks:
    • listen_for_events (handle incoming messages)
    • stream_microphone_audio (send microphone audio)
      • Mutes the mic while the assistant is speaking
    • playback_audio (play assistant responses)
  • Prints the Realtime and transcription-model transcripts once both are returned, using shared_state to ensure both have arrived before printing
  • Runs the session until you interrupt it
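The "wait until both transcripts have arrived" behaviour can be illustrated without any network code. The sketch below is a simplified stand-in for the notebook's shared_state: the class and function names are hypothetical, and two fake coroutines with short delays play the role of the Realtime and built-in transcription events.

```python
import asyncio


class TurnTranscripts:
    """Hold both transcripts for one turn and release them together."""

    def __init__(self) -> None:
        self._texts: dict[str, str] = {}
        self._both_ready = asyncio.Event()

    def set(self, source: str, text: str) -> None:
        """Record one transcript; signal once both sources have reported."""
        self._texts[source] = text
        if self._texts.keys() >= {"realtime", "asr"}:
            self._both_ready.set()

    async def wait_for_both(self) -> tuple[str, str]:
        await self._both_ready.wait()
        return self._texts["realtime"], self._texts["asr"]


async def demo() -> tuple[str, str]:
    state = TurnTranscripts()

    async def fake_source(source: str, text: str, delay: float) -> None:
        # Stand-in for a transcript arriving over the WebSocket.
        await asyncio.sleep(delay)
        state.set(source, text)

    results = await asyncio.gather(
        state.wait_for_both(),
        fake_source("realtime", "Hello.", 0.02),
        fake_source("asr", "Hello", 0.01),
    )
    return results[0]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The same pattern extends to the real client by calling set from the event listener whenever either transcript completes, then resetting the state for the next turn.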

The script's output should look like:

[client] Speech detected; streaming...
[client] Detected silence; preparing transcript...

=== User turn (Realtime transcript) ===
Hello.

=== User turn (Transcription model) ===
Hello


=== Assistant response ===
Hello, and thank you for calling. Let's start with your full name, please.

From the above example, we can notice:

  • The Realtime model's transcription quality matches or surpasses that of the transcription model across turns. In one of the turns, the transcription model misses "this is important." while the Realtime transcription captures it.
  • The Realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).
  • With context from the entire session, including previous turns where I spelled out my name, the Realtime model accurately transcribes my name when the assistant asks for it again, while the transcription model makes errors (e.g., "Minhaj ul Haq").

Example with Cost Calculations

There are significant price differences between the available methods for transcribing user audio. GPT-4o-Transcribe is by far the most cost-effective approach: it charges only for the raw audio input and a small amount of text output, resulting in transcripts that cost just fractions of a cent per turn. In contrast, using the Realtime model for out-of-band transcription is more expensive. If you transcribe only the latest user turn with Realtime, it typically costs about 3-5× more than GPT-4o-Transcribe. If you include the full session context in each transcription request, the cost can increase to about 16-20× higher. This is because each request to the Realtime model processes the entire session context again at higher pricing, and the cost grows as the conversation gets longer.

Cost for Transcribing with Full Session Context

Let's walk through an example that uses full session context for realtime out-of-band transcription:

Transcription Cost Comparison

Costs Summary

  • Realtime Out-of-Band (OOB): $0.040974 total (~$0.006829 per turn)
  • Dedicated Transcription: $0.002114 total (~$0.000352 per turn)
  • OOB is ~19× more expensive using full session context

Considerations

  • Caching: Because these conversations are short, you benefit little from caching beyond the initial system prompt.
  • Transcription System Prompt: The transcription model uses a minimal system prompt, so input costs would typically be higher.

Recommended Cost-Saving Strategy

  • Limit transcription to recent turns: Minimizing audio/text context significantly reduces OOB transcription costs.

Understanding Cache Behavior

  • Effective caching requires stable prompt instructions (usually 1,024+ tokens).
  • Different instruction prompts between OOB and main assistant sessions result in separate caches.

Cost for Transcribing Only the Latest Turn

You can limit transcription to only the latest user turn by supplying input item_references like this:

    if item_ids:
        response["input"] = [
            {"type": "item_reference", "id": item_id} for item_id in item_ids
        ]

    return {
        "type": "response.create",
        "response": response,
    }

Transcribing just the most recent user turn lowers costs by restricting the session context sent to the model. However, this approach has trade-offs: the model won’t have access to previous conversation history to help resolve ambiguities or correct errors (for example, accurately recalling a username mentioned earlier). Additionally, because you’re always updating which input is referenced, little caching benefit is realized: the cache prefix changes each turn, so you don’t accumulate reusable context.
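To make the intermediate "last N turns" option concrete, here is a small sketch that scopes an existing response.create payload to a subset of conversation items. The item_reference shape comes from the fragment above; the helper names, the placeholder instructions, and the example item IDs are hypothetical.

```python
import copy


def last_n(item_ids: list[str], n: int) -> list[str]:
    """Keep only the n most recent item IDs (an empty list when n <= 0)."""
    return item_ids[-n:] if n > 0 else []


def with_item_references(request: dict, item_ids: list[str]) -> dict:
    """Return a copy of a response.create request limited to the given items."""
    scoped = copy.deepcopy(request)
    if item_ids:
        scoped["response"]["input"] = [
            {"type": "item_reference", "id": item_id} for item_id in item_ids
        ]
    return scoped


base_request = {
    "type": "response.create",
    "response": {
        "conversation": "none",
        "output_modalities": ["text"],
        "instructions": "Transcribe the user's latest turn verbatim.",  # placeholder
    },
}

# Hypothetical item IDs collected from conversation events.
seen_items = ["item_1", "item_2", "item_3"]
print(with_item_references(base_request, last_n(seen_items, 1)))
```

With an empty ID list the request is returned unchanged, which corresponds to the full-context behaviour described earlier.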

Now, let’s look at a second example that uses only the most recent user audio turn for realtime out-of-band transcription:

Cost Analysis Summary

Realtime Out-of-Band Transcription (OOB)

  • Total Cost: $0.013354
  • Average per Turn: ~$0.001908

Dedicated Transcription Model

  • Total Cost: $0.002630
  • Average per Turn: ~$0.000376

Difference in Costs

  • Additional cost using OOB: +$0.010724
  • Cost Multiplier: OOB is about 5× more expensive than the dedicated transcription model.

This approach costs significantly less than using the full session context. You should evaluate your use case to decide whether regular transcription, out-of-band transcription with full context, or transcribing only the latest turn best fits your needs. You can also choose an intermediate strategy, such as including just the last N turns in the input.
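The figures in both cost examples can be cross-checked with simple arithmetic. In this sketch the totals are copied from the two summaries; the turn counts (6 and 7) are not stated in the text and are inferred here from each total divided by its per-turn average, so treat them as assumptions.

```python
def compare(oob_total: float, dedicated_total: float, turns: int) -> dict[str, float]:
    """Derive per-turn costs, the extra cost, and the multiplier for a session."""
    return {
        "oob_per_turn": oob_total / turns,
        "dedicated_per_turn": dedicated_total / turns,
        "extra_cost": oob_total - dedicated_total,
        "multiplier": oob_total / dedicated_total,
    }


# Totals from the summaries above; turn counts are inferred, not stated.
full_context = compare(0.040974, 0.002114, turns=6)
latest_turn = compare(0.013354, 0.002630, turns=7)

print(f"Full context: {full_context['multiplier']:.1f}x")
print(f"Latest turn only: {latest_turn['multiplier']:.1f}x")
```

Running the same comparison on your own session logs is a quick way to see where a "last N turns" setting falls between these two multipliers.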

Conclusion

Exploring out-of-band transcription could be beneficial for your use case if:

  • You're still experiencing unreliable transcriptions, even after optimizing the transcription model prompt.
  • You need a more reliable and steerable method for generating transcriptions.
  • The current transcripts fail to normalize entities correctly, causing downstream issues.

Keep in mind the trade-offs:

  • Cost: Out-of-band (OOB) transcription is more expensive. Be sure that the extra expense makes sense for your typical session lengths and business needs.
  • Complexity: Implementing OOB transcription takes extra engineering effort to connect all the pieces correctly. Only choose this approach if its benefits are important for your use case.

If you decide to pursue this method, make sure you:

  • Set up the transcription trigger correctly, ensuring it activates after the audio commit.
  • Carefully iterate and refine the prompt to align closely with your specific use case and needs.
