What AI model does this skill use for podcast generation?

This skill uses Azure OpenAI's `gpt-realtime-mini` model accessed via the Realtime API to generate spoken audio narratives from text content.

What audio format and quality does this skill output?

The skill outputs base64-encoded WAV audio at 24kHz sample rate, 16-bit depth, and mono channel configuration.

What environment variables are required to configure this skill?

Three environment variables are required: an API key, an endpoint (the base Cognitive Services URL without `/openai/v1/` appended), and a deployment name.

What does the skill return as output?

The skill returns base64-encoded WAV audio plus a transcript, both generated by streaming the Realtime API's audio and transcript events until the response completes.

Skill

Generate Podcast Audio from Text

Name: Podcast Generation with GPT Realtime Mini
Availability: OnlineOnly
Author: Antigravity

A skill for generating real narrated audio from text using Azure OpenAI's Realtime API and the GPT Realtime Mini model.

Works with Azure OpenAI

Updated today

Version 15.7.0

Models

gpt-realtime-mini

Why it matters

Leverage Azure OpenAI's Realtime API to convert text content into natural-sounding audio narratives, complete with transcripts. This asset handles the full pipeline from text input to playable audio.

Outcomes

What it gets done

Configure Azure OpenAI Realtime API credentials and endpoint.

Stream audio chunks and transcript data via WebSocket.

Convert raw PCM audio to WAV format.

Provide base64-encoded audio for frontend playback.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/ag-podcast-generation | bash

Overview

Podcast Generation with GPT Realtime Mini

Generates narrated audio from text using Azure OpenAI's Realtime API, with WebSocket streaming, PCM-to-WAV conversion, and six selectable narrator voices. It requires the base Cognitive Services endpoint, without an /openai/v1/ suffix.

What it does

This skill generates real audio narratives from text content using Azure OpenAI's Realtime API and the gpt-realtime-mini model. The workflow connects via WebSocket to the Azure OpenAI Realtime endpoint, sends a text prompt, collects streaming PCM audio chunks and a transcript, converts the PCM stream to WAV format, and returns base64-encoded audio to the frontend for playback.

On the backend, the WebSocket session is first configured for audio-only output with narrator instructions. A text message is then submitted as a conversation item and a response is requested. Streaming events are collected as they arrive: response.output_audio.delta carries base64-encoded audio chunks, response.output_audio_transcript.delta carries transcript text, and response.done signals completion so the collection loop can exit. Error events are reported through event.error.message. The collected PCM chunks are then joined and converted to WAV at a 24kHz sample rate using a helper script.
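The event handling described above can be sketched as a pure function that folds a list of already-parsed server events into PCM bytes and a transcript. This is a minimal illustration for Node.js, not the skill's actual backend code: the helper name collectResponse is hypothetical, and the assumptions that each delta event carries its payload in a delta field and that error events have type "error" should be checked against the Realtime API reference.

```javascript
// Sketch only: collectResponse is a hypothetical helper, not part of the skill.
// Assumes each event is an already-parsed JSON object from the WebSocket.
function collectResponse(events) {
  const audioChunks = [];
  let transcript = '';

  for (const event of events) {
    switch (event.type) {
      case 'response.output_audio.delta':
        // Assumed: `delta` holds a base64 string of raw PCM bytes.
        audioChunks.push(Buffer.from(event.delta, 'base64'));
        break;
      case 'response.output_audio_transcript.delta':
        // Assumed: `delta` holds the next fragment of transcript text.
        transcript += event.delta;
        break;
      case 'error':
        // Assumed event type; the message path follows the description above.
        throw new Error(event.error.message);
      case 'response.done':
        // Completion: stop reading and return what was collected.
        return { pcm: Buffer.concat(audioChunks), transcript, done: true };
      default:
        break; // Ignore event types this sketch does not need.
    }
  }
  // The stream ended without response.done; the caller decides how to react.
  return { pcm: Buffer.concat(audioChunks), transcript, done: false };
}
```

In a live connection the same switch would sit inside the WebSocket message handler. Keeping it as a function over an array makes the logic easy to test without a network.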

On the frontend, the base64 WAV payload is converted into a playable blob and played directly:

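// Decode a base64 string into a Blob of the given MIME type.
const base64ToBlob = (base64, mimeType) => {
  const bytes = atob(base64); // binary string, one character per byte
  const arr = new Uint8Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) arr[i] = bytes.charCodeAt(i);
  return new Blob([arr], { type: mimeType });
};

// response.audio_data holds the base64-encoded WAV returned by the backend.
const audioBlob = base64ToBlob(response.audio_data, 'audio/wav');
const audioUrl = URL.createObjectURL(audioBlob);
const audio = new Audio(audioUrl);
// Release the object URL once playback ends so the blob can be freed.
audio.addEventListener('ended', () => URL.revokeObjectURL(audioUrl));
// play() returns a promise that rejects if the browser blocks autoplay.
audio.play().catch(console.error);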

Six voice options are available - alloy (neutral), echo (warm), fable (expressive), onyx (deep), nova (friendly), and shimmer (clear) - covering a range of narrator tones. The Realtime API streams raw PCM at 24kHz, 16-bit, mono; the skill wraps that stream in a WAV container and transmits it as base64.
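To make the PCM-to-WAV step concrete, here is a minimal Node.js sketch that prepends a standard 44-byte RIFF/WAVE header to raw PCM. It is not the skill's helper script, which is not shown here; the function name pcmToWav is hypothetical, and the defaults simply mirror the 24kHz, 16-bit, mono format stated above.

```javascript
// Sketch only: pcmToWav is a hypothetical name, not the skill's helper script.
// Wraps raw little-endian PCM samples in a canonical 44-byte WAV header.
function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const header = Buffer.alloc(44);

  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16);             // fmt chunk size for plain PCM
  header.writeUInt16LE(1, 20);              // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(pcm.length, 40);     // size of the sample data

  return Buffer.concat([header, pcm]);
}

// Usage: base64-encode the result for transport to the frontend.
const oneSecondOfSilence = Buffer.alloc(24000 * 2); // 24000 frames x 2 bytes
const audioData = pcmToWav(oneSecondOfSilence).toString('base64');
```

At these settings one second of audio is 48,000 bytes of PCM, so payload size before base64 encoding grows by roughly 48 kB per second of narration.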

When to use - and when NOT to

Use it when you need to generate spoken-audio narration from text content - podcast-style narration, read-alouds, or voice output for an application - using Azure OpenAI's Realtime API rather than a separate text-to-speech service. Configuration requires three environment variables (API key, endpoint, and deployment name), and the endpoint must be the base Cognitive Services URL without an /openai/v1/ suffix appended.
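The configuration requirement lends itself to a small startup check. The sketch below, written for Node.js, reads the three values and rejects an endpoint that already ends in /openai/v1/. The environment variable names and the function name loadRealtimeConfig are assumptions made for illustration; the listing does not state the names the skill actually reads.

```javascript
// Sketch only: the variable names below are hypothetical placeholders.
function loadRealtimeConfig(env) {
  const required = {
    apiKey: 'AZURE_OPENAI_API_KEY',        // assumed name
    endpoint: 'AZURE_OPENAI_ENDPOINT',     // assumed name
    deployment: 'AZURE_OPENAI_DEPLOYMENT', // assumed name
  };

  const config = {};
  for (const [field, name] of Object.entries(required)) {
    const value = (env[name] || '').trim();
    if (!value) throw new Error(`Missing environment variable: ${name}`);
    config[field] = value;
  }

  // The endpoint must be the base resource URL, so reject the common mistake
  // of pasting a URL that already includes the /openai/v1/ path.
  const url = new URL(config.endpoint);
  if (/\/openai\/v1\/?$/.test(url.pathname)) {
    throw new Error('Endpoint must be the base URL, without /openai/v1/');
  }

  // Normalize by dropping any trailing slash.
  config.endpoint = url.origin + url.pathname.replace(/\/+$/, '');
  return config;
}
```

Calling loadRealtimeConfig(process.env) once at startup surfaces a misconfigured endpoint immediately, rather than as a failed WebSocket handshake later.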

Inputs and outputs

Input is a text prompt to narrate. Output is base64-encoded WAV audio (24kHz/16-bit/mono) plus a transcript, generated by streaming the Realtime API's audio-delta and transcript-delta events until the response completes. The session is explicitly configured for audio-only output modality with narrator instructions before the text is submitted as a conversation item. Further reference material covers the complete stack design and production code patterns, plus a dedicated script for PCM-to-WAV conversion.
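The order of client messages described above (configure the session, submit the text, request a response) might look like the following Node.js sketch. The helper name buildNarrationEvents and the instruction text are hypothetical, and the payload field names (output_modalities, audio.output.voice, input_text) are assumptions about the Realtime API event shape that should be verified against the current API reference before use.

```javascript
// Sketch only: field names are assumptions to verify against the API reference.
// The voice list is taken from the six options named in this listing.
const NARRATOR_VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer'];

function buildNarrationEvents(text, voice = 'alloy') {
  if (!NARRATOR_VOICES.includes(voice)) {
    throw new Error(`Unknown voice: ${voice}`);
  }
  if (!text || !text.trim()) {
    throw new Error('Nothing to narrate');
  }

  return [
    {
      // 1. Configure the session for audio-only output with narrator instructions.
      type: 'session.update',
      session: {
        type: 'realtime',
        output_modalities: ['audio'],
        instructions: 'Narrate the provided text clearly.', // hypothetical wording
        audio: { output: { voice } },
      },
    },
    {
      // 2. Submit the text to narrate as a user conversation item.
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text }],
      },
    },
    // 3. Ask the model to generate the spoken response.
    { type: 'response.create' },
  ];
}
```

Each object would be serialized with JSON.stringify and sent over the open WebSocket in this order, after which the audio-delta and transcript-delta events begin to arrive.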

Who it's for

Developers building applications that need narrated audio output from text - podcast generation, voice narration features, or accessibility read-aloud - using Azure OpenAI's Realtime API. Full architecture guidance and production-ready code patterns are provided as separate reference material for teams taking this beyond a proof of concept.
