Generate Subtitles from Audio and Transcripts
Prompt chain that generates time-aligned subtitles in SRT or VTT format from audio files and transcripts using the ElevenLabs forced alignment API.
Why it matters
Automatically generate time-aligned subtitles (SRT/VTT) from audio files and their corresponding transcripts. This asset leverages ElevenLabs' forced alignment capabilities to precisely synchronize spoken words with timestamps.
Outcomes
What it gets done
Process audio and transcript input.
Perform forced alignment using ElevenLabs.
Generate SRT and VTT subtitle files.
Ensure accurate time synchronization for subtitles.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/pfoo-alignment | bash
Capabilities
What this chain does
Converts audio or video speech to written text.
Pulls structured data fields from unstructured text.
Condenses long documents or threads into key takeaways.
Overview
Alignment
What it does
This prompt chain automates subtitle generation by passing an audio file and its transcript through the ElevenLabs forced alignment API. It outputs time-synchronized subtitle files in either SRT or VTT format, with word-level timing precision. The workflow handles the alignment step that matches transcript text to exact audio timestamps.
How it connects
Use this when you need to add subtitles to videos, podcasts, or other audio content and already have transcripts that need precise timing. It's ideal for content creators, accessibility teams, and media producers who want to automate the time-consuming process of manually syncing captions to audio without sacrificing accuracy.
Source README
provider-elevenlabs/alignment (ElevenLabs Forced Alignment)
Generate time-aligned subtitles (SRT/VTT) from audio and transcripts using ElevenLabs forced alignment.
Quick Start
npx promptfoo@latest init --example provider-elevenlabs/alignment
cd provider-elevenlabs/alignment
export ELEVENLABS_API_KEY=your_api_key_here
npx promptfoo@latest eval
What this tests
- Subtitle generation: Create SRT and VTT subtitle files
- Word-level alignment: Precise timestamp data for each word
- Multiple formats: JSON (raw data), SRT (video players), VTT (web players)
- Accuracy: Verify alignment matches audio timing
How it works
Forced alignment takes two inputs:
- Audio file: Speech recording (MP3, WAV, etc.)
- Transcript: Text of what was spoken
It returns precise timestamps showing when each word was spoken, formatted as subtitles.
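The step from word timestamps to subtitles can be sketched as a greedy grouping pass. This is a minimal illustration, not the provider's actual chunking logic: the `Word` and `Cue` types, the `group_words` helper, and the `max_chars` / `max_duration` limits are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Cue:
    start: float
    end: float
    text: str


def _make_cue(words):
    # A cue spans from the first word's start to the last word's end.
    return Cue(words[0].start, words[-1].end, " ".join(w.text for w in words))


def group_words(words, max_chars=42, max_duration=5.0):
    """Greedily pack consecutive words into cues (hypothetical limits)."""
    cues = []
    current = []
    for word in words:
        if current:
            candidate = " ".join(w.text for w in current) + " " + word.text
            too_long = len(candidate) > max_chars
            too_slow = word.end - current[0].start > max_duration
            if too_long or too_slow:
                # Close the current cue before it exceeds either limit.
                cues.append(_make_cue(current))
                current = []
        current.append(word)
    if current:
        cues.append(_make_cue(current))
    return cues
```

Each resulting cue carries the start of its first word and the end of its last word, which is the timing a subtitle line needs.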
Use Cases
- Video subtitles: Generate SRT files for video editing software
- Web captions: Create VTT files for HTML5 video players
- Karaoke apps: Word-level timing for synchronized highlighting
- Accessibility: Auto-generate captions for spoken content
- Translation sync: Time-align translations to original audio
Output Formats
JSON (Raw alignment data)
{
  "alignment": [
    { "char": "T", "start": 0.0, "end": 0.1 },
    { "char": "h", "start": 0.1, "end": 0.15 }
  ],
  "characters": "That's one small step..."
}
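For custom processing, character-level entries like those above can be merged into word-level timings. The sketch below assumes the `alignment` list has exactly the shape shown in the sample (`char`, `start`, `end`) and that whitespace characters separate words; `chars_to_words` is a hypothetical helper, not part of the provider.

```python
def chars_to_words(alignment):
    """Merge per-character timings into per-word timings, splitting on whitespace."""
    words = []
    current = None
    for entry in alignment:
        ch = entry["char"]
        if ch.isspace():
            # Whitespace closes the word in progress, if any.
            if current is not None:
                words.append(current)
                current = None
            continue
        if current is None:
            current = {"text": ch, "start": entry["start"], "end": entry["end"]}
        else:
            current["text"] += ch
            current["end"] = entry["end"]
    if current is not None:
        words.append(current)
    return words
```

A word's start is taken from its first character and its end from its last, so punctuation attached to a word stays inside that word's time span.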
SRT (Standard video subtitles)
1
00:00:00,000 --> 00:00:02,500
That's one small step for man

2
00:00:02,500 --> 00:00:05,000
one giant leap for mankind
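The SRT layout above (a cue number, a timing line with a comma before the milliseconds, the text, then a blank line) can be produced from timed cues as follows. This is a sketch only: `srt_timestamp` and `to_srt` are hypothetical helpers, and cues are assumed to be `(start_seconds, end_seconds, text)` tuples.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues):
    """Render (start, end, text) tuples as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

Working in integer milliseconds before splitting into fields avoids rounding a value such as 59.9996 seconds into an invalid `59,1000`.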
VTT (WebVTT for web players)
WEBVTT

1
00:00:00.000 --> 00:00:02.500
That's one small step for man

2
00:00:02.500 --> 00:00:05.000
one giant leap for mankind
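Comparing the two samples, the VTT output differs from SRT in two ways: it starts with a `WEBVTT` header line, and its timestamps use a dot instead of a comma before the milliseconds. A minimal conversion sketch, assuming well-formed SRT input like the sample above (`srt_to_vtt` is a hypothetical helper; the provider emits VTT directly and does not need it):

```python
import re

# Matches an SRT timestamp such as 00:00:02,500.
_SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert SRT text to WebVTT: add the header and switch ',' to '.' in timing lines."""
    body_lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only timing lines are rewritten; commas in cue text are preserved.
            line = _SRT_TIMESTAMP.sub(r"\1.\2", line)
        body_lines.append(line)
    return "WEBVTT\n\n" + "\n".join(body_lines) + "\n"
```

The cue numbers are kept as they are, matching the VTT sample above.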
Configuration
Basic alignment (JSON output)
providers:
  - id: elevenlabs:alignment:json
    label: Alignment (JSON)
tests:
  - vars:
      audioFile: path/to/audio.mp3
      transcript: 'Your transcript text here'
      format: json
SRT subtitles
providers:
  - id: elevenlabs:alignment:srt
    label: Alignment (SRT Subtitles)
tests:
  - vars:
      audioFile: path/to/audio.mp3
      transcript: 'Your transcript text here'
      format: srt
VTT subtitles
providers:
  - id: elevenlabs:alignment:vtt
    label: Alignment (VTT Subtitles)
tests:
  - vars:
      audioFile: path/to/audio.mp3
      transcript: 'Your transcript text here'
      format: vtt
Testing Assertions
tests:
  # Verify alignment succeeds
  - assert:
      - type: javascript
        value: output.includes('words') # JSON format
      - type: not-contains
        value: error
  # Verify SRT format
  - assert:
      - type: javascript
        value: output.includes('-->') && output.includes('small step')
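The `includes('-->')` check above only confirms that a timing arrow is present. A stricter check might also verify that every timing line is well formed, that each cue ends after it starts, and that cues appear in order. The sketch below is a standalone function written under those assumptions; `check_srt` is a hypothetical name, and how it would be wired into an assertion (for example through promptfoo's Python assertion support) is not covered by the source README.

```python
import re

# One full SRT timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
_TIMING_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def _to_ms(hours, minutes, seconds, millis):
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def check_srt(output):
    """Return True if output has at least one timing line and all cues are ordered."""
    previous_start = -1
    found = False
    for line in output.splitlines():
        match = _TIMING_LINE.match(line.strip())
        if not match:
            continue  # cue numbers, text, and blank lines are ignored
        found = True
        fields = match.groups()
        start = _to_ms(*fields[:4])
        end = _to_ms(*fields[4:])
        if end <= start or start < previous_start:
            return False
        previous_start = start
    return found
```

Checks like this tend to catch malformed or out-of-order output that a substring match would accept.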
Best Practices
- Transcript accuracy: Ensure transcript exactly matches spoken audio
- Include punctuation: Better subtitle chunking and timing
- Audio quality: Clear audio produces more accurate timestamps
- Format selection:
  - Use SRT for video editing (Premiere, Final Cut, DaVinci)
  - Use VTT for web players (HTML5 <video> tag)
  - Use JSON for custom processing
Cost Information
Forced alignment pricing is based on audio duration:
- ~$0.05 per minute of audio
The provider automatically tracks costs in evaluation results.
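For budgeting before a run, the per-minute figure above gives a rough estimate. The sketch below simply multiplies duration by the approximate rate quoted in this section; `estimate_alignment_cost` is a hypothetical helper, and it assumes linear pricing with no per-request minimum or rounding to whole minutes, which the source does not specify.

```python
def estimate_alignment_cost(duration_seconds, rate_per_minute=0.05):
    """Estimate cost in USD from audio duration, assuming linear per-minute pricing."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return round(duration_seconds / 60 * rate_per_minute, 4)


def estimate_batch_cost(durations_seconds, rate_per_minute=0.05):
    """Sum the estimates for several audio files."""
    total = sum(estimate_alignment_cost(d, rate_per_minute) for d in durations_seconds)
    return round(total, 4)
```

Under these assumptions a 90-second clip comes to about $0.075, and a ten-minute recording to about $0.50.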
Related Examples
- ElevenLabs STT - Speech-to-text transcription
- ElevenLabs Isolation - Audio noise removal