Integrate Azure Speech Services for Text and Audio
Azure Speech Tool enables AI agents to transcribe .wav audio files to text and synthesize text into spoken audio using Microsoft Azure AI speech services.
Why it matters
Leverage Microsoft Azure's advanced speech services to enable agents to transcribe audio files into text and generate audio files from text, streamlining content creation and data processing.
Outcomes
What it gets done
Transcribe audio files (.wav) into text using Azure Speech-to-Text.
Synthesize audio from input text using Azure Text-to-Speech.
Integrate speech capabilities into agent workflows for automated tasks.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/li-tool-tools-azure-speech | bash
Capabilities
What this skill does
Converts speech in .wav audio files to written text.
Drafts marketing, email, or product copy on demand.
Condenses long documents or threads into key takeaways.
Overview
Azure Speech Tool
What it does
Azure Speech Tool integrates Microsoft Azure AI speech services into LlamaIndex agents. It provides two functions: `speech_to_text` transcribes .wav audio files into text, and `text_to_speech` synthesizes audio from input strings and plays it on the user's computer. The tool is packaged as a ToolSpec that converts to a tool list for agent workflows.
How it connects
Use this tool when building AI agents that need to process audio input (like transcribing meeting recordings or voice memos) or generate spoken responses. It's ideal for voice-enabled assistants, audio content summarization workflows, or any agent that needs to interact through speech rather than text alone.
Source README
Azure Speech Tool
This tool allows Agents to use Microsoft Azure speech services to transcribe audio files to text, and create audio files from text. To see more and get started, visit https://azure.microsoft.com/en-us/products/ai-services/ai-speech
Usage
This tool has a more extensive usage example documented in a Jupyter notebook. The basic pattern is:
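from llama_index.tools.azure_speech import AzureSpeechToolSpec
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

# Replace the placeholders with your Azure Speech resource key and region.
speech_tool = AzureSpeechToolSpec(speech_key="your-key", region="eastus")

# Expose both speech functions to the agent as tools.
agent = FunctionAgent(
    tools=speech_tool.to_tool_list(),
    llm=OpenAI(model="gpt-4.1"),
)

# Top-level await works in a notebook; in a script, run these calls inside
# an async function started with asyncio.run().
print(await agent.run('Say "hello world"'))
print(
    await agent.run(
        "summarize the data/speech.wav audio file into a few sentences"
    )
)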
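The example above hard-codes the key and region as placeholders. One way to keep credentials out of source code is to read them from the environment. The sketch below is an assumption, not part of the tool: the helper `load_speech_settings` and the variable names `AZURE_SPEECH_KEY` and `AZURE_SPEECH_REGION` are hypothetical and chosen only for illustration.

```python
import os


def load_speech_settings(env=None):
    """Read the speech key and region from environment variables.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    env = os.environ if env is None else env
    required = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # Keys match the keyword arguments used in the usage example.
    return {
        "speech_key": env["AZURE_SPEECH_KEY"],
        "region": env["AZURE_SPEECH_REGION"],
    }
```

Because the returned keys mirror the keyword arguments shown in the usage example, the result could be passed as `AzureSpeechToolSpec(**load_speech_settings())`.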
text_to_speech: Takes an input string and synthesizes audio to play on the user's computer
speech_to_text: Takes a .wav file and transcribes it into text
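Since `speech_to_text` expects a .wav file, it may help to check a file locally before handing its path to the agent. The following is a minimal sketch using Python's standard `wave` module; the helper `describe_wav` is hypothetical and not part of the tool. Passing this check only shows that Python can parse the file as PCM WAV, and it is not a guarantee that the Azure service will accept it.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return basic facts about a WAV file, or raise ValueError if unreadable."""
    path = Path(path)
    if path.suffix.lower() != ".wav":
        raise ValueError(f"expected a .wav file, got {path.name!r}")
    try:
        with wave.open(str(path), "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            return {
                "channels": wav.getnchannels(),
                "sample_rate_hz": rate,
                "duration_s": frames / rate if rate else 0.0,
            }
    except (wave.Error, EOFError) as exc:
        # Raised when the file is empty or is not a PCM WAV container.
        raise ValueError(f"{path.name!r} is not a readable PCM WAV file") from exc
```

A missing file still raises `FileNotFoundError`, which keeps "wrong path" distinct from "wrong format".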
This loader is designed to be used as a way to load data as a Tool in an Agent.
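When the agent is used from a standalone script rather than a notebook, its awaitable `run` calls need an event loop. The sketch below shows one way to structure that with `asyncio.run`. `EchoAgent` is a hypothetical stand-in so the example runs without Azure or OpenAI credentials; a real agent built as in the usage example would take its place.

```python
import asyncio


class EchoAgent:
    """Hypothetical stand-in for an agent; it only mimics an awaitable run()."""

    async def run(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"(stub reply to: {prompt})"


async def run_prompts(agent, prompts):
    """Await agent.run() for each prompt in order and collect the replies."""
    replies = []
    for prompt in prompts:
        replies.append(str(await agent.run(prompt)))
    return replies


def main(agent=None):
    # Fall back to the stub when no real agent is supplied.
    agent = agent or EchoAgent()
    return asyncio.run(run_prompts(agent, ['Say "hello world"']))


if __name__ == "__main__":
    for reply in main():
        print(reply)
```

Running the prompts sequentially keeps the order of spoken output predictable; whether concurrent calls are appropriate depends on the agent and is not addressed here.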