Process Images with Azure Computer Vision
Azure Computer Vision Tool connects AI agents to Azure's Computer Vision API for image captioning, object detection, tagging, and OCR on image URLs.
Why it matters
Agents often need to understand images they can only reach by URL. This tool lets them request captions, tags, detected objects, and OCR text from Azure Computer Vision.
Outcomes
What it gets done
Caption images using Azure Computer Vision.
Extract tags and objects from images.
Perform Optical Character Recognition (OCR) on images.
Integrate Azure Computer Vision into agent workflows.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/li-tool-tools-azure-cv | bash
Capabilities
What this skill does
Labels or categorizes images with tags and detected objects.
Pulls text out of images through OCR.
Generates a short caption describing an image.
Overview
Azure Computer Vision Tool
What it does
Azure Computer Vision Tool connects LlamaIndex agents to Azure's Computer Vision API. It provides a `process_image` function that performs computer vision tasks including image captioning, object classification, tag generation, and optical character recognition (OCR) on image URLs. The tool wraps Azure Cognitive Services as a ToolSpec that agents can invoke through natural language requests.
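To make the wrapping more concrete, here is a minimal sketch of the kind of HTTP request a `process_image`-style call could translate into. The endpoint path, query parameters, header name, API version string, and feature names (`tags`, `objects`, `read`, `caption`) are assumptions modeled on Azure's Image Analysis REST interface, not details confirmed by this page, and `build_analyze_request` is a hypothetical helper rather than part of the tool.

```python
from urllib.parse import urlencode

# Assumed feature names; check the Azure documentation for your API version.
VALID_FEATURES = ("tags", "objects", "read", "caption")


def build_analyze_request(resource, api_key, image_url, features,
                          language="en", api_version="2023-04-01-preview"):
    """Return (endpoint, headers, body) for an image analysis request.

    Hypothetical helper: it only assembles the request and sends nothing.
    """
    unknown = [f for f in features if f not in VALID_FEATURES]
    if unknown:
        raise ValueError(f"unsupported features: {unknown}")
    # Keep commas unescaped so the feature list stays readable in the URL.
    query = urlencode(
        {
            "features": ",".join(features),
            "language": language,
            "api-version": api_version,
        },
        safe=",",
    )
    endpoint = (
        f"https://{resource}.cognitiveservices.azure.com"
        f"/computervision/imageanalysis:analyze?{query}"
    )
    headers = {"Ocp-Apim-Subscription-Key": api_key}
    body = {"url": image_url}
    return endpoint, headers, body
```

Separating request assembly from sending keeps the sketch easy to inspect without an Azure account; a real call would pass the three returned values to an HTTP client.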
How it connects
Use this tool when your AI agent needs to analyze images as part of its workflow: captioning photos, extracting text from screenshots, identifying objects in product images, or generating descriptive tags. It's designed for scenarios where you already use Azure infrastructure and want to add vision capabilities to LlamaIndex agents without building custom API integrations.
Source README
Azure Computer Vision Tool
This tool connects to an Azure account and allows an Agent to perform a variety of computer vision tasks on image URLs.
You will need to set up an API key and a Computer Vision instance in Azure. Learn more here: https://azure.microsoft.com/en-ca/products/cognitive-services/computer-vision
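Once the key and instance exist, one common way to keep them out of source code is to read them from environment variables. The sketch below assumes that approach; the variable names `AZURE_CV_API_KEY` and `AZURE_CV_RESOURCE` and the helper `load_azure_cv_settings` are hypothetical choices for illustration, not names required by Azure or by the tool.

```python
import os


def load_azure_cv_settings(env=None):
    """Read the key and resource name from environment variables.

    AZURE_CV_API_KEY and AZURE_CV_RESOURCE are hypothetical names chosen
    for this sketch; neither Azure nor the tool requires them.
    """
    env = os.environ if env is None else env
    settings = {
        "api_key": env.get("AZURE_CV_API_KEY", "").strip(),
        "resource": env.get("AZURE_CV_RESOURCE", "").strip(),
    }
    # Fail early with a readable message instead of sending an empty key.
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"missing Azure settings: {', '.join(missing)}")
    return settings
```

The returned dictionary uses the same keyword names as the `AzureCVToolSpec` constructor in the usage example, so it could be unpacked as `AzureCVToolSpec(**load_azure_cv_settings())`.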
Usage
This tool has a more extensive example usage documented in a Jupyter notebook here
Here's an example usage of the AzureCVToolSpec.
import asyncio

from llama_index.tools.azure_cv import AzureCVToolSpec
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

# Replace the placeholders with your Azure API key and resource name.
tool_spec = AzureCVToolSpec(api_key="your-key", resource="your-resource")
agent = FunctionAgent(
    tools=tool_spec.to_tool_list(), llm=OpenAI(model="gpt-4.1")
)

async def main():
    # Ask for a caption and tags.
    print(await agent.run(
        "caption this image and tell me what tags are in it https://portal.vision.cognitive.azure.com/dist/assets/ImageCaptioningSample1-bbe41ac5.png"
    ))
    # Ask for a caption and OCR.
    print(await agent.run(
        "caption this image and read any text https://portal.vision.cognitive.azure.com/dist/assets/OCR3-4782f088.jpg"
    ))

# In a notebook, run `await main()` instead.
asyncio.run(main())
process_image: Send an image for computer vision classification of objects, tags, captioning or OCR.
This tool is designed to be loaded as a Tool in an Agent.
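The `process_image` function listed above returns analysis data that an agent then turns into prose. As a rough illustration of that step, the sketch below flattens a response into a small summary. The key names (`captionResult`, `tagsResult`, `objectsResult`, `readResult`) and the sample payload are assumptions modeled on Azure's Image Analysis preview responses, not output captured from this tool, and `summarize_analysis` is a hypothetical helper.

```python
def summarize_analysis(result):
    """Flatten an analysis response into caption, tags, objects, and text.

    The key names used here are assumed, not confirmed by this page.
    """
    summary = {}
    if "captionResult" in result:
        summary["caption"] = result["captionResult"].get("text", "")
    if "tagsResult" in result:
        summary["tags"] = [
            tag["name"] for tag in result["tagsResult"].get("values", [])
        ]
    if "objectsResult" in result:
        # Each detected object may carry several candidate tags.
        summary["objects"] = [
            tag["name"]
            for obj in result["objectsResult"].get("values", [])
            for tag in obj.get("tags", [])
        ]
    if "readResult" in result:
        summary["text"] = result["readResult"].get("content", "")
    return summary


# Hypothetical payload, written by hand for this sketch.
sample = {
    "captionResult": {"text": "a dog on a beach", "confidence": 0.9},
    "tagsResult": {"values": [{"name": "dog"}, {"name": "beach"}]},
}
print(summarize_analysis(sample))
```

Only the sections present in the response appear in the summary, which mirrors how a caller might request a subset of features.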