What exactly does this MCP server do?

This MCP server provides access to HuggingFace computer vision models for zero-shot object detection. It runs on either GPU or CPU and can be deployed via Docker.

Integrate Vision Models with LLMs

Author: Groundlight AI

MCP server providing access to HuggingFace computer vision models for zero-shot object detection, with GPU/CPU support and Docker deployment.

Works with: HuggingFace, Docker

⚠️ This tool looks unmaintained — no upstream commits in 12+ months.

Updated May 2025

Version 1.0.0

Models

claude

Why it matters

Extend large language models and vision-language models with computer vision capabilities. Access HuggingFace models for tasks like zero-shot object detection and image analysis.

Outcomes

What it gets done

Detect and localize objects in images using zero-shot object detection.

Crop images to focus on detected objects for detailed analysis.

Integrate computer vision models as tools for LLMs.

Deploy vision models via Docker for CPU or GPU execution.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/vb-mcp-vision | bash

Capabilities

Tools your agent gets

locate_objects

Detect and localize objects in an image using zero-shot object detection from HuggingFace models.

zoom_to_object

Zoom into an object in an image by cropping to the object's bounding box for detailed analysis.

Overview

mcp-vision MCP Server

What it does

This connector exposes HuggingFace computer vision models through the Model Context Protocol, offering two tools: `locate_objects` for detecting and localizing objects using zero-shot object detection pipelines, and `zoom_to_object` for cropping images to detected object bounding boxes.

How it connects

Use this server when you need to add object detection and localization capabilities to MCP clients like Claude Desktop. It's particularly useful for analyzing images to identify and zoom into specific objects without requiring model training on specific object classes. The repository is in active development; when running on CPU, the default large model may take considerable time to load and perform inference.

Source README

mcp-vision by Groundlight AI

A Model Context Protocol (MCP) server exposing HuggingFace computer vision models such as zero-shot object detection as tools, enhancing the vision capabilities of large language or vision-language models.

This repo is in active development. See below for details of currently available tools.

Installation

Clone the repo:

git clone git@github.com:groundlight/mcp-vision.git

Build a local docker image:

cd mcp-vision
make build-docker

Configuring Claude Desktop

Add this to your claude_desktop_config.json:

If your local environment has access to a NVIDIA GPU:

"mcpServers": {
  "mcp-vision": {
    "command": "docker",
    "args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "mcp-vision"],
    "env": {}
  }
}

Or, CPU only:

"mcpServers": {
  "mcp-vision": {
    "command": "docker",
    "args": ["run", "-i", "--rm", "mcp-vision"],
    "env": {}
  }
}

When running on CPU, the default large-size object detection model may take a long time to load and run inference. Consider setting a smaller model as DEFAULT_OBJDET_MODEL (you can also tell Claude directly to use a specific model).

(Beta) You can run the public Docker image directly without building locally; however, the download time may interfere with Claude's loading of the server.

"mcpServers": {
  "mcp-vision": {
    "command": "docker",
    "args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "groundlight/mcp-vision:latest"],
    "env": {}
  }
}
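
The three configurations above differ only in the GPU flags and the image name. As a minimal sketch (not part of the mcp-vision repo), the Python helper below builds the matching entry and merges it into an existing config dict without touching other servers. The names `build_entry` and `merge_entry` are hypothetical, and reading or writing the actual claude_desktop_config.json file is left out because its location varies by platform.

```python
import json


def build_entry(image="mcp-vision", gpu=False):
    """Return a server entry equivalent to the configurations shown above."""
    args = ["run", "-i", "--rm"]
    if gpu:
        # Same GPU flags as the NVIDIA configuration in this README.
        args += ["--runtime=nvidia", "--gpus", "all"]
    args.append(image)
    return {"command": "docker", "args": args, "env": {}}


def merge_entry(config, entry, name="mcp-vision"):
    """Add or replace one server under "mcpServers", keeping other servers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Hypothetical existing config with one unrelated server.
    existing = {"mcpServers": {"other-server": {"command": "npx", "args": []}}}
    updated = merge_entry(existing, build_entry(gpu=False))
    print(json.dumps(updated, indent=2))
```

Passing `image="groundlight/mcp-vision:latest"` with `gpu=True` reproduces the beta configuration that uses the public image.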

Tools

The following tools are currently available through the mcp-vision server:

locate_objects

Description: Detect and locate objects in an image using one of the zero-shot object detection pipelines available through HuggingFace (list for reference: https://huggingface.co/models?pipeline_tag=zero-shot-object-detection&sort=trending).
Input:
  image_path (string): URL or file path
  candidate_labels (list of strings): possible objects to detect
  hf_model (optional string): defaults to "google/owlvit-large-patch14", which could be slow on a non-GPU machine
Returns: List of dicts in HF object-detection format
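
For reference, the HuggingFace object-detection format is a list of dicts, each with a `score`, a `label`, and a `box` holding pixel coordinates `xmin`, `ymin`, `xmax`, `ymax`. The sketch below uses invented detections (not real model output) to show how a caller might filter such a result. `filter_detections` and the 0.3 threshold are illustrative assumptions, not part of mcp-vision.

```python
# Invented sample data in the HF object-detection format.
detections = [
    {"score": 0.82, "label": "sign",
     "box": {"xmin": 40, "ymin": 10, "xmax": 200, "ymax": 90}},
    {"score": 0.12, "label": "sign",
     "box": {"xmin": 300, "ymin": 50, "xmax": 360, "ymax": 80}},
    {"score": 0.55, "label": "door",
     "box": {"xmin": 120, "ymin": 100, "xmax": 220, "ymax": 400}},
]


def filter_detections(detections, label=None, min_score=0.3):
    """Keep detections at or above min_score, optionally for a single label."""
    return [
        d for d in detections
        if d["score"] >= min_score and (label is None or d["label"] == label)
    ]


if __name__ == "__main__":
    for d in filter_detections(detections):
        print(d["label"], d["score"], d["box"])
```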

zoom_to_object

Description: Zoom into an object in the image so it can be analyzed more closely. The image is cropped to the object's bounding box and the cropped image is returned. If several matching objects are present, the 'best' one, meaning the one with the highest detection score, is returned.
Input:
  image_path (string): URL or file path
  label (string): object label to find, zoom to, and crop to
  hf_model (optional): defaults to "google/owlvit-large-patch14", which could be slow on a non-GPU machine
Returns: MCPImage or None
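
The selection step described here can be sketched in plain Python: pick the highest-scoring detection for the requested label and derive a crop box from it. This is an assumption about the logic based on the description, not the server's actual code. `best_crop_box` is a hypothetical name, and clamping the box to the image bounds is an extra safety step added for the example.

```python
def best_crop_box(detections, label, image_size):
    """Return (left, upper, right, lower) for the best match, or None."""
    matches = [d for d in detections if d["label"] == label]
    if not matches:
        # Mirrors the "or None" case: nothing with this label was found.
        return None
    best = max(matches, key=lambda d: d["score"])
    width, height = image_size
    box = best["box"]
    # Clamp to the image so the crop never reaches outside it.
    left = max(0, box["xmin"])
    upper = max(0, box["ymin"])
    right = min(width, box["xmax"])
    lower = min(height, box["ymax"])
    return (left, upper, right, lower)


if __name__ == "__main__":
    # Invented detections in the HF object-detection format.
    sample = [
        {"score": 0.4, "label": "sign",
         "box": {"xmin": -5, "ymin": 10, "xmax": 120, "ymax": 80}},
        {"score": 0.9, "label": "sign",
         "box": {"xmin": 30, "ymin": 20, "xmax": 700, "ymax": 90}},
    ]
    print(best_crop_box(sample, "sign", (640, 480)))
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed to that method directly.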

Example in blog post and video

Run Claude Desktop with Claude 3.7 Sonnet and mcp-vision configured as an MCP server in claude_desktop_config.json.

The prompt used in the example video and blog post was:

From the information on that advertising board, what is the type of this shop?
Options:
The shop is a yoga studio.
The shop is a cafe.
The shop is a seven-eleven.
The shop is a milk tea shop.

The image is the first image in the V*Bench/GPT4V-hard dataset and can be found here: https://huggingface.co/datasets/craigwu/vstar_bench/blob/main/GPT4V-hard/0.JPG (use the download link).
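
On the HuggingFace Hub, a file page URL like the one above contains `/blob/`, while the direct download link uses `/resolve/` in the same position. A small sketch of that conversion, assuming the URL follows the usual Hub layout (`to_download_url` is a hypothetical helper, not part of mcp-vision):

```python
def to_download_url(page_url):
    """Turn a Hub file page URL (/blob/) into a direct download URL (/resolve/)."""
    # Only the first occurrence is replaced, in case the file path contains "blob".
    return page_url.replace("/blob/", "/resolve/", 1)


if __name__ == "__main__":
    page = "https://huggingface.co/datasets/craigwu/vstar_bench/blob/main/GPT4V-hard/0.JPG"
    print(to_download_url(page))
```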

Note:

If you upload the image directly into the conversation with Claude instead of providing a download link, it will not be able to call the tools and will attempt to answer directly.
On accounts that have web search enabled, Claude appears to prefer web search over local MCP tools. Disable web search for best results.

Development

Run locally using the uv package manager:

uv sync
uv run python mcp_vision

Build the Docker image locally:

make build-docker

Run the Docker image locally:

make run-docker-cpu

make run-docker-gpu

[Groundlight Internal] Push the Docker image to Docker Hub (requires DockerHub credentials):

make push-docker

Troubleshooting

If Claude Desktop is failing to connect to mcp-vision:

Check the configuration is correct (CPU vs GPU)
Developer options may need to be enabled in Claude Desktop
Depending on the size of the model(s) used, give it a few minutes to download them from HuggingFace on first opening Claude Desktop. Once downloaded, the server will respond and Claude will connect.

TODO

Host best models online instead of requiring local download
Add more tools
