Vision Fine-tuning on GPT-4o for Visual Question Answering

We're excited to announce the launch of [Vision Fine-Tuning on GPT-4o](https://openai.com/index/introducing-vision-to-the-fine-tuning-api/), a cutting-edge multimodal fine-tuning capability that empowers developers to fine-tune GPT-4o using both **images** and **text**. With this new feature, you can customize models to have stronger image understanding capabilities, unlocking possibilities across various industries and applications.

Get this prompt chain

Vision Fine-tuning on GPT-4o for Visual Question Answering

We're excited to announce the launch of Vision Fine-Tuning on GPT-4o, a cutting-edge multimodal fine-tuning capability that empowers developers to fine-tune GPT-4o using both images and text. With this new feature, you can customize models to have stronger image understanding capabilities, unlocking possibilities across various industries and applications.

From advanced visual search to improved object detection for autonomous vehicles or smart cities, vision fine-tuning enables you to craft solutions tailored to your specific needs. By combining text and image inputs, this product is uniquely positioned for tasks like visual question answering, where detailed, context-aware answers are derived from analyzing images. In general, this seems to be most effective when the model is presented with questions and images that resemble the training data as we are able to teach the model how to search and identify relevant parts of the image to answer the question correctly. Similarly to fine-tuning on text inputs, vision fine-tuning is not as useful for teaching the model new information.

In this guide, we’ll walk you through the steps to fine-tune GPT-4o with multimodal inputs. Specifically, we’ll demonstrate how to train a model for answering questions related to images of books, but the potential applications span countless domains—from web design and education to healthcare and research.

Whether you're looking to build smarter defect detection models for manufacturing, enhance complex document processing and diagram understanding, or develop applications with better visual comprehension for a variety of other use cases, this guide will show you just how fast and easy it is to get started.

For more information, check out the full Documentation.

Load Dataset

We will work with a dataset of question-answer pairs on images of books from the OCR-VQA dataset, accessible through HuggingFace. This dataset contains 207,572 images of books with associated question-answer pairs inquiring about title, author, edition, year and genre of the book. In total, the dataset contains ~1M QA pairs. For the purposes of this guide, we will only use a small subset of the dataset to train, validate and test our model.

We believe that this dataset will be well suited for fine-tuning on multimodal inputs as it requires the model to not only accurately identify relevant bounding boxes to extract key information, but also reason about the content of the image to answer the question correctly.

We'll begin by sampling 150 training examples, 50 validation examples and 100 test examples. We will also explode the questions and answers columns to create a single QA pair for each row. Additionally, since our images are stored as byte strings, we'll convert them to images for processing.

Let's inspect a random sample from the training set.

In this example, the question prompts the model to determine the title of the book. In this case, the answer is quite ambiguous as there is the main title "Patty's Patterns - Advanced Series Vol. 1 & 2" as well as the subtitle "100 Full-Page Patterns Value Bundle" which are found in different parts of the image. Also, the name of the author here is not an individual, but a group called "Penny Farthing Graphics" which could be mistaken as part of the title.

This type of task is typical in visual question answering, where the model must interpret complex images and provide accurate, context-specific responses. By training on these kinds of questions, we can enhance the model's ability to perform detailed image analysis across a variety of domains.

Data Preparation

To ensure successful fine-tuning of our model, it’s crucial to properly structure the training data. Correctly formatting the data helps avoid validation errors during training and ensures the model can effectively learn from both text and image inputs. The good news is, this process is quite straightforward.

Each example in the training dataset should be a conversation in the same format as the Chat Completions API. Specifically, this means structuring the data as a series of messages, where each message includes a role (such as "user" or "assistant") and the content of the message.

Since we are working with both text and images for vision fine-tuning, we’ll construct these messages to include both content types. For each training sample, the question about the image is presented as a user message, and the corresponding answer is provided as an assistant message.

Images can be included in one of two ways:

  • As HTTP URLs, referencing the location of the image.
  • As data URLs containing the image encoded in base64.

Here’s an example of how the message format should look:

Let's start by defining the system instructions for our model. These instructions provide the model with important context, guiding how it should behave when processing the training data. Clear and concise system instructions are particularly useful to make sure the model reasons well on both text and images.

To ensure our images are properly formatted for vision fine-tuning, they must be in base64 format and either RGB or RGBA. This ensures the model can accurately process the images during training. Below is a function that handles the encoding of images, while also converting them to the correct format if necessary.

This function allows us to control the quality of the image encoding, which can be useful if we want to reduce the size of the file. 100 is the highest quality, and 1 is the lowest. The maximum file size for a fine-tuning job is 1GB, but we are unlikely to see improvements with a very large amount of training data. Nevertheless, we can use the quality parameter to reduce the size of the file if needed to accomodate file size limits.

We will also include Few-Shot examples from the training set as user and assistant messages to help guide the model's reasoning process.

Now that we have our system instructions, few-shot examples, and the image encoding function in place, the next step is to iterate through the training set and construct the messages required for fine-tuning. As a reminder, each training example must be formatted as a conversation and must include both the image (in base64 format) and the corresponding question and answer.

To fine-tune GPT-4o, we recommend providing at least 10 examples, but you’ll typically see noticeable improvements with 50 to 100 training examples. In this case, we'll go all-in and fine-tune the model using our larger training sample of 150 images, and 721 QA pairs.

We save our final training set in a .jsonl file where each line in the file represents a single example in the training dataset.

Just like the training set, we need to structure our validation and test sets in the same message format. However, for the test set, there's a key difference: since the test set is used for evaluation, we do not include the assistant's message (i.e., the answer). This ensures the model generates its own answers, which we can later compare to the ground truth for performance evaluation.

Fine-tuning

Now that we have prepared our training and validation datasets in the right format, we can upload them using the Files API for fine-tuning.

Once the files are uploaded, we're ready to proceed to the next step: starting the fine-tuning job.

To create a fine-tuning job, we use the fine-tuning API. This may take some time to complete, but you can track the progress of the fine-tuning job in the Platform UI.

Evaluation

Once the fine-tuning job is complete, it’s time to evaluate the performance of our model by running inference on the test set. This step involves using the fine-tuned model to generate responses to the questions in the test set and comparing its predictions to the ground truth answers for evaluation. We will also run inference on the test set using the non-fine-tuned GPT-4o model for comparison.

Now that we’ve run inference using our fine-tuned model, let’s inspect a few specific examples to understand how well the model performed compared to the actual answers.

As we can see, the fine-tuned model does a great job at answering the questions, with many responses being exactly correct.

However, there are also cases where the model’s predicted answers are close to the ground truth, while not matching exactly, particularly in open-ended questions where phrasing or details may differ. To assess the quality of these predictions, we will use GPT-4o to evaluate the similarity between the predicted responses and the ground truth labels from the dataset.

In order to evaluate our model responses, we will use GPT-4o to determine the similarity between the ground truth and our predicted responses. We will rank our predicted answers based on the following criteria:

  • Very Similar: The predicted answer exactly matches the ground truth and there is no important information omitted, although there may be some minor ommissions or discrepancies in punctuation.

  • Mostly Similar: The predicted answer closely aligns with the ground truth, perhaps with some missing words or phrases.

  • Somewhat Similar: Although the predicted answer has noticeable differences to the ground truth, the core content is accurate and semantically similar, perhaps with some missing information.

  • Incorrect: The predicted answer is completely incorrect, irrelevant, or contains critical errors or omissions from the ground truth.

To fully understand the impact of fine-tuning, we also evaluated the same set of test questions using the non-fine-tuned GPT-4o model.

Let's start by comparing the performance of the fine-tuned model vs the non-fine-tuned model for Closed form (Yes/No) questions.

Note that with the fine-tuned model, we can check for exact matches between the predicted and actual answers because the model has learned to produce consistent answers that follow the response format specified in the system prompt. However, for the non-fine-tuned model, we need to account for variations in phrasing and wording in the predicted answers. Below is an example of a non-fine-tuned model output. As we can see, the final answer is correct but the response format is inconsistent and outputs reasoning in the response.

With a generous allowance for variations in phrasing and wording for the non-fine-tuned model including ignoring case and allowing for partial matches, the fine-tuned model still outperforms the non-fine-tuned model by a margin of 2.64% on this set of questions.

Now, let's compare the performance of the fine-tuned model vs the non-fine-tuned model over all the open-ended questions. First, we'll check for exact matches between the predicted and actual answers, again allowing for general variations in phrasing and wording for the non-fine-tuned model, but maintaining a strict standard for the fine-tuned model.

The improvement in accuracy here is much more pronounced, with the fine-tuned model outperforming the non-fine-tuned model by a substantial margin of 17.97%, even with very generous allowances for variations in phrasing and wording for the non-fine-tuned model!

If we were to afford the same leniency to the fine-tuned model, we would see an additional 4.1% increase in accuracy, bringing the total margin of improvement to 22.07%.

To dig a little deeper, we can also look at the accuracy by question type.

It appears that the largest performance gains for the fine-tuned model are for questions in the Genre category e.g. "What type of book is this?" or "What is the genre of this book?". This might be indicative of the benefits of fine-tuning in general in that we teach the model to classify genres based on the categories present in the training data. However, it also highlights the model's strong visual undserstanding capabilties, since we are able to identify the genre based on the visual content of the book cover alone.

Additionally, we see significant lift in the Title category, which suggests that fine-tuning has boosted the model's OCR capbilities and its ability to understand the layout and structure of the book cover to extract the relevant information.

Finally, let's compare the distribution of similarity ratings between the fine-tuned model and the non-fine-tuned model to allow for variations in phrasing and wording.

The results provide a clear picture of the benefits gained through fine-tuning, without any other modifications.
Comparing the distribution of ratings between the fine-tuned GPT-4o model and GPT-4o without fine-tuning, we see that the fine-tuned model gets many more responses exactly correct, with a comparable amount of incorrect responses.

Key Takeaways

  • Improved Precision: Fine-tuning helped the model produce more precise answers that matched the ground truth, especially in highly domain-specific tasks like OCR on book covers.
  • Better Generalization: While the non-fine-tuned GPT-4o was able to get at least somewhat to the ground truth for many questions, it was less consistent. The fine-tuned model exhibited better generalization across a variety of test questions, thanks to the exposure to multimodal data during training.

While the results from vision fine-tuning are promising, there are still opportunities for improvement. Much like fine-tuning on text, the effectiveness of vision fine-tuning depends heavily on the quality, diversity, and representativeness of the training data. In particular, models benefit from focusing on cases where errors occur most frequently, allowing for targeted improvements.

Upon reviewing the incorrect results, many of the "Incorrect" responses from the fine-tuned model are in fact due to inconsistencies in the labels from the dataset. For example, some ground truth answers provide only the first and last name of the author, whereas the image actually shows the middle initial as well. Similarly, some ground truth labels for the title include subheadings and taglines, whereas others do not.

Another common theme was miscategorization of genres. Although the model was almost always able to produce a semantically similar genre to the ground truth, the answer sometimes deviated. This is likely due to the lack of presence of these genres in the training data. Providing the model with more diverse training examples to cover these genres, or clearer instructions for dealing with edge cases can help to guide the model’s understanding.

Next Steps:

  • Expand the Training Dataset: Adding more varied examples that cover the model’s weaker areas, such as identifying genres, could significantly enhance performance.

  • Expert-Informed Prompts: Incorporating domain-specific instructions into the training prompts may further refine the model’s ability to accurately interpret and respond in complex cases.

Although there is still some progress to be made on this particular task, the initial results are highly encouraging. With minimal setup and effort, we’ve already observed a substantial uplift in overall accuracy with vision fine-tuning, indicating that this approach holds great potential. Vision fine-tuning opens up possibilities for improvement across a wide range of visual question answering tasks, as well as other tasks that rely on strong visual understanding.

Comments (0)

Sign In Sign in to leave a comment.

Spark Drops

Weekly picks: best new AI tools, agents & prompts

Venture Crew
Terms of Service

© 2026, Venture Crew