Getting the Most out of GPT-5.4 for Vision and Document Understanding

GPT-5.4 is a major step forward for real-world multimodal workloads.

Documents that previously strained vision systems or required stitching together OCR, layout detection, and custom parsers, including dense scans, handwritten forms, engineering diagrams, and chart-heavy reports, can now often be interpreted and reasoned over in a single model pass with GPT-5.4.

However, model configuration is key to getting state-of-the-art results. Small choices around image detail, verbosity, reasoning effort, and tool usage can significantly affect performance.

This notebook focuses on the highest-leverage adjustments for document workloads: image detail, verbosity, reasoning effort, and tool use. The goal is to show when each one matters, how it changes the output, and how to choose a setup that is both robust and practical.

All examples in this notebook use the Responses API via client.responses.create(...). The “settings” we talk about are request parameters you pass into that call.

Input shape

  • input: a list of message-like objects (commonly one { "role": "user", "content": [...] })
  • content: a list of typed blocks, typically:
    • { "type": "input_text", "text": "..." }
    • { "type": "input_image", "image_url": "...", "detail": "auto" | "original" }

Parameters used throughout this notebook

  • Image detail (input_image.detail): controls the image resolution used for vision. Use "auto" for most pages; use "original" when text is tiny, handwritten, or the scan is low-quality.
  • Verbosity (text={"verbosity": ...}): influences how compressed vs literal the text output is. Higher verbosity is helpful for faithful transcription.
  • Reasoning effort (reasoning={"effort": ...}): allocates more compute to multi-step visual reasoning (charts, tables, diagrams) once the image is already readable.
  • Tool use (tools=[...] + instructions=...): optionally lets the model use tools like Code Interpreter to zoom/crop/inspect before answering; omit tools when a single-pass answer is enough.

A minimal request looks like:

from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Extract the total amount due."},
                {
                    "type": "input_image",
                    "image_url": "data:image/png;base64,...",
                    "detail": "auto",
                },
            ],
        }
    ],
)
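
The image_url in the request above is elided. One way to fill it from a local file is to base64-encode the bytes into a data URL. The sketch below is only an illustration: to_data_url and image_block are hypothetical helper names, not part of the SDK, and the PNG fallback for unknown extensions is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode a local image file as a data URL for an input_image block."""
    # Guess the MIME type from the file extension; fall back to PNG (an assumption).
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "image/png"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_block(path: str, detail: str = "auto") -> dict:
    """Build an input_image content block from a local file (hypothetical helper)."""
    return {"type": "input_image", "image_url": to_data_url(path), "detail": detail}
```

The returned block can be placed directly in the content list next to an input_text block.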

Output shape

  • The model returns a response object with one or more output items.
  • In this notebook, we mostly use response.output_text as a convenient way to get the final text.
  • For structured outputs, you still receive text: ask the model to format it as JSON using text={"format": ...}, then parse it with json.loads(response.output_text).
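
As a rough sketch of the structured-output path, the block below defines a value for the text parameter and a parser for response.output_text. The schema name and field names (invoice_total, total_amount_due, currency) are made up for illustration, and the json_schema format shape is written from my understanding of the Responses API, so check it against the current API reference.

```python
import json

# JSON Schema for the fields we want back; the field names are illustrative.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "total_amount_due": {"type": "string"},
        "currency": {"type": "string"},
    },
    "required": ["total_amount_due", "currency"],
    "additionalProperties": False,
}

# Value for the `text` request parameter, i.e. text=TEXT_FORMAT.
TEXT_FORMAT = {
    "format": {
        "type": "json_schema",
        "name": "invoice_total",  # hypothetical schema name
        "schema": INVOICE_SCHEMA,
        "strict": True,
    }
}


def parse_output(output_text: str) -> dict:
    """Parse response.output_text and check that the required keys are present."""
    data = json.loads(output_text)
    missing = [key for key in INVOICE_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

In use, this would look like client.responses.create(..., text=TEXT_FORMAT) followed by parse_output(response.output_text).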

A quick decision guide

Use this as a starting point. A good rule of thumb is to start simple, then adjust the setting that matches the failure mode.

| If your task looks like this | Start with this setup | Why |
| --- | --- | --- |
| Ordinary document QA or extraction | detail="auto" | Lowest-friction default for readable pages |
| Dense scans, screenshots, handwriting, or tiny labels | detail="original" | Preserves small visual signals that often get lost |
| Literal transcription or markdown conversion | text={"verbosity": "high"} | Encourages the model to keep more layout and fewer paraphrases |
| Region localization | Ask for [x_min, y_min, x_max, y_max] in a fixed 0..999 grid | Easy to crop, draw, debug, and feed into downstream systems |
| Chart, table, form, or drawing QA across multiple regions | Increase reasoning effort to "high" or "xhigh" | Improves multi-step visual reasoning |
| Multi-pass visual inspection | Add Code Interpreter | Best when a human would zoom, crop, rotate, or inspect several subregions before answering |
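
The guide above can be folded into a small lookup that builds the keyword arguments for client.responses.create(...). This is a sketch under assumptions: the task labels and the build_request helper are hypothetical, and the code_interpreter tool entry assumes an automatically created container.

```python
# Settings from the decision guide, keyed by hypothetical task labels.
SETUPS = {
    "document_qa": {"detail": "auto"},
    "dense_or_handwritten": {"detail": "original"},
    "transcription": {"verbosity": "high"},
    "multi_region_reasoning": {"effort": "high"},
    "multi_pass_inspection": {"code_interpreter": True},
}


def build_request(task: str, prompt: str, image_url: str, model: str = "gpt-5.4") -> dict:
    """Return keyword arguments for client.responses.create(...)."""
    setup = SETUPS[task]
    kwargs = {
        "model": model,
        "input": [{
            "role": "user",
            "content": [
                {"type": "input_text", "text": prompt},
                {"type": "input_image", "image_url": image_url,
                 "detail": setup.get("detail", "auto")},
            ],
        }],
    }
    # Only add the parameters that the chosen setup calls for.
    if "verbosity" in setup:
        kwargs["text"] = {"verbosity": setup["verbosity"]}
    if "effort" in setup:
        kwargs["reasoning"] = {"effort": setup["effort"]}
    if setup.get("code_interpreter"):
        # Assumed tool shape; confirm against the current API reference.
        kwargs["tools"] = [{"type": "code_interpreter", "container": {"type": "auto"}}]
    return kwargs
```

A call would then be client.responses.create(**build_request("transcription", prompt, image_url)).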

Setup

Before running this notebook, make sure you have OPENAI_API_KEY set in your environment. If you don’t have an API key yet, you can create one at platform.openai.com.

export OPENAI_API_KEY="your_api_key_here"

If needed, install the notebook dependencies:

pip install --upgrade openai pillow
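
Before making any requests, it can help to fail early if the key is missing. A minimal sketch, where require_api_key is a hypothetical helper rather than anything provided by the SDK:

```python
import os


def require_api_key(env=os.environ) -> str:
    """Return the API key, or raise a clear error if it is not set."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the notebook."
        )
    return key
```

Calling require_api_key() at the top of the notebook surfaces a missing key before the first request is sent.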

1. Increase image detail for dense pages and handwriting

The detail parameter controls the resolution the model uses when processing an image. Most applications should start with detail="auto", which lets the model choose an appropriate resolution. However, when pages contain handwriting, small labels, dense tables, low-contrast scans, or screenshots with fine text, switching to detail="original" can significantly improve results. If the model is mostly correct but consistently misses small fields or annotations, increasing image detail is usually the first adjustment to try.

This example intentionally includes small email and phone fields, not just the larger handwritten names. Those are the kinds of details that tend to degrade first when the image is downsampled.

Figure: Handwritten insurance form
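
One way to see what detail changes is to send the same prompt twice, once per detail level, and compare the extracted fields. The sketch below assumes each run's output has already been parsed into a dict; the prompt wording and both helper names are hypothetical.

```python
def extraction_request(image_url: str, detail: str, model: str = "gpt-5.4") -> dict:
    """Return kwargs for client.responses.create(...) at a given image detail."""
    prompt = (
        "Extract every handwritten field on this form as JSON, "
        "including the email and phone fields."
    )
    return {
        "model": model,
        "input": [{
            "role": "user",
            "content": [
                {"type": "input_text", "text": prompt},
                {"type": "input_image", "image_url": image_url, "detail": detail},
            ],
        }],
    }


def changed_fields(auto_fields: dict, original_fields: dict) -> dict:
    """Map each field that differs to its (auto, original) pair of values."""
    keys = sorted(set(auto_fields) | set(original_fields))
    return {
        key: (auto_fields.get(key), original_fields.get(key))
        for key in keys
        if auto_fields.get(key) != original_fields.get(key)
    }
```

Fields that show up in changed_fields are the ones worth checking by eye against the form, since neither run is guaranteed to be the correct one.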

2. Increase verbosity for faithful transcription

When asked to transcribe documents, multimodal models tend to compress layout. They preserve meaning but may simplify whitespace, line breaks, and table-like layout. This behavior is often desirable for question answering, but not for OCR-style tasks.

Increasing verbosity with text={"verbosity": "high"} pushes the model toward a more literal rendering and more precise transcription. Use it for OCR-style workloads and targeted extractions where completeness and formatting fidelity matter.

The example below uses the Ticket To The Arts panel, asking for a full transcription of all four listings while keeping the image detail fixed.

Figure: Newspaper clipping
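
A transcription request along these lines might look like the sketch below. The prompt wording and the transcription_request helper are illustrative assumptions; only the text={"verbosity": ...} parameter and the fixed image detail come from the discussion above.

```python
def transcription_request(
    image_url: str,
    verbosity: str = "high",
    detail: str = "auto",
    model: str = "gpt-5.4",
) -> dict:
    """Return kwargs for client.responses.create(...) for a literal transcription."""
    prompt = (
        "Transcribe all four listings in this panel exactly as printed. "
        "Preserve line breaks and spacing, and do not paraphrase or summarize."
    )
    return {
        "model": model,
        # Verbosity is the only setting varied; image detail stays fixed.
        "text": {"verbosity": verbosity},
        "input": [{
            "role": "user",
            "content": [
                {"type": "input_text", "text": prompt},
                {"type": "input_image", "image_url": image_url, "detail": detail},
            ],
        }],
    }
```

Running the same request at a lower verbosity and comparing the two outputs side by side shows how much layout the model compresses by default.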

3. Raise reasoning effort when the image is readable but the answer is compositional

Once the image is readable, the next bottleneck is often reasoning instead of perception. This shows up in documents where the answer depends on combining information across multiple parts of the image rather than reading a single field. Charts, tables, technical diagrams, and dense visual layouts often fall into this category.

In those cases, increasing reasoning effort with reasoning={"effort": "high"} can help more than increasing image detail. The model already sees the content. What it needs is more capacity to connect labels, compare regions, follow structure, and compute the final answer correctly.

Below are examples of different types of tasks or images where higher reasoning is helpful.

Example: floorplan reasoning

The floorplan below is a good example of a task that goes beyond transcription. To answer correctly, the model has to read room labels, interpret spatial relationships, and use visible dimensions to compute values.

Figure: Apartment floorplan
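
A request for this kind of question could be built as in the sketch below. The reasoning_request helper and the sample question are hypothetical; the question is only an example of a compositional prompt and is not taken from the floorplan itself.

```python
def reasoning_request(
    question: str,
    image_url: str,
    effort: str = "high",
    detail: str = "auto",
    model: str = "gpt-5.4",
) -> dict:
    """Return kwargs for client.responses.create(...) with raised reasoning effort."""
    return {
        "model": model,
        # More compute for combining labels, dimensions, and spatial relations.
        "reasoning": {"effort": effort},
        "input": [{
            "role": "user",
            "content": [
                {"type": "input_text", "text": question},
                {"type": "input_image", "image_url": image_url, "detail": detail},
            ],
        }],
    }


# Illustrative question only; adapt it to what the floorplan actually shows.
EXAMPLE_QUESTION = (
    "Using the dimensions printed on the plan, which room has the largest area? "
    "List the dimensions you used."
)
```

Asking the model to list the dimensions it used makes it easier to tell a reading error from a reasoning error.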

Example: chart understanding

The same pattern shows up in chart understanding. If the task is simply to read a title or identify one plotted value, default settings may be enough. But if the answer depends on comparing multiple series, tracking changes across adjacent intervals, or estimating trends over time, reasoning becomes the limiting factor.

Figure: Line chart

Example: long-range visual reasoning on a dense bracket

Dense tournament brackets are a strong candidate for higher reasoning effort because the model has to follow paths across a crowded layout, keep left and right regions distinct, and identify the final outcomes without losing track of structure.

Figure: Tournament bracket
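
Because higher effort costs latency, one possible pattern is to escalate only when a cheaper attempt fails a check. The sketch below is an assumption layered on top of the notebook's advice, not part of it: escalate_effort, the ask callback, and the is_acceptable check are all hypothetical, and "medium" as a starting level is assumed.

```python
from typing import Callable, Optional, Sequence, Tuple


def escalate_effort(
    ask: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
    efforts: Sequence[str] = ("medium", "high", "xhigh"),
) -> Tuple[Optional[str], Optional[str]]:
    """Try each effort level in order and stop at the first acceptable answer.

    Returns (effort, answer) on success, or (None, last_answer) if every
    level was tried without passing the check.
    """
    answer = None
    for effort in efforts:
        # `ask` is expected to call the API with reasoning={"effort": effort}.
        answer = ask(effort)
        if is_acceptable(answer):
            return effort, answer
    return None, answer
```

Here ask would wrap client.responses.create(..., reasoning={"effort": effort}) and return response.output_text, while is_acceptable could be a schema check or a comparison against a known answer.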

4. Use Code Interpreter for multi-pass inspection and bounding-box localization

Some document tasks are easier to solve the way a person would: inspect the full page, zoom or crop a region, check another area, and then combine evidence into a final answer.

Code Interpreter is particularly useful for vision tasks when:

  • the page is dense and evidence is spread across multiple regions
  • the model needs to zoom, crop, rotate, or run intermediate checks
  • accuracy matters more than latency
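
A request that grants Code Interpreter access might be assembled as in the sketch below. The instruction text and the inspection_request helper are hypothetical, and the tool entry assumes the code_interpreter shape with an automatically created container, which should be checked against the current API reference.

```python
INSPECTION_INSTRUCTIONS = (
    "Use the python tool to zoom, crop, or rotate the page when text is hard to read. "
    "Inspect each relevant region before giving a final answer."
)


def inspection_request(
    question: str,
    image_url: str,
    detail: str = "original",
    model: str = "gpt-5.4",
) -> dict:
    """Return kwargs for client.responses.create(...) with Code Interpreter enabled."""
    return {
        "model": model,
        "instructions": INSPECTION_INSTRUCTIONS,
        # Assumed tool shape: a sandbox container created automatically.
        "tools": [{"type": "code_interpreter", "container": {"type": "auto"}}],
        "input": [{
            "role": "user",
            "content": [
                {"type": "input_text", "text": question},
                {"type": "input_image", "image_url": image_url, "detail": detail},
            ],
        }],
    }
```

Leaving out the tools key turns the same request back into a single-pass call, which makes the two modes easy to compare on the same page.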

For localization tasks (including bounding boxes), give the model access to Code Interpreter along with a strict coordinate contract such as [x_min, y_min, x_max, y_max] in a fixed coordinate space such as 0..999 with the origin in the top-left corner.

In practice, this combination (Code Interpreter tool use plus an explicit box format) is often more reliable and repeatable than a single-pass vision call.

Figure: Police report form
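
Whatever the model returns, it is worth validating the box against the contract before using it downstream. A minimal sketch, assuming the model was asked to reply with a JSON object of the form {"box": [x_min, y_min, x_max, y_max]}; the "box" key and the parse_box helper are hypothetical.

```python
import json

GRID_MAX = 999  # fixed coordinate space 0..999, origin at the top-left corner


def parse_box(output_text: str) -> list:
    """Parse {"box": [x_min, y_min, x_max, y_max]} and validate the contract."""
    box = json.loads(output_text)["box"]
    if not (isinstance(box, list) and len(box) == 4):
        raise ValueError("expected a list of four coordinates")
    for value in box:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("coordinates must be integers")
        if not 0 <= value <= GRID_MAX:
            raise ValueError("coordinates must lie in 0..999")
    x_min, y_min, x_max, y_max = box
    if x_min >= x_max or y_min >= y_max:
        raise ValueError("box must have positive width and height")
    return box
```

Rejecting malformed boxes at this point keeps cropping and drawing code free of special cases.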

5. If you cannot use Code Interpreter, build a narrow crop-and-rerun pipeline

In restricted environments, you may not want to grant the model a general Python sandbox. A practical alternative is a three-step workflow:

  1. localize the field or region you care about
  2. crop that region locally
  3. rerun a smaller, more focused prompt on the crop

This often recovers much of the value of multi-pass inspection while keeping the control surface small.
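
The crop step of the workflow above can be sketched as follows, assuming the localization step returned a box in a 0..999 grid. The helper names and the 5% padding are illustrative assumptions; the crop itself uses Pillow, which is imported inside the function so the coordinate math can be used on its own.

```python
import math


def grid_to_pixels(box, width, height, pad=0.05, grid_max=999):
    """Map a 0..999 grid box to a padded pixel box clamped to the image bounds."""
    x_min, y_min, x_max, y_max = box
    # Scale grid coordinates to pixel coordinates.
    left = x_min * width / grid_max
    top = y_min * height / grid_max
    right = x_max * width / grid_max
    bottom = y_max * height / grid_max
    # Pad by a fraction of the box size so slightly tight boxes keep their edges.
    dx = (right - left) * pad
    dy = (bottom - top) * pad
    return (
        max(0, math.floor(left - dx)),
        max(0, math.floor(top - dy)),
        min(width, math.ceil(right + dx)),
        min(height, math.ceil(bottom + dy)),
    )


def crop_region(image_path, box, out_path):
    """Crop the localized region locally and save it for the focused rerun."""
    from PIL import Image  # Pillow, installed in Setup

    with Image.open(image_path) as img:
        pixel_box = grid_to_pixels(box, img.width, img.height)
        img.crop(pixel_box).save(out_path)
    return pixel_box
```

The saved crop is then sent in a second request with a narrower prompt, for example one that asks only for the value of the localized field.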

Conclusion

To summarize, start simple: use native vision with detail="auto" and no tools when the task is simple and the page is clear.

Raise image detail (detail="original") when text is tiny, handwritten, low-contrast, or scan quality is poor.

Raise verbosity when you need faithful transcription rather than compressed summaries.

Raise reasoning effort when the image is readable but the answer requires combining multiple regions.

Use Code Interpreter for multi-pass inspection (zoom/crop/rotate), especially on dense pages.

For bounding boxes, require a strict contract: [x_min, y_min, x_max, y_max] in a fixed 0..999 coordinate space (top-left origin), and enforce structured JSON output.

If Code Interpreter is unavailable, use crop-and-rerun: localize, crop locally, then run a focused extraction prompt.

In restricted environments, expose lightweight visual tools (crop/zoom/rotate/OCR-region fallback) for tighter control.
