TL;DR
Everyone dreams of hitting 100% accuracy when processing documents with AI.
Retab’s confidence scores help you quantify trust, iterate fast on your extraction prompt and schema with reasoning fields, and reach 98%+ accuracy in minutes.
How? Set n_consensus=5 to spot low-confidence fields, drop a single json_schema_extra={"X-ReasoningPrompt": "…"} into any Pydantic model, and watch shaky fields jump from 0.6 → 1.0 confidence.
Works with any best-in-class model; no extra infra.
Why do LLM pipelines still miss fields in 2025?
The answer is simple: hallucinations.
The models’ imprecisions can take various forms:
An incorrect or even imaginary value is returned.
A field is omitted.
An output requires inference that is not done natively (“Convert °C to °F”).
etc.
LLM hallucinations force teams to double-check the outputs, which often ends up being a major time sink.
Human-in-the-loop vs. out-of-the-loop has long been debated. At Retab, as we move toward fully automated workflows, we aim to gradually remove the human from the loop by introducing probability-weighted answers, combined with “Reasoning” powered by chain-of-thought prompting (for compatible models) to boost output accuracy.
How Retab’s Likelihood Engine works
Here is the simple approach:
Use n-consensus
Retab queries the model n times and ensembles the outputs, dampening randomness and returning a full probability distribution (see our blog post on k-LLMs).
Inspect field‑level likelihood scores
Every key comes back with a 0‑1 score, perfect for thresholds like if score < 0.9 (a toy sketch follows these steps).
Add a Reasoning trace (optional)
Add X‑ReasoningPrompt to any field to capture the model’s step‑by‑step logic, stored separately from the payload.
This takes advantage of chain-of-thought (see OpenAI’s article on Structured Outputs) to drastically increase the model’s accuracy.
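To build intuition for those field-level scores, here is a toy sketch in plain Python. It is not Retab’s actual engine (which returns a full probability distribution); it simply treats a field’s likelihood as the agreement rate across n passes:

from collections import Counter

def field_likelihood(outputs: list[dict], key: str) -> float:
    # Fraction of passes that agree with the most common value for this key.
    values = [str(o.get(key)) for o in outputs]
    _, count = Counter(values).most_common(1)[0]
    return count / len(values)

# Five consensus passes over the same document, disagreeing once on "status".
passes = [
    {"status": "Paid"},
    {"status": "Paid"},
    {"status": "Unpaid"},
    {"status": "Paid"},
    {"status": "Paid"},
]
print(field_likelihood(passes, "status"))  # 0.8 -> below a 0.9 threshold, route to review

In practice you never compute this yourself; Retab returns the scores directly, as the examples below show.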
Hands-on: Upgrade your schema in 2 lines
Example 1
from pydantic import BaseModel, Field
from retab import Retab

class Invoice(BaseModel):
    date: str
    invoice_number: str
    total: str
    status: str = Field(
        ...,
        description="Invoice Status: Blank, Paid or Unpaid",
        json_schema_extra={
            "X-ReasoningPrompt": (
                "If Status is blank, state that explicitly; otherwise return Paid or Unpaid."
            )
        },
    )

client = Retab()
resp = client.documents.extract(
    documents=["invoice.jpg"],
    json_schema=Invoice.model_json_schema(),
    model="gpt-4o-mini",
    n_consensus=5,  # five passes, ensembled into field-level likelihoods
)
print(resp.likelihoods["status"])  # 1.0
print(resp.reasoning["status"])    # "Field absent → mark Blank"

Result: Status jumps from 0.6 → 1.0 likelihood without adding brittle regexes.
Example 2 – Unit conversion on the fly
Same trick, different field:
class TemperatureReport(BaseModel):
    location: str
    temperature: float = Field(
        ...,
        description="Temperature in Fahrenheit",
        json_schema_extra={
            "X-ReasoningPrompt": (
                "If the reported unit is Celsius, convert to Fahrenheit."
            )
        },
    )
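The extraction call mirrors Example 1 (the filename below is just illustrative):

resp = client.documents.extract(
    documents=["weather_report.pdf"],  # illustrative filename
    json_schema=TemperatureReport.model_json_schema(),
    model="gpt-4o-mini",
    n_consensus=5,
)
print(resp.likelihoods["temperature"])
print(resp.reasoning["temperature"])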
The raw file said 22 °C; Retab returned 71.6 °F (22 × 9/5 + 32) with 1.0 likelihood.
Should you enable Reasoning?
Enable it when…
You need human auditability (finance, healthcare).
You’re training reviewers/operators.
Downstream rules depend on high precision.
Skip it when…
Latency is a hard SLA (< 300 ms).
The field is a direct text copy (“Invoice #”).
You already surface bounding-box citations only.
Implementation checklist (How-To)
Schema first: declare every key explicitly.
Start with n_consensus=5; tune down after monitoring.
Set thresholds (score < 0.9) to auto-route low-confidence docs (see the sketch after this checklist).
Store reasoning in a separate column for analytics.
Surface it: overlay bounding boxes + reasoning bubbles on Retab’s platform.
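As a minimal sketch of item 3, assuming resp is an extraction response like the ones above (the threshold and queue names are placeholders):

THRESHOLD = 0.9

def route(resp) -> str:
    # Fields whose consensus score falls below the threshold.
    low_confidence = {k: v for k, v in resp.likelihoods.items() if v < THRESHOLD}
    if low_confidence:
        return "review_queue"  # hand off to a human, e.g. with the shaky fields attached
    return "auto_approve"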
FAQ
How many models are supported?
Retab is compatible with your preferred model providers: OpenAI, Anthropic, Grok, DeepSeek, and more.
Is n‑consensus expensive?
Each extra pass costs one model call, so n_consensus=5 means five calls per document; most teams settle on 3–5.
Conclusion
Trust is the new accuracy: Likelihood scores quantify it, while Reasoning fields justify it.
One-liner upgrade: a tiny json_schema_extra unlocks explainability without extra infrastructure.
Teams win across the stack: devs debug faster, operators review less, and end-users see why a value was extracted.
Don't hesitate to reach out on X or Discord if you have any questions or feedback!
