TL;DR
We built a workflow that ingests State Farm–style Claims Billing Packages—everything from medical bills and repair estimates to legal demand letters—and routes each file through the right Retab extractor.
This workflow is inspired by existing Retab clients in the insurance industry who already run similar pipelines in production—reducing manual review, accelerating financing, and tightening compliance.
The result? In minutes, you get structured, validated JSON across dozens of document types: CPT/HCPCS-coded bills, contractor invoices, loss runs, subrogation recovery documents, litigation packages, and more.
Perfect for:
Claims Operations—cut manual review by 70–90%
Compliance—audit-ready outputs with confidence scores and reasoning traces
Analytics & Finance—aggregate claims, indemnity, and loss data seamlessly
Why insurance packages are so hard to process
When a claims department receives a billing package, it’s rarely one neat PDF. Instead, you get a jumble of document types:
Estimates (auto repair, property damage)
CPT/HCPCS-coded medical bills (CMS-1500, UB-04)
Contractor invoices
Loss-run reports
Subrogation recovery documents
Demand letters from attorneys
Coverage opinions, litigation files, photo evidence
etc.
Traditionally, analysts sift through this mess manually, re-typing data into policy systems.
Each document type follows its own layout and field set, so rule-based OCR templates break as soon as a format changes.
How Retab streamlines the workflow
1. Classify Each Document
We start with a classifier project, built on the Retab platform and trained on typical claims packages. It detects whether a file is a repair estimate, medical bill, contractor invoice, demand letter, or loss run.
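from retab import Retab

# Assumes the Retab Python SDK is installed and RETAB_API_KEY is set in the environment.
client = Retab()

# Run the classifier project on a single file from the package.
clf_res = client.projects.extract(
    project_id=CLASSIFIER_PROJECT_ID,      # ID of your classifier project
    iteration_id=CLASSIFIER_ITERATION_ID,  # iteration of that project to run
    document="loss_run.jpg",
)
label = clf_res.output["document_type"]  # predicted document type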
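A claims package usually holds many files, so the classification call is typically run once per file and the results grouped by label. A minimal sketch, assuming a hypothetical `classify` callable that wraps the `client.projects.extract` call and returns the `document_type` label; `group_by_label`, the label strings, and the file names are illustrative and not part of the Retab SDK:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_label(
    files: Iterable[str],
    classify: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Classify each file and group the paths by predicted document type."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in files:
        groups[classify(path)].append(path)
    return dict(groups)


# Stand-in classifier for illustration only; in a real pipeline this would
# call client.projects.extract(...) and read output["document_type"].
def fake_classify(path: str) -> str:
    return "loss_run" if "loss" in path else "medical_bill"


package = ["loss_run.jpg", "cms1500_page1.pdf", "loss_run_2016.pdf"]
groups = group_by_label(package, fake_classify)
print(groups)
```

Passing the classifier in as a callable keeps the grouping logic testable without network access.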
2. Route to the Right Schema
Each label maps to a dedicated extractor optimized for that document type.
Repair Estimate → Auto repair schema
Medical Bill → CMS-1500/UB-04 schema
Contractor Invoice → Invoice schema
Demand Letter → Legal schema
Loss Run → Loss report schema
etc.
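The mapping above can be sketched as a plain lookup table. The label strings and project IDs below are hypothetical placeholders, not values from the Retab platform; unmapped labels return `None` so they can be sent to manual triage instead of being forced through the wrong extractor:

```python
from typing import Dict, Optional

# Hypothetical label strings and extractor project IDs; replace with your own.
EXTRACTOR_BY_LABEL: Dict[str, str] = {
    "repair_estimate": "proj_auto_repair",
    "medical_bill": "proj_cms1500_ub04",
    "contractor_invoice": "proj_invoice",
    "demand_letter": "proj_legal",
    "loss_run": "proj_loss_report",
}


def route(label: str) -> Optional[str]:
    """Return the extractor project ID for a label, or None if unmapped."""
    # Normalise so "Loss Run" and "loss_run" resolve to the same key.
    key = label.strip().lower().replace(" ", "_")
    return EXTRACTOR_BY_LABEL.get(key)


project_id = route("Loss Run")
unknown = route("Photo Evidence")  # None -> send to manual triage
print(project_id, unknown)
```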
3. Extract Structured Data
Documents are parsed against strict JSON schemas. Field-level confidence scores flag any low-certainty extractions for human review.
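As a rough illustration of the review step, the sketch below flags fields whose confidence falls under a threshold. It assumes the scores are available as a flat mapping from field name to a value between 0 and 1; that shape, the threshold, and the scores are assumptions made for the example, not the actual Retab response format:

```python
from typing import Dict, List

REVIEW_THRESHOLD = 0.85  # illustrative cut-off; tune per field and document type


def fields_needing_review(
    confidences: Dict[str, float],
    threshold: float = REVIEW_THRESHOLD,
) -> List[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


# Made-up scores for illustration; not real Retab output.
scores = {"policy_number": 0.99, "report_date": 0.97, "claim_type": 0.62}
flagged = fields_needing_review(scores)
print(flagged)
```

In practice the threshold would likely differ by field: a low-confidence dollar amount matters more than a low-confidence department name.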
Example output from a Loss Run:
{
  "report_date": "2020-02-19",
  "policy_holder": "New Hanover County Airport Authority",
  "policy_number": "STP108078",
  "policy_effective_period": {
    "start_date": "2015-07-01",
    "end_date": "2016-07-01"
  },
  "claims": [
    {
      "claim_number": "NJ-05035567",
      "claim_type": "MCPD",
      "status": "closed",
      "incident_date": "2015-08-13",
      "claimant_name": "NEW HANOVER COUNTY AIRPORT AUTHORITY",
      "insured_name": "NEW HANOVER COUNTY AIRPORT AUTHORITY",
      "report_date": "2015-08-27",
      "close_date": "2016-11-16",
      "loss_description": "The 5000 diesel tank that is for the main terminal generator started having problems with water in the line and when they went in to investigate a puddle of fluid was found. The insured has contacted Catlin engineers.",
      "line_of_insurance": "Liability/Property Damage",
      "department": "BUILDING",
      "financials": {
        "total_paid": 0,
        "total_incurred": 0,
        "indemnity_paid": 0,
        "medical_paid": 0,
        "vocational_paid": 0,
        "legal_paid": 0,
        "expense_paid": 0,
        "indemnity_reserve": 0,
        "medical_reserve": 0,
        "vocational_reserve": 0,
        "legal_reserve": 0,
        "expense_reserve": 0
      },
      "body_part": "",
      "diagnosis": ""
    }
  ],
  "total_claim_count": 1
}
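A schema constrains field types and required keys, but cross-field rules (date ordering, counts that must match) are usually checked downstream. A minimal sketch of such checks on the loss-run structure, using only the standard library; `validate_loss_run` and the specific rules are illustrative assumptions, not part of Retab:

```python
from datetime import date
from typing import Any, Dict, List


def validate_loss_run(doc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []

    period = doc["policy_effective_period"]
    # date.fromisoformat raises ValueError on malformed dates, so bad input fails fast.
    start = date.fromisoformat(period["start_date"])
    end = date.fromisoformat(period["end_date"])
    if start >= end:
        problems.append("policy period start is not before end")

    if doc["total_claim_count"] != len(doc["claims"]):
        problems.append("total_claim_count does not match number of claims")

    for claim in doc["claims"]:
        incident = date.fromisoformat(claim["incident_date"])
        reported = date.fromisoformat(claim["report_date"])
        if reported < incident:
            problems.append(f"{claim['claim_number']}: reported before incident")

    return problems


# Trimmed copy of the loss-run example, keeping only the fields being checked.
loss_run = {
    "policy_effective_period": {"start_date": "2015-07-01", "end_date": "2016-07-01"},
    "claims": [
        {
            "claim_number": "NJ-05035567",
            "incident_date": "2015-08-13",
            "report_date": "2015-08-27",
        }
    ],
    "total_claim_count": 1,
}
problems = validate_loss_run(loss_run)
print(problems)
```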
This is not regex scraping: Retab enforces schemas, applies k-LLM consensus for accuracy (see our blog post on k-LLM), and preserves reasoning traces for auditability.
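To give an intuition for consensus, the sketch below takes a majority vote per field across several extraction runs and reports how many runs agreed. It is a generic voting scheme written for illustration, not Retab's actual k-LLM implementation, and the `consensus` function and sample runs are made up:

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def consensus(outputs: List[Dict[str, Any]]) -> Dict[str, Tuple[Any, float]]:
    """Majority-vote each field across k outputs.

    Returns {field: (winning_value, agreement_ratio)}.
    """
    if not outputs:
        raise ValueError("need at least one output to vote on")
    result: Dict[str, Tuple[Any, float]] = {}
    fields = {name for out in outputs for name in out}
    for name in sorted(fields):
        # A missing field counts as a vote for None. Values must be hashable.
        votes = Counter(out.get(name) for out in outputs)
        value, count = votes.most_common(1)[0]
        result[name] = (value, count / len(outputs))
    return result


# Three made-up extraction runs over the same document; the third misreads a zero as "O".
runs = [
    {"policy_number": "STP108078", "status": "closed"},
    {"policy_number": "STP108078", "status": "closed"},
    {"policy_number": "STP1O8078", "status": "closed"},
]
merged = consensus(runs)
print(merged)
```

The agreement ratio doubles as a simple confidence signal: a field where only two of three runs agree is a natural candidate for human review.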
Why Retab works well here
Schema Validation → Missing or invalid fields fail fast
Confidence Scores → Route uncertain extractions to human review
Consensus Layer → Multiple LLMs vote on each field to improve accuracy
Auditability → Reasoning traces + source highlighting = regulator-ready
Closing thoughts
Insurance carriers like State Farm process millions of claims documents annually. Every misplaced CPT code or invoice line item can cascade into delays, compliance risks, and cost leakage.
By wiring Retab into the ingestion flow, insurers can:
Cut manual keying costs by >50%
Accelerate claims resolution
Deliver regulators and reinsurers transparent, audit-ready data
If you want to try the pipeline or adapt it to your use case, check out our Platform, Documentation, and the related Notebook on GitHub.
Don't hesitate to reach out on X or Discord if you have any questions or feedback!