A new standard for splitting accuracy

May 11, 2026 • Lukas Bjorkland (Member of Technical Staff), Sacha Ichbiah (Co-Founder and CTO)

Reliable document automation is usually a divide-and-conquer problem. Before a model can extract the right fields, the system has to give it the right context: not too many pages, not a mixed packet, not several document types competing for the same prompt. The cleaner the unit of work, the more focused the extraction call can be.

That is why splitting matters. Many real PDFs arrive as one file even though they contain many documents: forms, schedules, statements, attachments, letters, receipts, and supporting pages. If the system skips the split step, extraction has to do two jobs at once. It must understand the structure of the packet while also pulling fields from the pages inside it. That creates longer prompts, noisier context, and outputs that are harder to debug.

Retab Split separates those jobs. It first converts a compound PDF into coherent subdocuments, then downstream extractors can focus on one unit, one schema, and one task at a time. In tax packets, this is especially important: a state return can follow a federal schedule, a supporting statement can run across several pages, and repeated forms can appear back to back with only subtle visual cues between them.

On PoliTax-Split, a document-splitting benchmark, Retab Split reaches 97.6% page accuracy and sets a new standard for splitting performance.

Results

PoliTax-Split studies compound public tax-return packets released by recent White House administrations. These files are not tidy single-form PDFs: they combine federal and state returns, schedules, statements, and supporting pages into long packets where boundaries can be visually subtle.

Evaluation setup

For each document in PoliTax-Split we call POST /v1/splits once, passing the PDF and the list of expected subdocuments declared by the benchmark, which is the same subdocument taxonomy shown in the SDK example below. Each entry names a tax form that may appear in the packet (Form 1040, Schedule A (Form 1040), misc_form, …) and states whether multiple instances are allowed. No prompts are tuned per document and no examples are passed in: every call uses the same benchmark-wide taxonomy, and only the file changes.
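To make the shape of that taxonomy concrete, here is a minimal Python sketch. It is not the Retab SDK and does not reproduce the real request schema for POST /v1/splits: the field names `name` and `allow_multiple`, the `file` and `subdocuments` keys, the file names, and the flag values are all assumptions chosen for illustration. Only the form names come from the text above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExpectedSubdocument:
    """One taxonomy entry. Field names are hypothetical, not the API's schema."""
    name: str
    allow_multiple: bool


# Placeholder entries: the names appear in the post, the flags are invented.
TAXONOMY = [
    ExpectedSubdocument("Form 1040", allow_multiple=False),
    ExpectedSubdocument("Schedule A (Form 1040)", allow_multiple=False),
    ExpectedSubdocument("misc_form", allow_multiple=True),
]


def taxonomy_payload(taxonomy: list[ExpectedSubdocument]) -> list[dict]:
    """Serialize the taxonomy into plain dicts for a JSON request body."""
    return [asdict(entry) for entry in taxonomy]


# Every packet gets the same taxonomy; only the file reference changes.
packets = ["packet_01.pdf", "packet_02.pdf"]  # hypothetical file names
calls = [
    {"file": path, "subdocuments": taxonomy_payload(TAXONOMY)}
    for path in packets
]
```

Building the taxonomy once and reusing it for every call is what keeps the evaluation free of per-document tuning.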

The API returns a list of subdocuments with page ranges. We compare those ranges to the human-labelled ground truth: a page counts as correct when the predicted subdocument matches the reference label on both type and page. The numbers below come from running this evaluation across all 30 packets and 1,152 labelled subdocuments, with no filtering and no per-document tuning.
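A page-level score of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's scoring harness: the `Subdocument` type and its field names are hypothetical, page ranges are assumed to be 1-indexed and inclusive, and the sketch compares only the type assigned to each page, so it would not penalize merging two adjacent instances of the same form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subdocument:
    """One span of a packet. Field names are illustrative assumptions."""
    doc_type: str
    first_page: int  # 1-indexed, inclusive
    last_page: int   # 1-indexed, inclusive


def page_labels(subdocs: list[Subdocument]) -> dict[int, str]:
    """Expand page ranges into a page -> document type mapping."""
    labels: dict[int, str] = {}
    for sub in subdocs:
        for page in range(sub.first_page, sub.last_page + 1):
            labels[page] = sub.doc_type
    return labels


def page_accuracy(predicted: list[Subdocument],
                  reference: list[Subdocument]) -> float:
    """Fraction of labelled pages whose predicted type matches the reference."""
    pred = page_labels(predicted)
    ref = page_labels(reference)
    if not ref:
        raise ValueError("reference has no labelled pages")
    correct = sum(1 for page, doc_type in ref.items()
                  if pred.get(page) == doc_type)
    return correct / len(ref)


# Toy five-page packet: the prediction extends Schedule A one page too far.
reference = [
    Subdocument("Form 1040", 1, 2),
    Subdocument("Schedule A (Form 1040)", 3, 3),
    Subdocument("misc_form", 4, 5),
]
predicted = [
    Subdocument("Form 1040", 1, 2),
    Subdocument("Schedule A (Form 1040)", 3, 4),
    Subdocument("misc_form", 5, 5),
]
print(page_accuracy(predicted, reference))  # 0.8 (4 of 5 pages correct)
```

In the toy packet only page 4 is mislabelled, which is why a single boundary slip costs one page and not the whole subdocument.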

Numbers

We evaluate the Micro, Small, and Large models on the same PoliTax-Split documents, with the same labels and the same scoring harness, so the comparison isolates the effect of model size. We also include Google Document AI splitter models as external document-splitting baselines on the same packets and labels. At this level of performance, the benchmark is close to saturated: small differences between Large and Small mostly reflect label noise and annotation inconsistencies, not a meaningful gap in model capability.

Most misses are near the right page

Predicted boundary offsets cluster near the correct page, which is exactly what reviewers want: small local checks instead of packet-level reconstruction.
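One way to measure such offsets is sketched below in Python. The post does not define how offsets are computed, so this assumes a simple definition: for each predicted subdocument start page, take the signed distance in pages to the nearest labelled start page. The function name and the toy start pages are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def boundary_offsets(predicted_starts: list[int],
                     reference_starts: list[int]) -> list[int]:
    """Signed page distance from each predicted start to the nearest reference start.

    Zero means the boundary is exact; a positive value means the predicted
    boundary falls after the labelled one.
    """
    ref = sorted(reference_starts)
    if not ref:
        raise ValueError("reference has no boundaries")
    offsets = []
    for start in predicted_starts:
        i = bisect_left(ref, start)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = ref[max(i - 1, 0): i + 1]
        nearest = min(candidates, key=lambda r: abs(start - r))
        offsets.append(start - nearest)
    return offsets


# Toy packet: one predicted boundary lands a page late.
reference_starts = [1, 3, 4, 9]
predicted_starts = [1, 3, 5, 9]
offsets = boundary_offsets(predicted_starts, reference_starts)
histogram = Counter(offsets)
print(offsets)          # [0, 0, 1, 0]
print(dict(histogram))  # {0: 3, 1: 1}
```

A histogram concentrated at 0 and ±1 is the pattern described above: a reviewer checks the neighbouring page and does not have to rebuild the packet.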

Conclusion

Splitting defines the accuracy ceiling for the entire document workflow. Once a packet is separated into coherent subdocuments, extraction no longer has to reason over every page at once. Each downstream call receives only the pages it needs, the document type it expects, and the schema that fits.

That is the practical value of Retab Split. Long PDFs become usable for end-to-end automation. Instead of treating a tax packet as one ambiguous document, teams can route each detected form, schedule, statement, or attachment into the correct extraction path. Better splitting creates cleaner context, shorter prompts, fewer cross-document errors, and, in turn, higher extraction accuracy downstream.

PoliTax-Split makes this concrete: Retab Split reaches 97.6% page accuracy across 30 packets and 1,152 labelled subdocuments, which gives the downstream extraction pipeline a dependable starting point. The full reproduction kit is available in the split-benchmark repository.