Image vs Text—Choosing the Right Modality for Every Document | Retab Blog

TL;DR
We share our decision tree for choosing between text and vision modalities when processing documents.
For short docs (≤3 pages, e.g. scans or Excel sheets), vision wins.
For long, text-heavy docs (>3 pages, e.g. contracts, research papers, or equity notes), text processing is faster and more accurate.
Emails and web pages often need a hybrid approach, with text for content and vision for layout.
Retab’s flexible API lets you switch modalities easily for optimal results.

Introduction

At Retab, after processing hundreds of millions of pages for companies ranging from seed-stage startups to listed enterprises, we are still frequently asked:

“Should this document be handled as text or as an image?”

Well, it depends.

Below is the exact decision tree we rely on, battle-tested across PDF hellscapes, Excel monstrosities, and email threads that should have been tickets.

Feel free to borrow these rules for your own pipelines.

Modalities at a Glance

When processing documents, selecting the right modality is crucial for efficiency and accuracy.

The two primary modalities available today in most large language models are text & vision, with audio emerging but not yet widely applicable.

Because the best modality for a given document is often not obvious, we made it easy to switch between modalities with Retab:

from retab import Retab

reclient = Retab()

doc_msg = reclient.documents.create_messages(
    document="contract.pdf",
    modality="text", # or "image", "image+text", "native"
)

Note that Retab supports a wide range of file types, so you can run all your use cases through the same processing pipeline.

from typing import Literal

# Text-Based Files
TEXT_TYPES = Literal[
".txt", ".csv", ".tsv", ".md", ".log", ".html", ".htm", ".xml", ".json", ".yaml", ".yml",
".rtf", ".ini", ".conf", ".cfg", ".nfo", ".srt", ".sql", ".sh", ".bat", ".ps1", ".js", ".jsx",
".ts", ".tsx", ".py", ".java", ".c", ".cpp", ".cs", ".rb", ".php", ".swift", ".kt", ".go", ".rs",
".pl", ".r", ".m", ".scala"
]

# Image-Based Files
IMAGE_TYPES = Literal[".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"]

# Document Files (image or text depending on the length)
EXCEL_TYPES = Literal[".xls", ".xlsx", ".ods"]
WORD_TYPES = Literal[".doc", ".docx", ".odt"]
PPT_TYPES = Literal[".ppt", ".pptx", ".odp"]
PDF_TYPES = Literal[".pdf"]

# Email Files
EMAIL_TYPES = Literal[".eml", ".msg"]  # Containers that can embed other files (attachments, nested messages)

# Web Files
WEB_TYPES = Literal[".mhtml"]
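As a rough illustration of how these groups can drive routing, here is a minimal sketch that maps a filename to one of the categories above. The `categorize` helper and the category names are hypothetical (they are not part of the Retab SDK), and the text-extension list is abbreviated for brevity.

```python
from pathlib import Path

# Hypothetical lookup table built from the extension groups listed above.
# The "text" group is abbreviated; extend it with the full list as needed.
CATEGORY_BY_EXTENSION = {
    **dict.fromkeys([".txt", ".csv", ".tsv", ".md", ".html", ".json", ".py"], "text"),
    **dict.fromkeys([".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"], "image"),
    **dict.fromkeys([".xls", ".xlsx", ".ods"], "excel"),
    **dict.fromkeys([".doc", ".docx", ".odt"], "word"),
    **dict.fromkeys([".ppt", ".pptx", ".odp"], "ppt"),
    ".pdf": "pdf",
    ".eml": "email",
    ".msg": "email",
    ".mhtml": "web",
}


def categorize(filename: str) -> str:
    """Return the category for a filename, or "unknown" if unsupported."""
    # Only the last suffix is considered, and the match is case-insensitive.
    extension = Path(filename).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(extension, "unknown")
```

Matching on the lowercased suffix keeps files such as `Report.PDF` in the right group, while anything unrecognized falls through to "unknown" instead of being guessed.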

Short Docs

Short documents (≤3 pages) are often scans or images packed with information-dense visuals. Excel files, and tabular data in general, are an interesting case study: much of their meaning is spatial (the first spreadsheet program for personal computers was literally named "VisiCalc", short for "visible calculator").

In such cases, vision language models (VLMs) are more efficient than text-based models, making image processing the best modality.

See the research paper "DocVLM: Make Your VLM an Efficient Reader" for more information.

Long Docs

Long documents (>3 pages) are often text-heavy and structured with paragraphs, headings, and lists. Reports, equity research papers, and market research studies are typical examples: the key content is spread across long paragraphs.

Using images for long documents can be challenging because high-quality images take up a lot of space and slow things down. Lowering the image resolution to save space often reduces quality and leads to hallucinations.
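To get a feel for the trade-off, the back-of-the-envelope sketch below computes the uncompressed RGB size of one US Letter page (8.5 × 11 inches) at a few resolutions. The `raw_page_bytes` helper is hypothetical, and the numbers are raw pixel buffers only: compressed files are smaller, and how a given model consumes an image depends on the provider.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int, channels: int = 3) -> int:
    """Uncompressed size in bytes of a page rendered at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # One byte per channel per pixel (8-bit RGB by default).
    return width_px * height_px * channels


# Compare a few common rendering resolutions for a single US Letter page.
for dpi in (72, 150, 300):
    size_mb = raw_page_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} DPI: {size_mb:.1f} MB uncompressed")
```

Pixel count grows with the square of the resolution, so halving the DPI cuts the data by a factor of four. That is why downscaling is tempting on long documents, and also why small print becomes unreadable so quickly.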

Text-based processing excels here, offering scalability and speed without sacrificing accuracy.

Images

When working with standalone images (e.g., .jpg, .png, .bmp, etc.), using the vision modality is the natural choice.

Emails

The complexity of emails lies in the fact that they contain both a body and attachments—each requiring a different approach to ensure accurate processing and data extraction.

Email Body

The body of an email is more than just plain text—it often includes HTML formatting, embedded images, and links.

HTML emails can contain styled elements such as tables, fonts, and inline images that present information in a structured way. Processing the email body as an image captures this full visual representation, so layouts, branding elements, and inline attachments are accurately preserved.

Note that for emails with simpler content, text processing can be more efficient for extracting key information such as dates, sender details, and action items.

It's also important to consider email threading, where multiple messages are grouped together in a conversation. Efficient processing should be able to separate these threads and extract relevant content without losing context.

For this reason, Retab puts both the text content of the body and images rendering its HTML content into the LLM's context window.
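For readers building something similar, here is a minimal sketch of the text half of that idea using Python's standard `email` package. It pulls the plain-text and HTML versions of the body out of a raw `.eml` payload. The `extract_body_parts` helper is hypothetical and is not how Retab implements this internally; rendering the HTML to an image requires a browser or HTML renderer and is left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_body_parts(raw: bytes) -> dict:
    """Return the subject plus the plain-text and HTML bodies (None if absent)."""
    msg = message_from_bytes(raw, policy=policy.default)
    plain = msg.get_body(preferencelist=("plain",))
    html = msg.get_body(preferencelist=("html",))
    return {
        "subject": str(msg.get("subject", "")),
        "text": plain.get_content() if plain is not None else None,
        "html": html.get_content() if html is not None else None,
    }


# Usage with a small in-memory message that has both body variants.
sample = EmailMessage()
sample["Subject"] = "Q3 invoice"
sample.set_content("Invoice attached.")
sample.add_alternative("<p>Invoice <b>attached</b>.</p>", subtype="html")

parts = extract_body_parts(sample.as_bytes())
```

The plain-text part can go into the prompt as is, while the HTML part is what a renderer would turn into page images for the vision side.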

Email Attachments

Attachments come in many different formats—PDFs, Excel files, Word documents, and even images. Each type should be processed using the most suitable method:

Text-based attachments (e.g., Word, PDF with selectable text, CSV) should be handled using text processing for easy data extraction.
Image-based attachments (e.g., scanned PDFs, JPEGs, PNGs) should be processed with OCR to extract any embedded text.
Excel files should be processed either as images for visual fidelity or as text when data needs to be structured for analysis.

Handling attachments efficiently ensures that all relevant information within an email thread is accurately captured and categorised, enhancing data retrieval and workflow automation.
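The routing rules above can be sketched with the standard `email` package as follows. The helper names and route labels are hypothetical. PDFs are flagged for inspection because a filename alone cannot tell a scanned PDF from one with selectable text.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import Path

# Hypothetical route labels, keyed by file extension.
ROUTE_BY_EXTENSION = {
    **dict.fromkeys([".doc", ".docx", ".odt", ".csv", ".txt"], "text"),
    **dict.fromkeys([".jpg", ".jpeg", ".png", ".tiff"], "image"),
    **dict.fromkeys([".xls", ".xlsx", ".ods"], "image_or_text"),
    ".pdf": "inspect",  # scanned vs. selectable text must be checked in the file
}


def route_attachment(filename: str) -> str:
    """Pick a processing route for one attachment from its extension."""
    return ROUTE_BY_EXTENSION.get(Path(filename).suffix.lower(), "unknown")


def list_attachment_routes(raw: bytes) -> list:
    """Return (filename, route) pairs for every attachment in a raw email."""
    msg = message_from_bytes(raw, policy=policy.default)
    routes = []
    for part in msg.iter_attachments():
        filename = part.get_filename() or ""
        routes.append((filename, route_attachment(filename)))
    return routes


# Usage with an in-memory message carrying three dummy attachments.
sample = EmailMessage()
sample["Subject"] = "Monthly report"
sample.set_content("See attached.")
for name in ("data.csv", "scan.png", "budget.xlsx"):
    sample.add_attachment(
        b"placeholder", maintype="application", subtype="octet-stream", filename=name
    )

routes = list_attachment_routes(sample.as_bytes())
```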

Web Pages

Web pages can be saved as a single file (e.g., `.mhtml`) whose structure closely resembles that of an email. Like emails, web pages contain both text and embedded images, which makes image processing a great choice.
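Because MHTML is a MIME multipart container, Python's standard `email` parser can usually open it. The sketch below splits a saved page into its HTML parts and embedded images. The `split_mhtml` helper is hypothetical, and it assumes the file is well-formed MIME with a `Content-Location` header on each resource, which may not hold for every browser's output.

```python
from email import message_from_bytes, policy


def split_mhtml(raw: bytes):
    """Split an MHTML payload into HTML strings and (location, bytes) images."""
    msg = message_from_bytes(raw, policy=policy.default)
    html_parts = []
    image_parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the container itself, keep only leaf parts
        if part.get_content_type() == "text/html":
            html_parts.append(part.get_content())
        elif part.get_content_maintype() == "image":
            location = str(part.get("Content-Location", ""))
            image_parts.append((location, part.get_content()))
    return html_parts, image_parts
```

The HTML parts feed the text side of a pipeline, while the embedded images (or a full rendering of the page) feed the vision side.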

A Hybrid Approach: Image + Text Modality

In some cases, combining both modalities offers the best results.

For example, on complex documents, the LLM benefits from the image to understand the layout but may struggle to read the text precisely from pixels alone. Putting both the image and the extracted text in the LLM's context window gets the best of both worlds.

from retab import Retab

reclient = Retab()

doc_msg = reclient.documents.create_messages(
    document="contract.pdf",
    modality="image+text",
)
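Putting the heuristics from this post together, the decision tree can be sketched as a single function. This is an illustrative summary rather than Retab's internal logic: the `choose_modality` helper and its `kind` labels are hypothetical, and it assumes the file has already been categorized and, for document files, that its page count is known. The returned strings match the `modality` values shown above.

```python
from typing import Optional

SHORT_DOC_MAX_PAGES = 3  # threshold used throughout this post


def choose_modality(kind: str, page_count: Optional[int] = None) -> str:
    """Map a file kind (and page count for documents) to a modality string."""
    if kind == "image":
        return "image"
    if kind == "text":
        return "text"
    if kind in ("email", "web"):
        # Text for content, vision for layout.
        return "image+text"
    if kind == "document":  # PDF, Word, Excel, PowerPoint
        if page_count is None:
            raise ValueError("page_count is required for document files")
        return "image" if page_count <= SHORT_DOC_MAX_PAGES else "text"
    raise ValueError(f"unsupported kind: {kind!r}")
```

Treat the output as a default rather than a verdict: a two-page contract made of dense prose may still do better as text, and a long but layout-heavy report may deserve "image+text".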

Conclusion

Choosing the right way to process a document, whether text, vision, or both, makes a huge difference in the accuracy and efficiency of your results. The heuristics above are a good starting point for your own pipelines.

Don't hesitate to reach out on X or Discord if you have any questions or feedback!

retab.com