Document Processing is Not a Single Problem: Three Core Challenges

“Just upload it to Copilot.”

That was the first suggestion we got when we started building a document processing pipeline for a large enterprise client.

Document processing is a considerable challenge, and given that Microsoft has poured billions into AI products like Copilot, why build something custom when there’s a solution out of the box?

So we uploaded a batch of technical PDFs — which included sustainability reports, building guidelines and inspection handbooks — into Copilot.

The results were instructive, but not in the way we’d hoped.

The moment we knew we had a problem

One of our test documents contained a table mapping building materials to how frequently they are inspected. It looked simple enough: rows for materials, columns for inspection types, cells containing timeframes. However, when we asked Copilot to summarize the inspection schedule for concrete facades, we got back a confident answer that bore no resemblance to what the table actually said. Nested headers had been flattened, merged cells had been misattributed and the structure that gave the table meaning was gone.

Next we tried diagrams, such as a process flowchart showing the decision tree for facade repairs. Copilot’s response was fragmented: disconnected labels extracted from the boxes, with no relationship to the arrows, branches or sequence that made it a process.

When we asked the tool for the source of one of its outputs, we got vague references to whole documents, with no page numbers or section references. In effect, Copilot was saying “somewhere in this file.”

These weren’t edge cases. For enterprise document processing, where accuracy is non-negotiable and auditability is mandatory, they were disqualifying.

Three problems wearing one mask

Here’s what we learned: while parsing PDFs might sound like a single problem, it’s actually three, and each one requires a fundamentally different approach.

Text extraction with layout awareness. Multi-column layouts, headers, footers, reading order. What’s a heading versus a caption? What flows where? Get this wrong and your downstream processing inherits garbage.
Table extraction with structure. Rows, columns, nested headers and merged cells. A table that looks perfectly clear in a PDF becomes nonsense when it’s extracted badly. If the tool you’re using isn’t properly designed for tables, you risk extracting only the text content, which means the relationships between cells, often where the meaning lives, are lost.
Image interpretation. Diagrams, flowcharts and schematics. Traditional OCR sees text fragments at coordinates. It doesn’t understand that boxes and arrows represent a process, or that panels in a schematic belong together.
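
To make the table problem concrete, here is a minimal sketch using made-up data (the materials, inspection types and intervals are illustrative, not taken from the client documents). It contrasts a structure-preserving representation, where each cell is keyed by its row and its full nested column header, with the flat text that a text-only extraction would leave behind.

```python
# Illustrative data only: a table with a two-level column header.
# Each key is (row label, (header group, header sub-column)).
table = {
    ("Concrete facade", ("Visual", "Routine")): "12 months",
    ("Concrete facade", ("Visual", "Detailed")): "5 years",
    ("Concrete facade", ("Structural", "Core sample")): "10 years",
    ("Timber cladding", ("Visual", "Routine")): "6 months",
    ("Timber cladding", ("Visual", "Detailed")): "2 years",
    ("Timber cladding", ("Structural", "Core sample")): "n/a",
}


def schedule_for(material: str) -> dict[str, str]:
    """Return every interval for one material, labelled by its full header path."""
    return {
        " > ".join(header): interval
        for (row, header), interval in table.items()
        if row == material
    }


def flatten(cells: dict) -> str:
    """Mimic text-only extraction: cell values in reading order, structure dropped."""
    return " ".join(cells.values())


print(schedule_for("Concrete facade"))
print(flatten(table))
```

With the structure intact, the question about concrete facades has one unambiguous answer per inspection type. In the flattened string, nothing says which interval belongs to which material or header, so any answer built on it is a guess.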

No single tool — including Copilot — solves all three well. What we needed was proper parsing: not just extracting text from a PDF, but understanding its structure well enough to preserve meaning. Parsing means recognizing that a cluster of cells is a table, that a sequence of boxes connected by arrows is a process and that a bold line followed by smaller text is a heading and its body. Without parsing, you’re feeding an LLM a bag of words.

This meant we needed a hybrid approach; in other words, different tools for different parsing problems combined into a single pipeline.

Assembling the pipeline

For text and layout, Azure Document Intelligence became our foundation. Its prebuilt-layout model outputs markdown with reasonable structure preservation: headings, paragraphs and tables rendered as markdown. It handles the bulk of extraction reliably and at scale. But it is purely extraction; there’s no ‘reasoning’ or interpretation happening. It can also still struggle with deeply nested tables, and it treats figures as scattered text fragments.
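
One practical benefit of getting markdown rather than raw text is that the heading structure can be carried into later stages. The sketch below is our own illustration and not part of Azure Document Intelligence: it assumes the extracted content uses ATX-style "#" headings, and split_by_heading is a hypothetical helper that keeps each block of body text attached to the heading it sits under.

```python
import re

# Matches ATX-style markdown headings such as "## Inspection intervals".
# Simplification: a "#" line inside a fenced code block would also match.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> list[dict]:
    """Split markdown into sections, each tagged with its heading and level."""
    sections = []
    heading, level, body = None, 0, []

    def flush():
        # Store the section collected so far, skipping empty preambles.
        text = "\n".join(body).strip()
        if heading is not None or text:
            sections.append({"heading": heading, "level": level, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, level, body = match.group(2), len(match.group(1)), []
        else:
            body.append(line)
    flush()
    return sections


sample = """# Facade handbook
Intro paragraph.

## Inspection intervals
| Material | Interval |
| --- | --- |
| Concrete | 12 months |
"""

for section in split_by_heading(sample):
    print(section["level"], section["heading"], "->", len(section["text"]), "chars")
```

Each section now carries a title that can later serve as a section reference, which is exactly what the vague “somewhere in this file” answers were missing.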

For figure extraction, we turned to Docling, an open-source tool that offered something most alternatives don’t: polygon cropping. Most tools draw a bounding box around a figure and crop. But diagrams aren’t rectangular: they have irregular shapes, overlapping captions and whitespace. Docling instead uses polygon coordinates to crop precisely around the actual figure boundaries. It also gave us sequential figure numbering and metadata linking each figure to its source document and page, which proved essential for citations later.
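
The difference between the two cropping strategies is easy to see on a toy example. The following is a stdlib-only sketch and does not use Docling’s API: the character-grid "page", the L-shaped polygon and the helper names are all invented for illustration. It blanks out everything outside the polygon with a standard ray-casting test, whereas a bounding-box crop keeps every pixel in the enclosing rectangle.

```python
def point_in_polygon(x: float, y: float, polygon: list[tuple[float, float]]) -> bool:
    """Ray-casting test: toggle 'inside' for each edge a rightward ray crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y coordinate can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(polygon):
    """Smallest axis-aligned rectangle containing the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def crop(polygon, use_polygon: bool) -> list[str]:
    """Crop to the bounding box; optionally blank pixels that fall outside the polygon."""
    left, top, right, bottom = bounding_box(polygon)
    rows = []
    for y in range(int(top), int(bottom)):
        row = ""
        for x in range(int(left), int(right)):
            # Test the pixel centre so edge pixels are classified consistently.
            keep = not use_polygon or point_in_polygon(x + 0.5, y + 0.5, polygon)
            row += "#" if keep else "."
        rows.append(row)
    return rows


# Hypothetical L-shaped figure outline, e.g. a diagram with a caption beside its lower half.
outline = [(0, 0), (6, 0), (6, 2), (3, 2), (3, 4), (0, 4)]

print("\n".join(crop(outline, use_polygon=False)))
print()
print("\n".join(crop(outline, use_polygon=True)))
```

In this toy case the bounding-box crop keeps 24 pixels while the polygon crop keeps 18; the difference is the neighbouring content that would otherwise leak into the figure handed to the next stage.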

For image understanding, we stopped treating the problem as optical character recognition (OCR) and started treating it as comprehension. Traditional OCR looks at a flowchart and sees disjointed text — “Input”, “Process”, “Output” — all at different coordinates. It doesn’t understand arrows, paths or branches.

Instead, for each extracted figure we called a vision-capable language model. We chose GPT-5-mini for its balance of quality and cost. Its latency is higher than that of other models, but for a pipeline that extracts once and queries many times, that trade-off works.

We asked the model: “Describe what this diagram shows.”

The response might look like: “This flowchart illustrates the facade inspection process. Starting from the initial assessment, it branches into three paths based on damage severity…”

We then embed that description back into the document. Now, when someone searches for “facade inspection process,” they find the page containing that diagram — even if those exact words never appeared in the original PDF. The model understands spatial relationships, sequences and hierarchies in a way traditional OCR never could.
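
The describe-and-embed step can be sketched as follows. This is a minimal illustration rather than our production code: Figure, embed_descriptions and stub_vision_model are hypothetical names, and the vision model is passed in as a plain callable so the example runs without an API key (a stub stands in for the real model call). Each description is appended to the text of the page its figure came from, labelled with the figure number, source and page so that a citation can point back to it.

```python
from dataclasses import dataclass
from typing import Callable

PROMPT = "Describe what this diagram shows."


@dataclass
class Figure:
    number: int         # sequential figure number within the document
    source: str         # source document name
    page: int           # 1-based page the figure was cropped from
    image_bytes: bytes  # cropped figure image


def embed_descriptions(
    pages: dict[int, str],
    figures: list[Figure],
    describe: Callable[[str, bytes], str],
) -> dict[int, str]:
    """Return a copy of the page texts with figure descriptions appended per page."""
    enriched = dict(pages)
    for fig in sorted(figures, key=lambda f: f.number):
        description = describe(PROMPT, fig.image_bytes)
        block = f"\n\n[Figure {fig.number}, {fig.source}, page {fig.page}] {description}"
        enriched[fig.page] = enriched.get(fig.page, "") + block
    return enriched


def stub_vision_model(prompt: str, image: bytes) -> str:
    """Stand-in for a real vision-capable model call (an assumption of this sketch)."""
    return "This flowchart illustrates the facade inspection process."


pages = {1: "Repair guidance for external walls.", 2: "Appendix."}
figures = [Figure(number=1, source="handbook.pdf", page=1, image_bytes=b"\x89PNG")]
enriched = embed_descriptions(pages, figures, stub_vision_model)
print(enriched[1])
```

Because the description becomes part of the page text itself, whatever indexes that page also indexes the diagram’s meaning.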

What we learned along the way

The process surfaced some important things about document processing:

Positional fidelity is overrated. We initially tried to re-insert figures at their exact original positions in the extracted markdown. It was complex and fragile. Eventually we moved them to the end of each page, and retrieval quality didn’t change. The lesson: don’t over-engineer positional accuracy if you don’t need it.
VLM descriptions are first-class content. It’s important to treat image descriptions as real text in your search index, not metadata. This dramatically improves retrieval for diagram-heavy documents.
Citations need explicit architecture. Your search index must have URL and title as non-null, searchable fields across all document types. We learned this the hard way: if you miss it, citations break in ways that are painful to debug.
Rate limits are not an afterthought. Both Azure Document Intelligence and OpenAI enforce strict throttling. Backoff strategies, sensible parallelization, and batching need to be designed into the pipeline from day one, not bolted on when you hit walls in production.
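
As one way to design throttling in from the start, here is a minimal sketch of exponential backoff with jitter. It is generic rather than tied to either service: RateLimitError is a placeholder exception defined in the sketch (real SDKs raise their own error types and may also supply a retry-after hint worth honouring), and the delay parameters are arbitrary defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for whatever throttling error the real client raises."""


def call_with_backoff(func, *, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait up to base_delay * 2**attempt (capped), then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: randomise the wait so parallel workers don't retry in lockstep.
            sleep(random.uniform(0, delay))


# Usage: a fake call that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


waits = []
result = call_with_backoff(flaky_call, sleep=waits.append)  # record waits instead of sleeping
print(result, attempts["count"], len(waits))
```

Injecting the sleep function keeps the retry logic testable without real delays, and the same wrapper can sit around both the extraction call and the vision model call.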

The real lesson

Copilot works fine for simple scenarios: a handful of documents, mostly text, citations optional. For that use case, “just upload it” is genuinely good advice.

But for complex enterprise PDFs at scale — tables with nested structure, diagrams that carry meaning and hundreds of documents requiring precise citations — you’ll end up building your own pipeline. And it will be hybrid: one tool for text, another for figures, a third for interpretation.

The deeper insight isn’t that Copilot fails; it’s that document understanding was never a single problem to begin with. We just treated it that way because PDFs look like a single format. They’re not. They’re containers for fundamentally different types of information, each demanding its own approach.

Once we stopped looking for a single tool and started matching tools to problems, things didn’t instantly become easy, but they did become simpler. And, most importantly, the pipeline actually worked.

Originally published at https://www.thoughtworks.com.