Document Processing Is Not a Single Problem: Three Core Challenges
Source: Thoughtworks
By Alfred Subietas I Oliveras
"Just upload it to Copilot."
That was the first suggestion we got when we started building a document processing pipeline for a large enterprise client.
It's a considerable challenge, and given that Microsoft has poured billions into AI products like Copilot, why build something custom when there's a solution out of the box?
So we uploaded a batch of technical PDFs, including sustainability reports, building guidelines and inspection handbooks, into Copilot.
The results were instructive, but not in the way we'd hoped.
The moment we knew we had a problem
One of our test documents contained a table which mapped building materials to how frequently they are inspected. That was simple enough: rows for materials, columns for inspection types, cells containing timeframes. However, when we asked Copilot to summarize the inspection schedule for concrete facades, we got back a confident answer that bore no resemblance to what the table actually said. Nested headers had been flattened, merged cells had been misattributed and the structure that gave the table meaning was gone.
Next we tried diagrams, such as a process flowchart showing the decision tree for facade repairs. Copilot's response was fragmented: disconnected labels extracted from boxes, with no relationship to the arrows, branches or sequence that made it a process.
When we asked the tool for the source of one of its outputs, we got vague references to documents, with no page numbers or section references. In effect, Copilot simply said "somewhere in this file."
These weren't edge cases. For enterprise document processing, where accuracy is non-negotiable and auditability is mandatory, they were disqualifying.
Three problems wearing one mask
Here's what we learned: while parsing PDFs might sound like a single problem, it's actually three, and each one requires a fundamentally different approach.
- Text extraction with layout awareness. Multi-column layouts, headers, footers, reading order. What's a heading versus a caption? What flows where? Get this wrong and your downstream processing inherits garbage.
- Table extraction with structure. Rows, columns, nested headers and merged cells. A table that looks perfectly clear in a PDF becomes nonsense when it's extracted badly. If the tool you're using isn't designed for tables, you risk extracting only the text content, and the relationships between cells, which are often where the meaning lives, will be lost.
- Image interpretation. Diagrams, flowcharts and schematics. Traditional OCR sees text fragments at coordinates. It doesn't understand that boxes and arrows represent a process, or that panels in a schematic belong together.
No single tool, including Copilot, solves all three well. What we needed was proper parsing: not just extracting text from a PDF, but understanding its structure well enough to preserve meaning. Parsing means recognizing that a cluster of cells is a table, that a sequence of boxes connected by arrows is a process and that a bold line followed by smaller text is a heading and its body. Without parsing, you're feeding an LLM a bag of words.
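To make the "bag of words" point concrete, here is a minimal sketch of the difference between keeping a table's structure and keeping only its text. It is not code from the project: the `Cell` type, the helper names and every value in the table are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Cell:
    row: str                 # row header, e.g. a building material
    column: Tuple[str, ...]  # full header path, outermost header first
    value: str


# Hypothetical table: these values are made up, not taken from any document.
cells = [
    Cell("Concrete", ("Visual", "Routine"), "12 months"),
    Cell("Concrete", ("Visual", "Detailed"), "5 years"),
    Cell("Concrete", ("Structural",), "10 years"),
    Cell("Glass", ("Visual", "Routine"), "6 months"),
]


def lookup(cells: List[Cell], row: str, column: Tuple[str, ...]) -> Optional[str]:
    """Return the value at (row, header path), or None if there is no such cell."""
    for cell in cells:
        if cell.row == row and cell.column == column:
            return cell.value
    return None


def bag_of_words(cells: List[Cell]) -> List[str]:
    """What text-only extraction keeps: the tokens, without their relationships."""
    words = set()
    for cell in cells:
        words.update([cell.row, *cell.column, cell.value])
    return sorted(words)


print(lookup(cells, "Concrete", ("Visual", "Detailed")))
print(bag_of_words(cells))
```

With the header path preserved, "detailed visual inspection of concrete" resolves to exactly one cell. The flattened version still contains every word, but nothing says which timeframe belongs to which material or inspection type.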
This meant we needed a hybrid approach; in other words, different tools for different parsing problems combined into a single pipeline.
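As a rough sketch of what "different tools for different problems" can look like in code, the snippet below routes each page element to a handler by its kind. The element kinds and the three handler functions are placeholders assumed for illustration; in a real pipeline each would wrap an actual extraction or vision service.

```python
from typing import Callable, Dict, List, Tuple

Element = Tuple[str, str]  # (kind, payload); the kinds used here are hypothetical


def parse_text(payload: str) -> str:
    return f"[text] {payload}"      # placeholder for layout-aware text extraction


def parse_table(payload: str) -> str:
    return f"[table] {payload}"     # placeholder for structure-preserving table parsing


def describe_figure(payload: str) -> str:
    return f"[figure] {payload}"    # placeholder for vision-model interpretation


HANDLERS: Dict[str, Callable[[str], str]] = {
    "text": parse_text,
    "table": parse_table,
    "figure": describe_figure,
}


def run_pipeline(elements: List[Element]) -> List[str]:
    """Send each element to the handler registered for its kind."""
    results = []
    for kind, payload in elements:
        handler = HANDLERS.get(kind)
        if handler is None:
            # Fail loudly rather than silently dropping content.
            raise ValueError(f"no handler for element kind {kind!r}")
        results.append(handler(payload))
    return results


print(run_pipeline([("text", "Intro paragraph"), ("figure", "flowchart.png")]))
```

Keeping the dispatch table explicit makes it cheap to swap one tool out without touching the others.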
Assembling the pipeline
For text and layout, Azure Document Intelligence became our foundation. Its prebuilt-layout model outputs markdown with reasonable structure preservation: headings, paragraphs, and tables rendered as markdown. It handles the bulk of extraction reliably and at scale. But it's purely extraction; there's no "reasoning" or interpretation happening. It's also worth noting that it can still struggle with deeply nested tables and will treat figures as scattered text fragments.
For figure extraction, we turned to Docling, an open-source tool that offered something most alternatives don't: polygon cropping. Most tools draw a bounding box around a figure and crop. But diagrams aren't rectangular: they have irregular shapes, overlapping captions and whitespace. Docling, meanwhile, uses polygon coordinates to crop precisely around actual figure boundaries. It also gave us sequential figure numbering and metadata linking each figure to its source document and page, which is essential for citations later.
For image understanding, we stopped treating the problem as optical character recognition (OCR) and started treating it as comprehension. Traditional OCR looks at a flowchart and sees disjointed text ("Input", "Process", "Output"), all at different coordinates. It doesn't understand arrows, paths or branches.
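The difference between the two cropping strategies can be shown with a small, generic sketch. This is not Docling's implementation or API. It is a plain ray-casting point-in-polygon test applied to a hypothetical L-shaped figure outline, counting how many pixels each kind of crop would keep.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(polygon: List[Point]) -> Tuple[int, int, int, int]:
    """Smallest axis-aligned rectangle containing the polygon (integer pixels)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))


def point_in_polygon(x: float, y: float, polygon: List[Point]) -> bool:
    """Ray-casting (even-odd) test for a simple polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def pixels_kept(polygon: List[Point]) -> Tuple[int, int]:
    """Count pixel centres kept by a bounding-box crop and by a polygon crop."""
    x0, y0, x1, y1 = bounding_box(polygon)
    in_box = in_poly = 0
    for py in range(y0, y1):
        for px in range(x0, x1):
            in_box += 1
            if point_in_polygon(px + 0.5, py + 0.5, polygon):
                in_poly += 1
    return in_box, in_poly


# Hypothetical L-shaped figure outline, in pixel coordinates.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
print(pixels_kept(l_shape))  # (16, 12)
```

For this outline the bounding box keeps 16 pixels while the polygon keeps 12. The four extra pixels are the region where a neighbouring caption or paragraph would leak into a rectangular crop.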
Instead, for each extracted figure we called a vision-capable language model. We chose GPT-5-mini for its balance of quality and cost. Although latency is higher than other models, for a pipeline that extracts once and queries many times, that trade-off works.
We asked the model: "Describe what this diagram shows."
The response might look like: "This flowchart illustrates the facade inspection process. Starting from the initial assessment, it branches into three paths based on damage severity..."
We then embed that description back into the document. Now, when someone searches for "facade inspection process," they find the page containing that diagram, even if those exact words never appeared in the original PDF. The model understands spatial relationships, sequences and hierarchies in a way traditional OCR never could.
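A minimal sketch of that embedding step follows, under some stated assumptions: the `Figure` fields, the file name and page number are hypothetical, the description is assumed to have already been returned by the vision model, and a naive keyword match stands in for a real search index. Appending the descriptions to the end of the page text is just one simple placement choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Figure:
    number: int       # sequential figure number
    source: str       # source document (hypothetical file name below)
    page: int         # page the figure was extracted from
    description: str  # text previously returned by the vision model


def embed_descriptions(page_markdown: str, figures: List[Figure]) -> str:
    """Append each figure description, with its provenance, to the page text."""
    blocks = [
        f"[Figure {f.number} | {f.source}, page {f.page}] {f.description}"
        for f in sorted(figures, key=lambda f: f.number)
    ]
    return "\n\n".join([page_markdown.rstrip(), *blocks])


def matches(query: str, text: str) -> bool:
    """Naive keyword match: every query term must appear somewhere in the text."""
    haystack = text.lower()
    return all(term in haystack for term in query.lower().split())


page = "## Facade repairs\nSee the diagram on this page."
figure = Figure(1, "handbook.pdf", 12,
                "This flowchart illustrates the facade inspection process.")
enriched = embed_descriptions(page, [figure])

print(matches("facade inspection process", page))      # False
print(matches("facade inspection process", enriched))  # True
```

Carrying the source and page alongside the description means a hit on the diagram can still be cited precisely.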
What we learned along the way
The process surfaced some important things about document processing:
- Positional fidelity is overrated. We initially tried to re-insert figures at their exact original positions in the extracted markdown. It was complex and fragile. Eventually we moved them to the end of each page, and retrieval quality didn't change. The lesson: don't over-engineer positional accuracy if you don't need it.
- VLM descriptions are first-class content. Treat the image descriptions produced by the vision-language model (VLM) as real text in your search index, not metadata. This dramatically improves retrieval for diagram-heavy documents.
- Citations need explicit architecture. Your search index must have URL and title as non-null, searchable fields across all document types. We learned this the hard way; if you miss it, citations break in ways that are painful to debug.
- Rate limits are not an afterthought. Both Azure Document Intelligence and OpenAI enforce strict throttling. Backoff strategies, sensible parallelization, and batching need to be designed into the pipeline from day one, not bolted on when you hit walls in production.
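For the rate-limit point in particular, one common pattern is retrying with exponential backoff and jitter. The sketch below is a generic illustration, not the project's code: `RateLimitError` is a hypothetical stand-in for whatever throttling error a given SDK raises, and the delay parameters are arbitrary defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a service's throttling (HTTP 429) error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimitError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * rng())  # full jitter: wait a random fraction of the cap


# Usage: a hypothetical call that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("throttled")
    return "ok"


waits = []  # record delays instead of really sleeping
result = call_with_backoff(flaky_call, sleep=waits.append, rng=lambda: 1.0)
print(result, waits)  # ok [1.0, 2.0]
```

Injecting `sleep` and `rng` keeps the retry logic testable without waiting. In production the defaults apply, and the jitter spreads out retries from parallel workers so they don't hit the service again in lockstep.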
The real lesson
Copilot works fine for simple scenarios: a handful of documents, mostly text, citations optional. For that use case, "just upload it" is genuinely good advice.
But for complex enterprise PDFs at scale (tables with nested structure, diagrams that carry meaning and hundreds of documents requiring precise citations) you'll end up building your own pipeline. And it will be hybrid: one tool for text, another for figures, a third for interpretation.
The deeper insight isn't that Copilot fails; it's that document understanding was never a single problem to begin with. We just treated it that way because PDFs look like a single format. They're not. They're containers for fundamentally different types of information, each demanding its own approach.
Once we stopped looking for a single tool and started matching tools to problems, things didn't instantly become easy, but they did become simpler. And, most importantly, the pipeline actually worked.
Originally published at https://www.thoughtworks.com.