Document ingestion checklist before you ship RAG
The pre-flight checks that stop bad OCR, wrong versions, and permission leaks from poisoning your retrieval layer.
Most failed document AI projects do not fail at the chatbot layer. They fail earlier: messy permissions, duplicate documents, poor OCR, missing ownership, weak metadata, stale files, and no way to prove which version was used.
Before building RAG, build the ingestion discipline.
Start by separating document families: policies, procedures, contracts, board packs, audit evidence, technical manuals, invoices, emails, and scanned PDFs. Each family has different structure, retention rules, owners, and risk.
Do not treat the whole shared drive as one blob. A contract clause, a policy exception, and a scanned invoice need different parsing and review rules.
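One way to make that separation explicit is a small routing table that every file must pass through before parsing. This is a minimal sketch, not a prescribed design: the `FamilyRule` fields, the family names, and the specific rule values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyRule:
    """Hypothetical per-family ingestion settings."""
    chunk_by: str        # structural unit used for chunking
    needs_ocr: bool      # whether the file must go through OCR first
    human_review: bool   # whether extraction is spot-checked by a person


# Illustrative values only; each organization will set its own.
RULES = {
    "policy": FamilyRule(chunk_by="section", needs_ocr=False, human_review=False),
    "contract": FamilyRule(chunk_by="clause", needs_ocr=False, human_review=True),
    "scanned_invoice": FamilyRule(chunk_by="table", needs_ocr=True, human_review=True),
}


def rule_for(family: str) -> FamilyRule:
    """Return the rule for a family, refusing anything unclassified."""
    try:
        return RULES[family]
    except KeyError:
        raise ValueError(
            f"No ingestion rule for family {family!r}; classify it before ingesting"
        ) from None
```

Failing loudly on an unknown family is deliberate: a file that nobody has classified should not slip into the index under default settings.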
Keep an immutable raw copy. Store extracted text separately. Record the parser version, OCR engine, language, page count, checksum, owner, creation date, modification date, and source location.
If the extracted text later changes because you improve OCR or parsing, you should still know what source produced the old result.
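The record described above can be sketched as a small immutable structure keyed by the checksum of the raw file. The field names, the use of SHA-256, the ISO date strings, and the `extraction_key` helper are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionRecord:
    """Provenance for one extraction run over one raw file (hypothetical schema)."""
    source_location: str
    checksum_sha256: str
    parser_version: str
    ocr_engine: str
    language: str
    page_count: int
    owner: str
    created: str    # ISO 8601 date string, by assumption
    modified: str   # ISO 8601 date string, by assumption


def make_record(raw: bytes, **fields) -> IngestionRecord:
    """Build a record whose checksum is derived from the immutable raw bytes."""
    return IngestionRecord(checksum_sha256=hashlib.sha256(raw).hexdigest(), **fields)


def extraction_key(record: IngestionRecord) -> str:
    """Identify one extraction result: same raw file, specific parser and OCR engine."""
    return f"{record.checksum_sha256}:{record.parser_version}:{record.ocr_engine}"
```

Re-running extraction with a newer parser produces a new `extraction_key` while the checksum stays the same, so old and new text can both be traced back to the same source.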
Strip repeated headers, footers, page numbers, boilerplate, and legal notices. Repair words hyphenated across line breaks, and normalize tables rather than flattening them. Keep layout signals when they matter: headings, sections, tables, clauses, signatures, and appendices.
Embedding dirty text creates dirty retrieval. The model cannot compensate for a corpus that was mangled during ingestion.
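A rough sketch of that cleanup follows, assuming the extracted text arrives as one string per page. The function names, the 60% repetition threshold, and the page-number pattern are illustrative heuristics, not a complete cleaner.

```python
import math
import re
from collections import Counter

# Matches lines such as "7", "Page 7", or "Page 7 of 12" (heuristic).
PAGE_NUMBER = re.compile(r"^\s*(?:Page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)


def find_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> set[str]:
    """Lines that appear on at least min_fraction of pages are likely headers or footers."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, math.ceil(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= threshold}


def join_hyphenation(text: str) -> str:
    """Rejoin words split across a line break, e.g. 'reten-' + 'tion'.

    Caveat: this also merges genuine hyphenated compounds that happen to
    break at a line end, so review the output on a sample first.
    """
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


def clean_pages(pages: list[str]) -> list[str]:
    """Remove repeated lines and page numbers, then repair hyphenation."""
    repeated = find_repeated_lines(pages)
    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append(join_hyphenation("\n".join(kept)))
    return cleaned
```

Headings and table rows are left untouched here. Anything that looks like structure should be removed only by an explicit rule, not as a side effect of cleanup.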
Chunking should respect document structure. Policies often work by section. Contracts work by clause. Manuals work by procedure. Tables need special handling. A 1,000-token blind split may be simple, but it can cut the exact evidence in half.
Every chunk should carry enough metadata to explain itself: document ID, version, page, heading path, clause number, language, access scope, and retention policy.
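As a sketch of structure-aware chunking, the example below assumes an upstream parser has already split the document into sections with a heading path and page number. `Section`, `Chunk`, and `chunk_sections` are hypothetical names. Size is measured in characters rather than tokens to keep the example dependency-free, and only a subset of the metadata fields listed above is shown.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """One structural unit from an upstream parser (assumed input shape)."""
    heading_path: tuple[str, ...]
    page: int
    text: str


@dataclass
class Chunk:
    """A retrievable piece of text that can explain where it came from."""
    doc_id: str
    version: str
    page: int
    heading_path: tuple[str, ...]
    access_scope: str
    text: str


def chunk_sections(doc_id: str, version: str, access_scope: str,
                   sections: list[Section], max_chars: int = 1200) -> list[Chunk]:
    """Split on paragraph boundaries inside each section; never merge across sections."""
    chunks: list[Chunk] = []

    def emit(section: Section, text: str) -> None:
        chunks.append(Chunk(doc_id, version, section.page,
                            section.heading_path, access_scope, text))

    for section in sections:
        buffer = ""
        for para in section.text.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            candidate = f"{buffer}\n\n{para}" if buffer else para
            if buffer and len(candidate) > max_chars:
                emit(section, buffer)   # flush before the limit is exceeded
                buffer = para
            else:
                buffer = candidate
        if buffer:
            emit(section, buffer)
    return chunks
```

A single paragraph longer than `max_chars` is kept whole rather than cut, on the view that an oversized chunk is a smaller problem than a clause split mid-sentence.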
Run test questions from real users. Include questions with no answer, old-versus-new document conflicts, access-restricted documents, and ambiguous language. Measure retrieval precision before optimizing the LLM.
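A minimal evaluation harness for such a test set might look like the sketch below. `EvalQuestion`, `evaluate`, and the `retrieve` callable are assumptions for illustration: `retrieve` stands in for whatever function returns ranked chunk IDs for a question. Precision at k is computed here as relevant hits in the top k divided by k.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalQuestion:
    """One test question with labelled expectations (hypothetical schema)."""
    question: str
    relevant_ids: set[str] = field(default_factory=set)   # empty: no answer exists
    forbidden_ids: set[str] = field(default_factory=set)  # chunks this user must not see


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant chunks."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def evaluate(questions: list[EvalQuestion],
             retrieve: Callable[[str], list[str]], k: int = 5) -> dict:
    """Report mean precision, false hits on no-answer questions, and permission leaks."""
    precisions = []
    no_answer_false_hits = 0
    permission_leaks = 0
    for q in questions:
        retrieved = retrieve(q.question)[:k]
        permission_leaks += sum(1 for cid in retrieved if cid in q.forbidden_ids)
        if q.relevant_ids:
            precisions.append(precision_at_k(retrieved, q.relevant_ids, k))
        elif retrieved:
            # The corpus has no answer, yet the retriever returned something.
            no_answer_false_hits += 1
    mean = sum(precisions) / len(precisions) if precisions else 0.0
    return {
        "mean_precision_at_k": mean,
        "no_answer_false_hits": no_answer_false_hits,
        "permission_leaks": permission_leaks,
    }
```

Permission leaks are counted separately from precision because a retriever can score well on relevance and still surface a chunk the user should never see.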
Further reading: Enterprise document intelligence series and enterprise RAG provenance patterns.