Published May 22, 2026 · Minerva Data Solutions

Before the Company Brain: document ingestion checklist

The pre-flight checks that stop bad OCR, wrong versions, and permission leaks from poisoning your automation layer.

document AI · ingestion · knowledge management

Most failed document AI projects do not fail at the chatbot layer. They fail earlier: messy permissions, duplicate documents, poor OCR, missing ownership, weak metadata, stale files, and no way to prove which version was used.

Before building RAG, build the ingestion discipline.

1. Classify the corpus

Start by separating document families: policies, procedures, contracts, board packs, audit evidence, technical manuals, invoices, emails, and scanned PDFs. Each family has different structure, retention rules, owners, and risk.

Do not treat the whole shared drive as one blob. A contract clause, a policy exception, and a scanned invoice need different parsing and review rules.
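One way to make that separation concrete is a small routing table that maps each document family to its own handling rules. The sketch below is illustrative only: the family names, the keyword lists, and the `FamilyRule` fields are assumptions, and a real classifier would look at content and source system, not just file names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyRule:
    parser: str               # hypothetical parser label, not a real library name
    chunk_by: str             # structural unit used later for chunking
    needs_human_review: bool


# Hypothetical rules for three of the families named above.
RULES = {
    "contract": FamilyRule(parser="layout_aware", chunk_by="clause", needs_human_review=True),
    "policy": FamilyRule(parser="text", chunk_by="section", needs_human_review=True),
    "invoice_scan": FamilyRule(parser="ocr", chunk_by="page", needs_human_review=False),
}

# Anything we cannot place is held back for a person to look at.
UNCLASSIFIED = FamilyRule(parser="none", chunk_by="none", needs_human_review=True)

# Toy keyword lists; assumed for illustration.
KEYWORDS = {
    "contract": ("contract", "agreement"),
    "policy": ("policy",),
    "invoice_scan": ("invoice",),
}


def classify(filename: str) -> str:
    """Return the first family whose keyword appears in the file name."""
    name = filename.lower()
    for family, words in KEYWORDS.items():
        if any(word in name for word in words):
            return family
    return "unclassified"


def rule_for(filename: str) -> FamilyRule:
    """Look up the handling rule, falling back to manual review."""
    return RULES.get(classify(filename), UNCLASSIFIED)
```

The useful property is the fallback: a file that matches no family is routed to review instead of silently joining the index.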

2. Preserve source truth

Keep an immutable raw copy. Store extracted text separately. Record the parser version, OCR engine, language, page count, checksum, owner, creation date, modification date, and source location.

If the extracted text later changes because you improve OCR or parsing, you should still know what source produced the old result.
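A minimal sketch of such a provenance record, using only the Python standard library. The `ExtractionRecord` name and its field layout are assumptions for illustration, and it carries only a subset of the fields listed above. The point is that the checksum is computed from the raw bytes, so re-running extraction with a newer parser yields a new record that still points at the same source.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtractionRecord:
    source_location: str
    sha256: str           # checksum of the immutable raw copy
    parser_version: str
    ocr_engine: str
    language: str
    page_count: int
    owner: str
    extracted_at: str     # UTC timestamp of this extraction run


def checksum(raw: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(raw).hexdigest()


def record_extraction(
    raw: bytes,
    *,
    source_location: str,
    parser_version: str,
    ocr_engine: str,
    language: str,
    page_count: int,
    owner: str,
) -> ExtractionRecord:
    """Build one record per extraction run; never overwrite older records."""
    return ExtractionRecord(
        source_location=source_location,
        sha256=checksum(raw),
        parser_version=parser_version,
        ocr_engine=ocr_engine,
        language=language,
        page_count=page_count,
        owner=owner,
        extracted_at=datetime.now(timezone.utc).isoformat(),
    )
```

Storing the extracted text keyed by `(sha256, parser_version)` would let old and new extractions of the same source sit side by side.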

3. Normalize before embedding

Strip repeated headers, footers, page numbers, boilerplate, and legal notices. Repair line-break hyphenation and normalize tables. Keep layout signals when they matter: headings, sections, tables, clauses, signatures, and appendices.

Embedding dirty text creates dirty retrieval. The model cannot compensate for a corpus that was mangled during ingestion.
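As a rough illustration of two of these cleanup steps, the sketch below drops lines that repeat across pages (typical of headers, footers, and legal notices), drops bare page-number lines, and rejoins words hyphenated across a line break. The function names, the `min_share` threshold, and the page-number pattern are assumptions. The heuristics are deliberately crude: a table cell containing only a number would also be dropped, and a genuine compound such as "well-" / "known" would be merged.

```python
import re
from collections import Counter

# Matches lines such as "3", "Page 3", "3 / 12", or "Page 3 of 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(of|/)\s*\d+)?\s*$", re.IGNORECASE)


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Remove lines that appear on at least min_share of pages, plus page numbers."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line
            for line in page.splitlines()
            if line.strip() not in repeated and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def join_hyphenated(text: str) -> str:
    """Rejoin a word split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```

Running both steps on the extracted text, while leaving the raw copy untouched, keeps the cleanup reversible.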

4. Chunk by meaning, not by arbitrary size

Chunking should respect document structure. Policies often work by section. Contracts work by clause. Manuals work by procedure. Tables need special handling. A 1,000-token blind split may be simple, but it can cut the exact evidence in half.

Every chunk should carry enough metadata to explain itself: document ID, version, page, heading path, clause number, language, access scope, and retention policy.
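A minimal sketch of structure-aware chunking, assuming headings are numbered lines such as "1. Scope" or "2.1 Exceptions". That assumption is fragile: a body line that starts with a number would be misread as a heading, so real documents need a parser that knows their layout. The `Chunk` fields are a subset of the metadata listed above, and all names are illustrative.

```python
import re
from dataclasses import dataclass

# Assumed heading convention: "1. Title", "2.1 Title", "3.4.2 Title".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)$")


@dataclass
class Chunk:
    doc_id: str
    version: str
    heading_path: tuple[str, ...]   # e.g. ("1 Scope", "1.1 Exceptions")
    access_scope: str
    text: str


def chunk_by_heading(text: str, *, doc_id: str, version: str, access_scope: str) -> list[Chunk]:
    """Emit one chunk per section body, tagged with its full heading path."""
    chunks: list[Chunk] = []
    path: list[str] = []   # current heading at each depth
    body: list[str] = []   # lines collected since the last heading

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append(Chunk(doc_id, version, tuple(path), access_scope, content))
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            flush()  # close the previous section before changing the path
            depth = match.group(1).count(".") + 1
            del path[depth - 1:]  # drop headings at this depth and deeper
            path.append(f"{match.group(1)} {match.group(2)}")
        else:
            body.append(line)
    flush()
    return chunks
```

Because each chunk carries its heading path, document ID, version, and access scope, a retrieved passage can be traced back and filtered without reopening the source file.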

5. Validate before launch

Run test questions from real users. Include questions with no answer, old-versus-new document conflicts, access-restricted documents, and ambiguous language. Measure retrieval precision before optimizing the LLM.
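A small evaluation harness along these lines might look like the sketch below. It assumes a `retrieve` callable that returns ranked chunk IDs for a question; that callable, the `EvalCase` fields, and the metric names are all hypothetical. Precision at k is computed here as relevant hits in the top k divided by k.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    question: str
    relevant: set[str] = field(default_factory=set)    # empty means no answer exists
    forbidden: set[str] = field(default_factory=set)   # chunks this user must not see


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k slots filled with relevant chunk IDs."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def evaluate(cases: list[EvalCase], retrieve: Callable[[str], list[str]], k: int = 5) -> dict:
    precisions: list[float] = []
    no_answer_false_hits = 0
    permission_leaks = 0
    for case in cases:
        retrieved = retrieve(case.question)[:k]
        if any(chunk_id in case.forbidden for chunk_id in retrieved):
            permission_leaks += 1
        if case.relevant:
            precisions.append(precision_at_k(retrieved, case.relevant, k))
        elif retrieved:
            # The corpus has no answer, yet the retriever returned something.
            no_answer_false_hits += 1
    mean = sum(precisions) / len(precisions) if precisions else 0.0
    return {
        "mean_precision_at_k": mean,
        "no_answer_false_hits": no_answer_false_hits,
        "permission_leaks": permission_leaks,
    }
```

Tracking permission leaks and no-answer false hits separately from precision keeps a good average from hiding a single access violation.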

Further reading: Enterprise document intelligence series and enterprise RAG provenance patterns.

June 19, 2026
