How to Stop AI Hallucinations in Document Automation

The problem with using an LLM to extract data from PDFs isn't the model — it's that most automation workflows trust the model's output without checking it. A single hallucinated field value can corrupt your database, misfeed your CRM, or trigger a downstream workflow with garbage data. Here's how to fix that with a validation layer that makes hallucinations irrelevant.

Brendan Andrew Chase

June 9, 2026  ·  15 min read  ·  Automation

The Real Problem: Trust Without Verification

Here is a workflow that exists in hundreds of businesses right now: a PDF arrives — an invoice, a purchase order, a supplier contract, a job application. An AI model reads it, extracts the relevant fields, and writes those values directly into a database or CRM. Clean, fast, no human in the loop.

And it works. Until it doesn't.

One day the model reads "£12,400" as "£1,2400." Or it extracts a vendor name and returns the vendor's address instead. Or it fills a date field with "as soon as possible" because that phrase appeared near a deadline in the document. Or it confidently returns a VAT number that doesn't exist because the document had a formatting anomaly and the model interpolated something plausible-looking.

That data is now in your system. And unless you have a validation layer between the model's output and your database, you have no way of knowing it happened until something downstream breaks.

This is the hallucination problem in document automation. It's not about the model making things up out of thin air — it's about the model returning plausible-but-wrong values with the same confidence it returns correct ones. The fix is not a better prompt. The fix is a system that doesn't trust the model's output until it has verified it against a rigid schema.

What "Hallucination" Actually Means in Document Processing

In the context of document data extraction, hallucination takes three distinct forms. Understanding which type you're dealing with determines which part of your validation layer catches it.

1. Type Mismatch Hallucination

The model returns a value of the wrong type. A currency field gets a string. A date field gets a relative expression like "next Friday." A numeric field gets a range like "10–15." This is the most common type and the easiest to catch with schema validation.

2. Value Fabrication

The model returns a value of the correct type that doesn't actually appear in the document. A VAT number that looks valid but isn't. An invoice total that's been recalculated incorrectly. A company name that's a plausible variant of the real one. This requires business-rule validation to catch.

3. Field Confusion

The model extracts a real value from the document but assigns it to the wrong field. The billing address ends up in the shipping address field. The subtotal ends up in the total field. The model was reading the right document — it just mapped the fields incorrectly.
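To make the three forms concrete, here is a minimal sketch using made-up extraction results. The field values and both helper functions are illustrative assumptions, not part of any library. It suggests why a type check alone catches only the first form, while a cross-field arithmetic check is needed for the other two.

```javascript
// Hypothetical model outputs, one per failure mode (values invented for illustration).
const typeMismatch   = { invoice_date: "next Friday", subtotal: 100, tax_amount: 20, total_due: 120 };
const fabrication    = { invoice_date: "2026-05-31", subtotal: 100, tax_amount: 20, total_due: 150 };
// Real values, wrong fields: subtotal and total_due have been swapped.
const fieldConfusion = { invoice_date: "2026-05-31", subtotal: 120, tax_amount: 20, total_due: 100 };

// Type/format check: the date is shaped like YYYY-MM-DD and all amounts are finite numbers.
function passesTypeCheck(record) {
  const dateOk = typeof record.invoice_date === "string" && /^\d{4}-\d{2}-\d{2}$/.test(record.invoice_date);
  const amountsOk = ["subtotal", "tax_amount", "total_due"].every((key) => Number.isFinite(record[key]));
  return dateOk && amountsOk;
}

// Business-rule check: subtotal + tax must equal total_due, compared in integer pence.
function passesArithmeticCheck(record) {
  const toPence = (amount) => Math.round(amount * 100);
  return toPence(record.subtotal) + toPence(record.tax_amount) === toPence(record.total_due);
}

for (const [name, record] of Object.entries({ typeMismatch, fabrication, fieldConfusion })) {
  console.log(name, { typeCheck: passesTypeCheck(record), arithmeticCheck: passesArithmeticCheck(record) });
}
```

Only the first record fails the type check; the other two are caught by the arithmetic check. Not every field confusion is caught this way: a billing address swapped with a shipping address would satisfy both checks.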

Most automation setups that use LLMs for document extraction protect against none of these. The model returns a JSON object, the workflow maps the keys to database fields, and the values go in. The assumption is that if the JSON looks right, the data is right.

That assumption is wrong often enough to cause real problems — and it compounds. A corrupted invoice record doesn't just sit there; it affects your accounts payable run, your supplier relationships, and your tax records.

Why a Better Prompt Won't Fix This

The instinct when something goes wrong with an AI extraction is to fix the prompt. Make it more specific. Tell the model exactly what format you want. Add examples. This helps — but it doesn't solve the structural problem.

Prompt improvements reduce the frequency of errors. They don't eliminate them, and they don't detect them when they occur. A prompt that says "return the invoice total as a number with two decimal places" will produce correctly formatted totals most of the time. But when the document has a formatting quirk the model hasn't seen before, or when the OCR layer upstream produced garbled text, the model will still return something — and that something will still look like a number.

The fundamental constraint

No amount of prompt engineering has been shown to remove hallucinations from a language model entirely. They are an inherent property of how LLMs generate text: the model produces a statistically likely next token, not a verified one. The goal is not to eliminate hallucinations. It's to catch them before they cause damage.

This is not a pessimistic position — it's a realistic one, and it leads to a much more robust architecture than chasing the perfect prompt. Instead of asking "how do I make the model not hallucinate," the better question is: "how do I verify the output before it goes anywhere important?"

The answer is a validation layer. And that validation layer doesn't rely on the model at all — it uses deterministic, rule-based schema checks that either pass or fail with no ambiguity.

The Two-Layer Architecture: Structured Output + Schema Validation

The architecture that makes document automation reliable has two distinct layers. Each one does a different job, and you need both.

Layer 1: Structured Output

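You tell the LLM exactly what JSON shape to return. When the API enforces that schema natively, the model is constrained to produce output in that format: it can't return prose, it can't add extra fields, and it must include the fields you specify. When the schema is only described in the prompt, treat the shape as a strong request rather than a guarantee. Either way, this reduces type errors and field confusion significantly.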

Layer 2: Schema Validation

A separate, deterministic node takes the model's JSON output and runs it against a JSON Schema definition. Every field is checked: type, format, allowed values, required presence, value ranges. Anything that fails is flagged for human review — it never touches the database.

Layer 1 (structured output) reduces the frequency of malformed output. Layer 2 (schema validation) catches whatever slips through. Together, they create a system where the only data that reaches your database is data that has been verified against explicit, machine-enforced rules.

The key insight is that Layer 2 is entirely independent of the AI model. It doesn't care how the model generated its answer, what prompt you used, or which LLM you're running. It applies a rigid set of rules to a JSON object. Either the JSON passes or it doesn't. There's no probability here. No malformed or out-of-range value can survive a validation layer that says "this field must be a positive number less than 1,000,000 in the format 0.00." Values that are well-formed but wrong are the job of the cross-field business rules that sit alongside the schema.
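As a rough illustration of what "deterministic" means here, the sketch below runs a handful of hand-written rules against a model's JSON output. The rule set and function names are assumptions made for this example; a production workflow would normally use a JSON Schema validator instead of hand-rolled rules.

```javascript
// Each rule is a plain function: it returns an error message, or null when the record is fine.
// The rules below are illustrative, not the full invoice schema.
const rules = [
  (r) => (Number.isFinite(r.total_due) ? null : "total_due must be a number"),
  (r) => (r.total_due > 0 && r.total_due < 1000000 ? null : "total_due must be positive and below 1,000,000"),
  (r) => (["GBP", "USD", "EUR", "CAD", "AUD"].includes(r.currency) ? null : "currency must be a supported code"),
];

// Layer 2: run every rule against the model's JSON. No model call, no probabilities.
function validate(record) {
  const errors = rules.map((rule) => rule(record)).filter((message) => message !== null);
  return { status: errors.length === 0 ? "PASSED" : "FAILED", errors };
}

console.log(validate({ total_due: 1248.5, currency: "GBP" }));
console.log(validate({ total_due: "1,248.50", currency: "pounds" }));
```

The same input always produces the same verdict, which is the property the model itself cannot offer.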

JSON Schema in Practice: What the Validation Node Checks

Let's make this concrete. Say you're extracting data from supplier invoices. Here's the kind of JSON Schema your validation node would enforce.

JSON Schema — Invoice Extraction Validator

{
  "type": "object",
  "required": [
    "invoice_number",
    "invoice_date",
    "supplier_name",
    "line_items",
    "subtotal",
    "tax_amount",
    "total_due",
    "payment_terms",
    "currency"
  ],
  "additionalProperties": false,
  "properties": {
    "invoice_number": {
      "type": "string",
      "minLength": 1,
      "maxLength": 50
    },
    "invoice_date": {
      "type": "string",
      "format": "date",
      "description": "ISO 8601 — YYYY-MM-DD only"
    },
    "due_date": {
      "type": ["string", "null"],
      "format": "date"
    },
    "supplier_name": {
      "type": "string",
      "minLength": 1,
      "maxLength": 200
    },
    "supplier_vat": {
      "type": ["string", "null"],
      "pattern": "^[A-Z]{2}[0-9A-Z]{2,12}$"
    },
    "currency": {
      "type": "string",
      "enum": ["GBP", "USD", "EUR", "CAD", "AUD"]
    },
    "subtotal": {
      "type": "number",
      "minimum": 0,
      "maximum": 10000000,
      "multipleOf": 0.01
    },
    "tax_amount": {
      "type": "number",
      "minimum": 0,
      "maximum": 10000000,
      "multipleOf": 0.01
    },
    "total_due": {
      "type": "number",
      "minimum": 0,
      "maximum": 10000000,
      "multipleOf": 0.01
    },
    "payment_terms": {
      "type": "string",
      "enum": ["NET_7", "NET_14", "NET_30", "NET_60", "DUE_ON_RECEIPT"]
    },
    "line_items": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "required": ["description", "quantity", "unit_price", "line_total"],
        "properties": {
          "description": { "type": "string", "minLength": 1 },
          "quantity":    { "type": "number", "minimum": 0 },
          "unit_price":  { "type": "number", "minimum": 0 },
          "line_total":  { "type": "number", "minimum": 0 }
        }
      }
    }
  }
}

Notice what this schema enforces that a prompt cannot:

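  • Dates must be ISO 8601. "Next month," "31st May," and "05/31/26" all fail. Only "2026-05-31" passes. This holds only if your validator actually asserts format: the JSON Schema specification treats format as an annotation unless the validator is configured to enforce it, and in Ajv v8 the date format comes from the separate ajv-formats package. If the model returns anything else, the record is flagged. Even if the value is technically correct, you know the format is wrong for your database.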
  • Currency must be one of five specific codes. The model cannot return "pounds," "sterling," "£," or "British Pounds" — any of which would break a downstream currency conversion lookup. Only the exact enum value is accepted.
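  • VAT numbers must match a specific regex pattern. A garbled VAT number that doesn't fit the pattern (a two-letter country prefix followed by 2 to 12 letters or digits) fails immediately. A fabricated number that happens to fit the pattern still passes, so treat this as a format check, not proof that the number exists.
  • All monetary values must be to the penny. multipleOf: 0.01 rejects values with more than two decimal places before they reach accounts payable. One caveat: validators typically implement multipleOf with floating-point division, so a legitimate value such as 19.99 can fail unless you allow a tolerance (Ajv has a multipleOfPrecision option for this) or extract amounts as integer pence.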
  • No extra fields allowed. additionalProperties: false means if the model adds a field you didn't ask for — perhaps a hallucinated "confidence_score" or "notes" field — the entire record fails. The shape must match exactly.
  • Required fields must be present. The model cannot return a record with a missing invoice number or supplier name. Null values are only permitted where explicitly allowed.
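For readers who want to see what two of these rules amount to in plain code, here is a dependency-free sketch of a strict date check and a "whole pence" check. It is a hand-written stand-in for illustration, not how Ajv or any other validator implements format or multipleOf, and the function names are assumptions.

```javascript
// Accepts only real calendar dates written as YYYY-MM-DD.
function isIsoDate(value) {
  if (typeof value !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  // Impossible dates (e.g. 2026-02-30) either fail to parse or roll over to another day.
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

// True when the amount has at most two decimal places.
// A small tolerance absorbs floating-point noise from the multiplication.
function isWholePence(amount) {
  if (typeof amount !== "number" || !Number.isFinite(amount)) return false;
  const pence = amount * 100;
  return Math.abs(pence - Math.round(pence)) < 1e-6;
}

console.log(isIsoDate("2026-05-31"), isIsoDate("05/31/26"), isIsoDate("2026-02-30"));
console.log(isWholePence(19.99), isWholePence(19.995));
```

The tolerance in isWholePence is the design choice worth noting: checking for an exact integer after multiplying or dividing floats is what makes naive "multiple of 0.01" checks reject valid amounts.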

And here's the key business-rule check you add on top of the schema — a simple arithmetic verification you run in a Code node immediately after schema validation passes:

n8n Code Node — Cross-field Arithmetic Check

// Cross-field arithmetic check, run after schema validation has passed:
//   1. line items must sum to the subtotal
//   2. subtotal + tax must equal total_due
// Both comparisons allow a 1p rounding tolerance.
const data = $input.item.json;

// Compare in integer pence so floating-point noise cannot tip a 1p
// difference over (or under) the tolerance.
const toPence = (amount) => Math.round(amount * 100);
const format  = (pence) => (pence / 100).toFixed(2);

const linesPence    = toPence(data.line_items.reduce((sum, item) => sum + item.line_total, 0));
const subtotalPence = toPence(data.subtotal);
const expectedPence = toPence(data.subtotal + data.tax_amount);
const totalPence    = toPence(data.total_due);

const linesMismatch = Math.abs(linesPence - subtotalPence) > 1;
const totalMismatch = Math.abs(expectedPence - totalPence) > 1;

if (linesMismatch || totalMismatch) {
  return [{
    json: {
      ...data,
      _validation_status: "FAILED",
      _validation_errors: [
        linesMismatch ? `Line items sum (${format(linesPence)}) ≠ subtotal (${format(subtotalPence)})` : null,
        totalMismatch ? `Subtotal + tax (${format(expectedPence)}) ≠ total_due (${format(totalPence)})` : null,
      ].filter(Boolean)
    }
  }];
}

return [{
  json: { ...data, _validation_status: "PASSED" }
}];

No LLM is involved in either of these checks. They are pure logic. A hallucinated invoice total that doesn't match the sum of line items will fail this check every single time, regardless of how confidently the model returned it.
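The same idea extends down to individual line items. The sketch below is an additional check that is not part of the workflow above: it assumes each line's total is simply quantity times unit price, with no per-line discounts or taxes, and the function name is hypothetical.

```javascript
// Returns error messages for line items where quantity × unit_price does not match line_total.
function checkLineItems(lineItems, tolerancePence = 1) {
  const errors = [];
  lineItems.forEach((item, index) => {
    const expectedPence = Math.round(item.quantity * item.unit_price * 100);
    const actualPence = Math.round(item.line_total * 100);
    if (Math.abs(expectedPence - actualPence) > tolerancePence) {
      errors.push(
        `Line ${index + 1}: quantity × unit_price (${(expectedPence / 100).toFixed(2)}) ≠ line_total (${(actualPence / 100).toFixed(2)})`
      );
    }
  });
  return errors;
}

// Invented sample data: the second line's total does not match its quantity and price.
const sampleLines = [
  { description: "Widget", quantity: 3, unit_price: 19.99, line_total: 59.97 },
  { description: "Gadget", quantity: 2, unit_price: 10.0, line_total: 25.0 },
];
console.log(checkLineItems(sampleLines));
```

If your suppliers do apply per-line discounts, this rule would need a discount field in the schema before it could be used without generating false failures.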

Building the Validation Layer in n8n

Here's how the complete document automation workflow looks in n8n, from PDF arrival to database write.

Workflow Structure

1. Trigger: Email / Webhook / Google Drive Watch

New PDF arrives. The trigger node passes the raw file to the next step.

2. Extract Text Node (PDF → Text)

Converts the PDF to raw text. If OCR is needed (scanned documents), a separate OCR step runs first. Quality check: flag documents where extracted text is below 200 characters as likely OCR failures before sending to the LLM.

3. LLM Extraction Node (Structured Output)

System prompt instructs the model to extract fields and return them as a JSON object matching the defined schema. The prompt includes the schema definition so the model knows the expected types. Temperature set to 0 for maximum determinism.

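4. JSON Schema Validation Node ← The critical step

A Code node runs the model's output through the JSON Schema definition using the ajv library. On a self-hosted n8n instance, the Code node can load external npm modules such as ajv once they are installed and allowed through the NODE_FUNCTION_ALLOW_EXTERNAL environment variable. Where that isn't available, run a validation script through the Execute Command node or an external service instead. Returns PASSED with the validated data, or FAILED with a list of specific validation errors.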

5. Business Rules Check Node ← The second gate

Cross-field arithmetic verification (line items sum to subtotal, subtotal + tax = total). Any discrepancy above £0.01 triggers a FAILED status and adds a specific error message describing which check failed.

6. IF Node: Status = PASSED?

Branches the workflow. PASSED records continue to the database write. FAILED records go to the escalation path.

PASSED branch

Write to database / CRM. Archive the original PDF. Log the successful extraction. Done.

FAILED branch

Send to review queue. Notify the ops team with the specific validation errors. Original PDF attached. Human reviews and corrects.

The workflow has two lanes from step 6 onward. Data that passes validation flows straight to your system with no human involvement. Data that fails validation never touches your database — it goes to a review queue with a precise error message telling a human exactly what was wrong.
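Two of the decisions in this workflow are simple enough to sketch in a few lines: the text-length quality gate from step 2 and the routing decision from step 6. The function names and lane labels are assumptions for illustration; in n8n the routing would be an IF node rather than code.

```javascript
const MIN_TEXT_LENGTH = 200; // threshold from step 2; tune it for your own documents

// Step 2 quality gate: very little extracted text usually means OCR or text extraction failed.
// Leading and trailing whitespace is ignored so a page of blank lines does not count as content.
function looksLikeOcrFailure(extractedText) {
  return extractedText.trim().length < MIN_TEXT_LENGTH;
}

// Step 6 routing: only an explicit PASSED status reaches the database.
// Anything else, including a missing status, falls through to the review queue.
function route(record) {
  return record._validation_status === "PASSED" ? "database" : "review_queue";
}

console.log(looksLikeOcrFailure("Invoice"));
console.log(route({ _validation_status: "PASSED" }), route({ _validation_status: "FAILED" }), route({}));
```

Routing on an explicit PASSED rather than on "not FAILED" means a record that somehow skipped validation still ends up in front of a human.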

This is the key operational shift: instead of reviewing every extracted record, your team only reviews the ones that failed. If your validation pass rate is 95%, that's a 20× reduction in manual review load compared to checking everything. And the 5% that do reach human review come with a specific error message — "line items sum (£1,248.50) ≠ subtotal (£1,248.00)" — so the reviewer knows exactly where to look on the original document.

The Escalation Path: What Happens When Validation Fails

A failed validation record should never be silently dropped — that's just a different kind of data problem. And it should never auto-retry without modification, because the same document will produce the same failure.

The escalation path needs to do three things:

1. Surface the error precisely

The notification to your team should include the specific validation error, not just "extraction failed." "Field 'invoice_date' must match format 'date', got '31/05/2026'" tells the reviewer exactly what to look for. Include the extracted JSON alongside the error so the reviewer can see what the model thought it found.

2. Attach the original document

The reviewer needs to see the source PDF alongside the extracted data. If a VAT number validation fails, the reviewer needs to check the actual document to confirm whether the number was garbled, genuinely invalid, or extracted from the wrong field entirely. These are three different root causes with three different fixes.

3. Track failure patterns over time

Log every validation failure with its error type. After a week, look at your failure distribution. If 80% of failures are date format errors from a specific supplier's PDFs, that's a preprocessing problem — add a date normalisation step upstream. If failures cluster around a specific document type, your prompt needs updating for that type. The validation layer is also a diagnostic tool.
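Looking at the failure distribution can be as simple as counting log entries by one field. The sketch below assumes a hypothetical log format with supplier and errorType fields; the entries are invented, and your own log will have whatever shape your workflow writes.

```javascript
// Hypothetical failure log; in practice this comes from wherever FAILED records are logged.
const failureLog = [
  { supplier: "Supplier A", errorType: "date_format" },
  { supplier: "Supplier A", errorType: "date_format" },
  { supplier: "Supplier A", errorType: "date_format" },
  { supplier: "Supplier B", errorType: "date_format" },
  { supplier: "Supplier C", errorType: "arithmetic_mismatch" },
];

// Counts entries by the given key and returns them sorted, most frequent first.
function failureDistribution(log, key) {
  const counts = new Map();
  for (const entry of log) {
    counts.set(entry[key], (counts.get(entry[key]) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([value, count]) => ({ value, count, share: count / log.length }))
    .sort((a, b) => b.count - a.count);
}

console.log(failureDistribution(failureLog, "errorType"));
console.log(failureDistribution(failureLog, "supplier"));
```

Running it once by error type and once by supplier is usually enough to see whether a failure cluster is a preprocessing problem or a problem with one source of documents.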

Most teams we work with find that after the first two to four weeks of running this architecture, their failure patterns are clear enough to add targeted pre-processing steps that bring the validation pass rate from ~90% to ~97%. The remaining 3% tend to be genuine anomalies — documents with unusual structure, handwritten annotations over printed fields, or supplier formatting that genuinely doesn't conform to any recognisable pattern. Those few records are flagged, reviewed, and entered manually. That's the right outcome.

Realistic Accuracy Numbers — And What "100% Accurate" Actually Means

Let's be honest about what you can expect.

The claim that document automation can be "100% accurate" needs to be unpacked. Without a validation layer, LLM-based extraction achieves somewhere between 85% and 95% field-level accuracy on well-structured PDFs — higher for digital PDFs, lower for scanned ones, lower still for documents with unusual layouts or dense tables. That sounds reasonable until you realise that 95% field-level accuracy on a 20-field invoice means roughly one field is wrong in every invoice. Across hundreds of invoices a month, that's hundreds of errors in your accounts payable data.
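The arithmetic behind that claim is short enough to check directly. The sketch below adds one assumption that the text does not state: field errors are treated as independent of each other, which real documents only approximate.

```javascript
const fieldAccuracy = 0.95;   // per-field accuracy from the paragraph above
const fieldsPerInvoice = 20;

// Expected number of wrong fields on one invoice.
const expectedWrongFields = fieldsPerInvoice * (1 - fieldAccuracy);

// Share of invoices with at least one wrong field, assuming independent field errors.
const shareWithAtLeastOneError = 1 - Math.pow(fieldAccuracy, fieldsPerInvoice);

console.log(expectedWrongFields.toFixed(2));                     // "1.00"
console.log((shareWithAtLeastOneError * 100).toFixed(0) + "%");  // "64%"
```

Under that assumption, about one wrong field per invoice on average translates into roughly two in every three invoices carrying at least one error.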

With a validation layer, the accuracy claim changes meaning:

  • 100%: share of records reaching your database that have passed every schema and business-rule check
  • ~95%: typical straight-through processing rate on clean digital PDFs
  • ~80%: typical straight-through rate on scanned or mixed-format documents

The "100% accurate" statement means: every record that enters your database has passed a deterministic schema check and the cross-field arithmetic checks. Malformed, out-of-range, and arithmetically inconsistent values are kept out of your system. A wrong value that still satisfies every rule can get through, which is why the rules should be as tight as your business allows. The records that don't pass are flagged for human review, and that's a feature, not a failure. You've correctly identified the documents where automated extraction wasn't reliable enough to trust.

Over time, as you tune your pre-processing, your prompt, and your schema based on the failure patterns you observe, the straight-through rate improves. Teams that start at 85% straight-through typically reach 95%+ within three months of running the full architecture.

Pre-Launch Checklist for Any Document Automation Workflow

Before you put a document automation workflow into production, work through this checklist. Each item is a failure mode we've seen in the wild.

Schema covers every field that touches a downstream system

If a field is going into a database column, a CRM field, or an API call — it must be in the schema with type constraints. No exceptions.

Date fields enforce ISO 8601 — no local formats

Date format mismatches are the single most common validation failure. Always normalise to YYYY-MM-DD. Build a pre-processing step to catch common supplier formats (DD/MM/YYYY, MM-DD-YY, "31 May 2026") and convert them before the schema check if needed.

Numeric fields have upper bounds

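Set a maximum value for every monetary field that reflects your actual business. If your largest supplier invoices run to tens of thousands of pounds, a cap just above that level catches a model that read "10,000.00" as 1000000 because garbled OCR output dropped the decimal point. A blanket £10,000,000 cap would let that error through.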

Enum fields are exhaustive

If a field has a fixed set of valid values (currency codes, payment terms, document types), enumerate all of them in the schema. Review the list with your ops team — they will know edge cases the tech team won't.

LLM temperature is set to 0

For data extraction tasks, you want maximum determinism. Temperature 0 means the model consistently picks the most likely token rather than sampling creatively. This doesn't eliminate errors but reduces variance.

The workflow has been tested on at least 50 real documents

Ten documents is not enough to see edge cases. Run 50 real samples through the full pipeline before going live. Look at every failure. If you can't get 50 samples because you're building before you have volume, run the workflow on historical documents that you can manually verify.

The review queue has a named owner

Validation failures need to be reviewed within a defined SLA. Decide before launch who reviews them, how often, and what the turnaround time is. An unmonitored review queue is as bad as no validation layer at all.

Failure patterns are being logged and reviewed monthly

The validation layer generates data about your extraction quality. That data tells you where to improve the pre-processing, the prompt, and the schema. If you're not reviewing it, you're leaving improvement on the table.
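The date normalisation step mentioned in the checklist could look something like the sketch below. It is one possible approach under stated assumptions: slash dates are treated as day-first (DD/MM/YYYY), two-digit years and ambiguous formats such as MM-DD-YY are deliberately not guessed at, and the function names are hypothetical.

```javascript
const MONTHS = { jan: 1, feb: 2, mar: 3, apr: 4, may: 5, jun: 6, jul: 7, aug: 8, sep: 9, oct: 10, nov: 11, dec: 12 };

// Builds YYYY-MM-DD, or returns null for impossible dates such as 31/02/2026,
// which Date.UTC would otherwise silently roll over into March.
function toIso(year, month, day) {
  const date = new Date(Date.UTC(year, month - 1, day));
  const sameDay = date.getUTCFullYear() === year && date.getUTCMonth() === month - 1 && date.getUTCDate() === day;
  return sameDay ? date.toISOString().slice(0, 10) : null;
}

// Returns YYYY-MM-DD, or null when the input is not one of the recognised formats.
function normaliseDate(raw) {
  const text = raw.trim();
  let match;
  if (/^\d{4}-\d{2}-\d{2}$/.test(text)) return text; // already ISO-shaped; the schema check still validates it
  if ((match = text.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/))) { // DD/MM/YYYY (day-first is an assumption)
    return toIso(Number(match[3]), Number(match[2]), Number(match[1]));
  }
  if ((match = text.match(/^(\d{1,2})(?:st|nd|rd|th)? ([A-Za-z]{3})[a-z]* (\d{4})$/))) { // "31 May 2026"
    const month = MONTHS[match[2].toLowerCase()];
    return month ? toIso(Number(match[3]), month, Number(match[1])) : null;
  }
  return null; // unknown or ambiguous: leave it for the validation layer to flag
}

console.log(normaliseDate("31/05/2026"), normaliseDate("31 May 2026"), normaliseDate("as soon as possible"));
```

Returning null for anything unrecognised keeps the normaliser honest: it converts what it is sure about and lets everything else fail validation and reach a reviewer.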

Want This Built for Your Document Workflow?

I design and build document automation systems with full validation layers — for invoice processing, contract extraction, application intake, and any other high-volume document flow. Tell me what you're currently processing and where the unreliable data is causing problems.

Frequently Asked Questions

Will this work with any LLM, or does it require a specific model?

The validation layer is model-agnostic: it takes a JSON object as input and checks it against a schema. Any LLM that can be prompted to return structured JSON output will work. In practice, models with native structured output support (where the API itself enforces JSON output format) reduce the risk of malformed JSON reaching the validator. GPT-4o, Claude 3.5+, and Gemini 1.5+ all offer some form of this, though how strictly the schema is enforced differs by provider and model version, so check the documentation for the model you plan to use. For most invoice and contract extraction tasks, a smaller model like Claude Haiku or GPT-4o-mini is accurate enough and significantly cheaper at scale.

What if the validation failure rate is too high to be practical?

A high failure rate on launch is almost always a diagnostic signal, not a fundamental limitation. Look at the failure distribution: if 70% of failures share the same error type, fix that specific issue first. Common culprits are date format mismatches from suppliers using non-ISO dates, currency values that include the currency symbol rather than just the number, and PDF extraction quality issues where the text layer is corrupted. Each of these has a targeted fix that doesn't require changing the LLM or the schema — usually a pre-processing normalisation step that runs before the model sees the document.
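As an example of such a targeted fix, here is a sketch of a normaliser for amounts that arrive with a currency symbol attached. It assumes UK/US number formatting (comma for thousands, dot for decimals) and a hypothetical function name; amounts written in the continental style, such as 1.248,50, are rejected rather than guessed at.

```javascript
// Returns a number, or null when the text is not an unambiguous amount.
function parseAmount(raw) {
  const cleaned = raw.trim().replace(/^[£$€]\s*/, "");
  // Accepts "1248", "1248.5" and "1,248.50": thousands separators must split
  // the digits into groups of three, with at most two decimal places.
  if (!/^(\d{1,3}(,\d{3})+|\d+)(\.\d{1,2})?$/.test(cleaned)) return null;
  return Number(cleaned.replace(/,/g, ""));
}

console.log(parseAmount("£12,400"), parseAmount("£1,248.50"), parseAmount("£1,2400"));
```

A misplaced separator like the one in "£1,2400" comes back as null instead of being silently reinterpreted, so the record is flagged rather than stored with a guessed value.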

Should the model re-attempt extraction when validation fails?

A retry with the exact same prompt on the exact same document will usually produce the same failure: the model is close to deterministic at temperature 0, and the document hasn't changed. Retries are only useful if you change something: the prompt (adding the specific error to the instructions), the pre-processing (normalising the date format that caused the failure), or the model. For production systems, the more reliable pattern is to send failures to human review with the error clearly surfaced, rather than building auto-retry logic that can get stuck in loops. Reserve retries for infrastructure failures (API timeout, network error), not model output failures.
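That policy can be written down as a small decision function. The failure kinds and the attempt limit below are assumptions for illustration; use whatever labels your workflow attaches to errors.

```javascript
// Hypothetical failure kinds. Only infrastructure failures are worth retrying unchanged.
const RETRYABLE_KINDS = new Set(["API_TIMEOUT", "NETWORK_ERROR"]);

// Decides what happens after a failed attempt.
// attempt is the number of attempts already made for this document.
function nextAction(failure, attempt, maxAttempts = 3) {
  if (RETRYABLE_KINDS.has(failure.kind) && attempt < maxAttempts) return "RETRY";
  // Validation failures, and infrastructure failures that keep recurring, go to a human.
  return "HUMAN_REVIEW";
}

console.log(nextAction({ kind: "API_TIMEOUT" }, 1));
console.log(nextAction({ kind: "VALIDATION_FAILED" }, 1));
console.log(nextAction({ kind: "API_TIMEOUT" }, 3));
```

The attempt cap is what prevents the loop the answer above warns about: even a retryable failure eventually lands in the review queue.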

How is this different from using the LLM's built-in "structured output" feature?

Structured output (available in OpenAI's API and others) constrains the model to return JSON that conforms to a schema you pass in — it handles the formatting guarantee. That's Layer 1. But structured output doesn't validate the values: it ensures the model returns {"total": 0} instead of {"total": "not found"}, but it won't catch a total of 99999.99 when the document clearly shows £1,248.00. Layer 2 validation — the schema check and arithmetic verification — is what catches value errors. You need both layers for a genuinely reliable system.

Can this architecture handle different document types — invoices, contracts, applications — in the same workflow?

Yes, but each document type needs its own schema and its own extraction prompt. The cleanest way to handle this in n8n is to classify the document type first (a lightweight LLM call asking "is this an invoice, a purchase order, or a contract?") and then route to the appropriate extraction + validation sub-workflow. Trying to handle all document types with a single generic schema produces a schema so permissive it catches nothing — you lose the validation benefit entirely. One schema per document type is the correct pattern.
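The routing idea can be sketched as a registry that maps each document type to its own validator. The document types, field names, and one-line validators here are placeholders; in a real workflow each entry would be a full schema check for that type.

```javascript
// One validator per document type. Each returns a list of error messages.
// These single-field checks are placeholders for full per-type schemas.
const validatorsByType = new Map([
  ["invoice", (doc) => (typeof doc.invoice_number === "string" ? [] : ["invoice_number is required"])],
  ["contract", (doc) => (typeof doc.effective_date === "string" ? [] : ["effective_date is required"])],
]);

// Validates the extracted data with the schema registered for its type.
// An unknown type fails rather than falling back to a permissive generic schema.
function validateByType(documentType, extracted) {
  const validator = validatorsByType.get(documentType);
  if (!validator) {
    return { status: "FAILED", errors: [`No schema registered for document type "${documentType}"`] };
  }
  const errors = validator(extracted);
  return { status: errors.length === 0 ? "PASSED" : "FAILED", errors };
}

console.log(validateByType("invoice", { invoice_number: "INV-001" }));
console.log(validateByType("purchase_order", { po_number: "PO-7" }));
```

Failing unknown types outright is the point of the design: a classifier mistake becomes a review-queue item instead of a record validated against the wrong rules.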

How long does it take to set up this kind of workflow?

For a single document type with a clearly defined schema — say, one supplier's invoice format — a functional workflow with full validation can be built in two to three days. The bulk of that time is in defining the schema precisely (which requires a business conversation about what fields matter and what valid values look like) and testing on real documents. For mixed document types or complex structures like multi-page contracts with variable fields, allow a week to ten days. The pre-launch testing phase — running 50+ real documents and reviewing the failures — should never be skipped, no matter how confident you are in the extraction quality. If you'd like it built for you, get in touch with a sample document and a description of the fields you need.

Written by

Brendan Andrew Chase

AI agent specialist and digital marketing consultant with 10+ years building automation systems for small and mid-sized businesses across the US, UK, and EU. 200+ projects delivered. Founder of Extra Large Marketing Digital, based in Rio de Janeiro.