Designing a Good Schema

Practical guidance for authoring a DocuPipe schema that extracts what you want, consistently

Overview

A Schema tells DocuPipe what structured data to pull out of a document. Good schemas produce consistent, accurate extractions across many documents. Vague schemas produce results that shift between runs and miss what you actually care about.

Almost every case of "extraction is inconsistent" or "it keeps picking the wrong value" traces back to the schema, not the engine. The fixes are usually simple: clearer field descriptions, tighter wording, and a structure that mirrors the document.

This guide walks through the concrete habits that separate a schema that mostly works from one that reliably works in production. Examples use a generic supplier invoice throughout.

📘

Two habits account for most of the quality gap between a good schema and a vague one: keeping the schema small (every extra field dilutes the model's attention) and writing strong descriptions for the fields you do include. Both matter more than any structural trick or configuration setting.

Writing great field descriptions

Every field has two places to guide the model: description (text explaining what the field is) and examples (an array of sample values). Both matter, and they have different jobs.

Most fields don't need a long description. If there's only one plausible value on the document, a short description works fine - the model is generally good at finding the obvious thing without being told where to look. Save your effort for the fields that need it: ones with hedging language, ones that could be confused with another field, and ones with a specific output format.

Weak:

"The invoice number."

Stronger:

"The invoice number. Not the purchase order number, not the customer reference number, not the delivery note number."

The weak version leaves the model to guess which number on the document qualifies. The strong version rules out the three most confusable fields so the model knows exactly which value to pick.

Avoid hedging words

Soft words leave the decision to the model, and different model runs will decide differently. Grep your descriptions for these and rewrite any hit:

  • prefer, preferred
  • typically, usually, often
  • may, might, could
  • should, should almost always
  • generally, tends to

Before:

"Use the line-item table when available. Prefer the detailed rows over summary rows when both are present."

After:

"Exactly one entry per row in the main line-item table (the table with columns Product, Quantity, Unit Price, Total). Ignore any summary section below the table - summary totals are captured in subtotal and total, not here."

The first version invites the model to choose. The second removes the choice.
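Hedge words are easy to catch mechanically before you ship. A quick sketch of a checker that walks a schema and flags hedging words in descriptions - the hedge list and the `find_hedges` helper are illustrative, not part of DocuPipe:

```python
import re

# Hedging words that make extraction non-deterministic. Extend to taste.
HEDGES = ["prefer", "preferred", "typically", "usually", "often",
          "may", "might", "could", "should", "generally", "tends to"]

def find_hedges(schema: dict, path: str = "$"):
    """Recursively walk a JSON Schema dict, yielding (path, word)
    for every hedging word found in a description."""
    desc = schema.get("description", "")
    for word in HEDGES:
        if re.search(rf"\b{re.escape(word)}\b", desc, re.IGNORECASE):
            yield path, word
    for name, sub in schema.get("properties", {}).items():
        yield from find_hedges(sub, f"{path}.{name}")
    if isinstance(schema.get("items"), dict):
        yield from find_hedges(schema["items"], f"{path}[]")

schema = {
    "description": "Extracts key fields from supplier invoices.",
    "properties": {
        "lineItems": {
            "description": "Prefer the detailed rows when both are present.",
            "items": {"properties": {}},
        }
    },
}
for path, word in find_hedges(schema):
    print(f"{path}: contains hedging word '{word}'")
```

Run this over every schema before production and rewrite any hit into a definite rule.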

Spell out what NOT to extract

When two fields on your document look similar, the model will confuse them unless you tell it which is which. A single line in each description usually fixes it.

For an invoice with both a "Unit Price" column and a "Line Total" column:

  • unitPrice: "The per-unit price from the column labeled 'Unit Price', 'Price/Unit', or 'Rate'. Do NOT use the 'Line Total' or 'Amount' column here."
  • lineTotal: "The row total from the column labeled 'Line Total', 'Amount', 'Ext. Price', or 'Total'. Equals unitPrice × quantity. Do NOT use the 'Unit Price' column here."

Every time two field names share a word or concept, disambiguate in both descriptions. invoiceDate vs dueDate, shipTo vs shipFrom, subtotal vs total - all benefit from an explicit "do NOT use X" line.

👍

Rule of thumb: if two field names share any word (price, date, address, number), assume the model will confuse them unless both descriptions explicitly say which column each refers to.

Put example values in the examples field, not the description

JSON Schema gives you two fields for guidance, and they do different jobs:

  • description - the rules: where to find the value, how to identify it, what format to output it in, what to skip.
  • examples - an array of concrete sample values.

Don't stuff example values into description text. Put them in examples where they belong.

{
  "description": "The invoice number. Not the purchase order number, not the customer reference number.",
  "examples": ["INV-2024-0147", "25831", "900-45721"]
}

A varied set of 2-5 examples is more useful than a single one, especially if your documents come from different sources with different formats. Include at least one example per format variant.

🚧

Examples must match the output format your description specifies. If the description says dates output as YYYY-MM-DD, every example must be in that form - "2024-11-13", not "Nov 13 2024" or "11/13/2024". Mismatched examples contradict the description and confuse the model more than no examples at all.

{
  "description": "The invoice issue date. Output as YYYY-MM-DD. The document may print the date in other formats like '11/13/2024', 'Nov 13 2024', or '13.11.2024' - always convert to ISO.",
  "examples": ["2024-11-13", "2023-01-05", "2024-06-30"]
}
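This consistency is also easy to verify mechanically. A minimal sketch that checks every example against the format the description promises - the ISO-date regex here matches the date field above; swap in whatever pattern your own field specifies:

```python
import re

# The output format promised by the description: YYYY-MM-DD.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

field = {
    "description": "The invoice issue date. Output as YYYY-MM-DD.",
    "examples": ["2024-11-13", "2023-01-05", "Nov 13 2024"],  # last one is wrong
}

# Any example that fails the format check contradicts the description.
bad = [ex for ex in field["examples"] if not ISO_DATE.fullmatch(ex)]
print(bad)  # mismatched examples to fix before shipping
```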

Structuring your schema

Commit to one granularity per array

If your document has two tables that describe roughly the same thing at different levels - say, a detailed line-item table (20 rows) and a category-summary section (5 rows) - each array field in your schema should pull from exactly one of them.

Don't write a lineItems description that says "use the detail rows, or the summary rows if the detail isn't available". Model runs will inconsistently pick one or the other, which makes your output shape shift between documents.

If you actually need both levels, give them separate top-level arrays: lineItems for the 20-row detail and categorySummary for the 5-row rollup. Each gets a clear, unambiguous description.

Prefer flat arrays when the relationship allows

Extraction tends to be more consistent when you avoid nesting arrays inside array items. Patterns like lineItems[].taxes[] are supported, but they ask extraction to commit to the outer array boundary before it knows what the inner array contains, which invites over-splitting or missed items.

When you can, promote the inner array to its own top-level array with a foreign-key field linking back to the parent:

lineItems: [ { lineId: "L-001", product: "Widget A", quantity: 10, unitPrice: 25.00 } ]
taxes:     [ { lineId: "L-001", taxType: "VAT", rate: 0.20, amount: 50.00 } ]

The lineId field on each tax entry lets you reassemble the relationship downstream.
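As a sketch of that downstream reassembly (field names follow the example above; the exact shape of your output will vary):

```python
from collections import defaultdict

# Extraction output with flat top-level arrays linked by a lineId foreign key.
output = {
    "lineItems": [
        {"lineId": "L-001", "product": "Widget A", "quantity": 10, "unitPrice": 25.00},
        {"lineId": "L-002", "product": "Widget B", "quantity": 4, "unitPrice": 80.00},
    ],
    "taxes": [
        {"lineId": "L-001", "taxType": "VAT", "rate": 0.20, "amount": 50.00},
    ],
}

# Group taxes by their parent line, then attach them to each line item.
taxes_by_line = defaultdict(list)
for tax in output["taxes"]:
    taxes_by_line[tax["lineId"]].append(tax)

nested = [
    {**item, "taxes": taxes_by_line.get(item["lineId"], [])}
    for item in output["lineItems"]
]
# nested[0]["taxes"] holds the VAT entry; nested[1]["taxes"] is empty
```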

This is a recommendation, not a hard rule - nested arrays are fine when the relationship is tight and the inner array is short. But if you hit inconsistent extraction on a deeply-nested schema, flattening is often the fastest fix.

Keep your schema as small as it can be

Every field in your schema is text the model has to read and respect. Fields you don't actually use cost you twice: they eat into the model's attention budget, and they add noise to your downstream output.

Before adding a field, ask: will this actually be on most of my documents, and will I use the value? After running your schema on real documents, look at which fields are rarely populated - if they're populated less than 5% of the time and the content is typically wrong when it does fire, remove them. A 15-field schema with strong descriptions beats a 30-field one with sparse descriptions.
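A quick sketch of that audit over a batch of real outputs - field names are illustrative, and the threshold here is raised to 50% only because the sample is tiny:

```python
# Standardization outputs from a batch of real documents.
outputs = [
    {"invoiceNumber": "INV-001", "poNumber": None, "total": 120.0},
    {"invoiceNumber": "INV-002", "poNumber": None, "total": 84.5},
    {"invoiceNumber": "INV-003", "poNumber": "PO-9", "total": None},
]

# Fraction of documents on which each field is actually populated.
fields = {key for doc in outputs for key in doc}
rates = {
    f: sum(1 for doc in outputs if doc.get(f) not in (None, "", [])) / len(outputs)
    for f in fields
}

# Fields below the threshold are candidates for removal.
candidates = sorted(f for f, r in rates.items() if r < 0.5)
print(candidates)
```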

Describing the schema itself

At the top level of your schema, alongside its fields, is a description. This is the schema's identity text: what this schema is for, what kinds of documents it applies to. One or two sentences is enough.

{
  "description": "Extracts key fields from supplier invoices, including header totals, line items, tax breakdowns, and payment terms. Designed for invoices from multiple suppliers with varying layouts.",
  "properties": { ... }
}

This framing helps the model pick up conventions that aren't in any single field. It shouldn't include extraction rules - those go in guidelines.

Writing good guidelines

Guidelines are the free-text instructions that apply to the whole schema, not one specific field. This is where document-wide extraction rules go.

Use guidelines to tell the model:

  • How to handle the document as a whole. "If a field appears in both a printed table and a handwritten stamp, always take the handwritten stamp." Rules that span multiple fields.
  • Domain conventions the model won't know. "In our industry, 'unit' refers to pallets, not individual cases." Terminology the model wouldn't infer on its own.
  • How to handle OCR quirks you've seen. "Handwritten digits are often OCR'd as letters and vice versa (B looks like 3, S looks like 5). When the format calls for a letter in a specific position, interpret the character as a letter even if OCR read it as a digit."

Keep guidelines short and factual. Every hedging word in guidelines weakens every field. The same "no soft words" rule applies here.

Iterating on your schema

Schemas get better by running them on real documents, looking at the output, and adjusting. You don't need to build a formal test harness - you just need a handful of real documents and a habit of checking output carefully.

The practical loop

  1. Pick 2-3 real documents that span the formats you care about. If you have 5 suppliers with different invoice layouts, include at least one from each.
  2. Run the schema against each document and look at the output side-by-side with the source PDF.
  3. For every wrong or missing value, figure out which field in the schema is responsible and ask whether its description could have told the model what you actually wanted. Most of the time the fix is to tighten that description - name the label on the document, rule out a confusable neighbor, remove a hedging word. If the description already seems clear and the model is still wrong, that's a model limitation worth flagging to support rather than trying to work around in the schema.
  4. Fix the description or guideline and re-run. Repeat until the extractions match what you want.
  5. Use the Improve button in the dashboard when you want to correct a specific standardization and feed that correction back into the schema. It's the fastest way to tune descriptions based on real mistakes.

Signs your schema is ready

  • Running two documents from the same source produces similarly shaped output (same array lengths where applicable, same fields populated).
  • When you show the output to someone who hasn't seen the schema, they can match each value back to a specific place on the document.
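The shape check can be automated. A minimal sketch that reduces two outputs to their shape - which fields are populated and how long each array is - and compares them (the `shape` helper is illustrative):

```python
def shape(doc: dict) -> dict:
    """Reduce an extraction output to its 'shape': which fields are
    populated, and the length of each array field."""
    result = {}
    for key, value in doc.items():
        if isinstance(value, list):
            result[key] = f"list[{len(value)}]"
        else:
            result[key] = "populated" if value not in (None, "") else "empty"
    return result

a = {"invoiceNumber": "INV-001", "lineItems": [{}, {}, {}], "notes": None}
b = {"invoiceNumber": "INV-002", "lineItems": [{}, {}, {}], "notes": None}
print(shape(a) == shape(b))  # True: same shape, different values
```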

Signs your schema needs more work

  • The model picks different "correct" values on two similar documents.
  • Fields you expect to always be populated are blank sometimes.
  • Fields you expect to always be blank are populated sometimes.

All of these point to ambiguity in the schema - usually in a description, sometimes in the structure.

Quick checklist before shipping

Before you ship a schema to production:

  1. Search every description for soft words (prefer, typically, usually, may, might, should, often). Rewrite any hit with a definite rule.
  2. For each pair of fields with similar names, check both descriptions explicitly say what to do with the other's column.
  3. For every array field, make sure the description commits to one source table unambiguously.
  4. Put example values in the examples array, not in description text. Confirm the examples match the output format the description specifies.
  5. Run the schema on at least 2 real documents and eyeball the output. Fix anything that looks off.
  6. Remove any field that isn't populated on most of your documents or isn't actually used downstream.

Anti-patterns to avoid

  • Vague descriptions - "The invoice number." Be specific about where and how to find it.
  • Hedging verbs - "may", "could", "should" in descriptions and guidelines. Replace with definite rules.
  • Example values inline in descriptions - put them in the examples array instead.
  • Mismatched examples and format rules - if the description says YYYY-MM-DD, don't include "Nov 13 2024" as an example.
  • Field names as specs - relying on unitPrice vs lineTotal to disambiguate. Always explain the source column for each.
  • Orphaned fields - fields you added "just in case" that never populate or populate incorrectly. Remove them.
  • Implicit domain knowledge - assuming the model knows a technical term or convention specific to your industry. Spell it out in guidelines.
  • Dual-purpose arrays - an array that means "detail rows or summary rows depending on the document". Commit to one; use two arrays if you need both.
  • Deeply nested arrays when the data is naturally flat - consider flattening to top-level arrays with foreign keys if extraction is inconsistent.

When to contact support

If you've tightened your descriptions, removed hedging words, disambiguated look-alike fields, and the extraction is still wrong the same way across multiple documents - that's likely an engine limitation, not a schema issue. Share:

  • The schema ID
  • 1-3 standardization IDs showing the problem
  • A brief note on what you expect vs what's coming out

Support can inspect the extraction trace, identify whether it's a known behavior, and either patch it or suggest a workaround.
