Using One Schema Across Many Documents
How to build a single Schema that handles layout and wording variation across many similar documents
Overview
The core idea behind DocuPipe is that you build one Schema once and apply it to every similar document you have. A single invoice schema can cover invoices from 50 different suppliers. A single lease schema can cover leases from a dozen different landlords. The output shape stays identical across the whole batch, so your downstream systems always see the same fields in the same place.
The challenge is that real-world documents from different sources vary - different layouts, different wording, different label conventions. A schema built from a single example tends to overfit to that example and miss variations in the next one. This article covers the two practical ways to make your schema robust across that variation.
Same output fields across documents → one schema. Different output fields → separate schemas. This article is about the first case, where the documents vary in layout or wording but you want the same structured output from all of them.
Approach 1: Show the studio several samples while building
When you create a schema via the chat-based studio (https://www.docupipe.ai/dashboard/schemas → +SCHEMA), the AI bases the schema on whatever sample document is currently active. If you only show it one example, the schema will be shaped around that example's quirks.
Swap in additional samples mid-session so the schema accounts for the variations from the start:
- Start the schema chat with one representative sample.
- Let the AI propose an initial schema, then review the preview.
- Click Change Document in the document bar and pick a different sample - ideally one from another source with a slightly different layout.
- Ask the AI to adjust the schema so it also handles this new sample. Point out anything it got wrong on this second example.
- Repeat with 2-3 more varied samples until the schema handles the range of formats you care about.
Each swap forces the AI to reconcile what it built for the previous sample with what it now sees - label differences, reordered sections, missing optional fields. The resulting schema carries wording that isn't tied to any single example.
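As an illustration of wording that isn't tied to one example, a field description can name the label variants seen across samples. The sketch below is a JSON-Schema-style fragment written as a Python dict. The structure, the field names, and the labels listed for `total_amount` are assumptions made for illustration, not DocuPipe's actual schema format; the helper `fields_missing_description` is likewise hypothetical.

```python
# Hypothetical, JSON-Schema-style fragment. The layout and field names are
# illustrative assumptions, not DocuPipe's actual schema format.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": (
                "Unique identifier of the invoice. May be labeled "
                "'Invoice Number' or 'Reference #'."
            ),
        },
        "total_amount": {
            "type": "number",
            # Label variants below are made-up examples.
            "description": (
                "Final amount due, including tax. May appear as 'Total', "
                "'Amount Due', or 'Balance Due'."
            ),
        },
    },
}


def fields_missing_description(schema: dict) -> list[str]:
    """Return names of fields whose description is empty or absent."""
    return [
        name
        for name, spec in schema["properties"].items()
        if not spec.get("description", "").strip()
    ]
```

A field without a description gives the AI nothing to go on when a new supplier uses an unfamiliar label, so a check like this can be a quick sanity pass before running a batch.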
Pick samples that differ in ways that matter - different vendors, different page counts, edge cases like missing fields. Showing three near-identical samples teaches the schema nothing new. Showing three samples that each expose a different quirk teaches it a lot.
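If you have many candidate samples, one rough way to choose among them is to keep only those that expose something not seen yet. The sketch below assumes you have already noted each sample's vendor, page count, and missing fields by hand; the `Sample` type and `pick_varied` helper are hypothetical and not part of DocuPipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    name: str
    vendor: str
    page_count: int
    missing_fields: frozenset[str] = frozenset()


def traits(sample: Sample) -> set[str]:
    """Describe a sample by the quirks it would expose to the schema."""
    found = {f"vendor:{sample.vendor}", f"pages:{sample.page_count}"}
    found |= {f"missing:{field}" for field in sample.missing_fields}
    return found


def pick_varied(samples: list[Sample], limit: int) -> list[Sample]:
    """Walk the list in order, keeping a sample only if it adds an unseen trait."""
    seen: set[str] = set()
    chosen: list[Sample] = []
    for sample in samples:
        if len(chosen) == limit:
            break
        new_traits = traits(sample) - seen
        if new_traits:
            chosen.append(sample)
            seen |= new_traits
    return chosen
```

With this approach a second single-page invoice from a vendor you have already shown is skipped, while one from a new vendor, or one with a missing field, is kept.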
Approach 2: Iterate after running a real batch
The studio pass gets you a strong starting point, but it can't see variation you didn't show it. Once the schema is finalized, run it on a real batch of documents and compare the results side by side with the source PDFs.
When you find a mistake:
- Open the standardization result in the viewer.
- Click the Improve button.
- Explain in plain language what the AI got wrong and what you actually wanted. For example: "The invoice number on this document is labeled 'Reference #', not 'Invoice Number'. Pick it up from either label."
- Click Generate Schema. The AI rewrites the schema (creating a new version) with your correction baked into the relevant field descriptions or guidelines.
- Re-standardize your batch against the new schema and verify the fix holds across the documents that were failing, without regressing the ones that were working.
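The verification in the last step can be sketched as a comparison between the two runs. This assumes you have the results of both schema versions as plain dicts keyed by document ID, plus hand-checked expected values for the same documents; how you export those from DocuPipe is outside this sketch, and `compare_versions` is a hypothetical helper.

```python
def compare_versions(
    expected: dict[str, dict],
    old_results: dict[str, dict],
    new_results: dict[str, dict],
) -> dict[str, list[str]]:
    """Classify each document by how the new schema version changed it."""
    report = {"fixed": [], "regressed": [], "still_failing": [], "still_ok": []}
    for doc_id, truth in expected.items():
        was_ok = old_results.get(doc_id) == truth
        is_ok = new_results.get(doc_id) == truth
        if is_ok and not was_ok:
            report["fixed"].append(doc_id)
        elif was_ok and not is_ok:
            report["regressed"].append(doc_id)
        elif is_ok:
            report["still_ok"].append(doc_id)
        else:
            report["still_failing"].append(doc_id)
    return report
```

An empty `regressed` list together with the previously failing documents under `fixed` is the outcome you are looking for. Anything under `regressed` is a reason to refine the correction before adopting the new version.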
The Improve button also appears on the Schemas tab when you have a single schema selected - use that when you want to iterate on the schema independently of a specific result.
Every Improve cycle produces a new version of the schema, and your old standardizations still point at the old version. Re-run your batch against the new version to see the effect across the documents you care about.
When to stop iterating and split into multiple schemas
Most of the time, a single schema with well-tuned descriptions can handle a wide variety of input formats. The warning sign is when improving the schema for one document type consistently regresses another: you fix invoices and receipts get worse, then you fix receipts and invoices drift back. When that pattern repeats, you've hit the limit of what one schema can express cleanly.
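One way to spot that pattern is to record, for each schema version, the share of correct results per document type and count how often a gain for one type coincides with a loss for another. The sketch below assumes you measure those accuracy figures yourself; the `seesaw_steps` helper and the 0.01 tolerance are illustrative assumptions.

```python
def seesaw_steps(history: list[dict[str, float]], tolerance: float = 0.01) -> int:
    """Count version-to-version steps where one type improved while another got worse."""
    count = 0
    for before, after in zip(history, history[1:]):
        # Only compare document types measured in both versions.
        deltas = [after[t] - before[t] for t in before if t in after]
        improved = any(d > tolerance for d in deltas)
        regressed = any(d < -tolerance for d in deltas)
        if improved and regressed:
            count += 1
    return count


# Made-up accuracy figures for four consecutive schema versions.
history = [
    {"invoice": 0.80, "receipt": 0.90},
    {"invoice": 0.95, "receipt": 0.78},
    {"invoice": 0.82, "receipt": 0.92},
    {"invoice": 0.94, "receipt": 0.80},
]
print(seesaw_steps(history))
```

If most recent steps trade one document type against another, further Improve cycles are unlikely to settle, which is the plateau described above.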
At that point, split into multiple schemas and use Classification to route each document to the right one. See the Document Classification section for how to set that up.
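Conceptually, routing is a lookup from the class assigned to a document to the schema that should process it. The sketch below shows only that logic. The class labels and schema names are hypothetical, and the real routing is configured in DocuPipe as described in the Document Classification section.

```python
# Hypothetical class labels and schema names, for illustration only.
SCHEMA_BY_CLASS = {
    "invoice": "invoice_schema",
    "receipt": "receipt_schema",
}


def schema_for(document_class: str) -> str:
    """Return the schema to apply for a classified document."""
    try:
        return SCHEMA_BY_CLASS[document_class]
    except KeyError:
        raise ValueError(
            f"No schema configured for class {document_class!r}"
        ) from None
```

Raising on an unknown class, rather than falling back to a default schema, keeps a misclassified document from being silently extracted with the wrong fields.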
Don't split preemptively. Splitting adds a classification step, increases the number of schemas to maintain, and requires you to keep output fields in sync across them. Only split when iteration on a single schema has clearly plateaued.
Related articles
- Designing a Good Schema - the playbook for writing field descriptions and guidelines that hold up across varied input.
- Troubleshooting Extractions - what to do when a specific Standardization comes out wrong, including worked Improve examples.
- Quick Start - five-minute walkthrough of the end-to-end flow.
