Schemas for Non-English Documents

Overview

DocuPipe's extraction model handles documents in any of 100+ languages natively. You don't need to translate the document before processing it, and you don't need a language-specific schema. One schema written once can extract from documents in any supported language.

This article covers the one choice you do have to make when your source documents aren't in English: which language to write the schema itself in (field names, descriptions, guidelines, and examples).

Write descriptions and guidelines in English

Our extraction model - like most large language models - reasons most reliably in English. English is by far the richest part of its training data, so instructions written in English are interpreted more precisely than the same instructions in a lower-resource language. Subtle disambiguations (which column to pick, how to handle edge cases, what to do when a field is missing) land more consistently when the instructions are in English.

Best practice: write field names, description text, and guidelines in English, regardless of what language your source documents are in. This is purely an authoring choice and doesn't affect what language the output comes back in - it just means the model reads its instructions in the language it understands best.

Output values come back in the source language

DocuPipe does not translate. If an invoice says "Kunde: Muster GmbH", the value in your customer field will be "Muster GmbH" - not an English translation of it.

Note: Output language = source language. If you need values in English, add a dedicated translated field (e.g. a separate field with a description like "The customer name, translated to English"). The default behavior is to preserve the original text exactly as it appears on the document.
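As a rough sketch of the translated-field approach, the fragment below pairs a source-language field with a translated one. It assumes the same field layout as the other schema samples in this article. The field name paymentTermsEnglish and the example values are hypothetical, chosen only for illustration.

```json
{
  "paymentTerms": {
    "description": "The payment terms exactly as printed on the invoice. Do not translate.",
    "examples": ["Zahlbar innerhalb von 30 Tagen netto", "Sofort fällig"]
  },
  "paymentTermsEnglish": {
    "description": "The payment terms, translated to English.",
    "examples": ["Payable within 30 days net", "Due immediately"]
  }
}
```

Keeping the two fields separate means the original text stays available for auditing, and only the field that explicitly asks for a translation carries English examples.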

Embed source-language labels where they're the cue for a field

The schema authoring language is English, but your documents are not. When a document has a labeled field and you want the model to locate it by that label, include the source-language label verbatim inside the description. The surrounding prose stays in English; the label itself is copied in as-is.

For a German invoice schema:

{
  "customerName": {
    "description": "The customer or buyer name. Labeled 'Kunde', 'Käufer', or 'Rechnungsempfänger' on the invoice. Not the supplier ('Lieferant' or 'Absender').",
    "examples": ["Müller & Söhne GmbH", "Schneider Logistik AG", "Beispiel Handels KG"]
  },
  "billingAddress": {
    "description": "The customer's billing address as printed on the invoice. Include street, postal code, and city as a single string.",
    "examples": ["Hauptstraße 42, 80331 München", "Marktplatz 7a, 10178 Berlin", "Am Bahnhof 15, 20095 Hamburg"]
  },
  "paymentTerms": {
    "description": "The payment terms exactly as printed on the invoice, labeled 'Zahlungsbedingungen' or 'Zahlungsziel'.",
    "examples": ["Zahlbar innerhalb von 30 Tagen netto", "14 Tage 2% Skonto, 30 Tage netto", "Sofort fällig"]
  }
}

Three different jobs in three places:

Field name (customerName, billingAddress, paymentTerms) → English. Keeps the output shape predictable across schemas regardless of source language.
Description → English prose, with German labels (Kunde, Käufer, Zahlungsbedingungen) embedded verbatim as look-up cues. The English tells the model what the field is; the German tells it where on the page to find it.
Examples → German. These are samples of what the extracted value looks like, and the extracted value comes from a German document, so the examples are in German too. An English placeholder here would implicitly ask the model to translate.

The same pattern applies to every language:

Spanish customer name → "Labeled 'Cliente', 'Razón Social', or 'Nombre del Cliente'."
French total amount → "Labeled 'Total TTC' or 'Montant Total'."
Italian due date → "Labeled 'Data di scadenza' or 'Scadenza'."

You don't have to enumerate every possible wording - the model recognizes common synonyms and translations on its own, and will match an English description like "the invoice number" against whichever local-language label actually appears. Listing labels explicitly is most useful when (a) two fields on the document could plausibly collide (e.g. an invoice has both a Kundennummer and a separate customer reference and you need the model to pick one), or (b) your documents use an unusual or domain-specific term the model wouldn't guess. For the typical case, one representative label is plenty.
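Case (a) might look like the following sketch. The labels 'Kunden-Nr.', 'Ihr Zeichen', and 'Ihre Referenz' are assumptions about how such an invoice could be labeled, and the example values are invented. Substitute the labels and values that actually appear on your documents.

```json
{
  "customerNumber": {
    "description": "The account number the supplier assigns to the customer. Labeled 'Kundennummer' or 'Kunden-Nr.' on the invoice. Not the customer's own reference, which is labeled 'Ihr Zeichen' or 'Ihre Referenz'.",
    "examples": ["K-104872", "20457"]
  }
}
```

Naming the competing label in the description, not just the wanted one, is what lets the model tell the two fields apart.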

Examples should be realistic - in the source language

The examples field is the one part of the schema where the source language matters more than English. Examples aren't instructions for the model; they're samples of what a real extracted value from your documents looks like. They should therefore match the source-language text the model will actually produce, not English equivalents.

If your documents contain Spanish customer names, your customer examples should be Spanish names:

"examples": ["Muebles García S.A.", "Tecnología Ibérica SL", "Distribuciones Mediterráneo"]

Do not use English translations of the names. English placeholders in examples quietly tell the model to "translate the value", which is the opposite of what you want.

The rule from Designing a Good Schema still governs: examples must match the output format your description specifies. If the description says dates output as YYYY-MM-DD, every example is in YYYY-MM-DD regardless of source language. But for free-text fields where there's no normalization (names, addresses, item descriptions, notes), the examples should look like what the model will actually extract from the page - which means source-language text in the source language's conventions (accented characters, local name formats, etc.).

Warning: A common mistake is writing the description and examples when you only have mock data in English, then deploying against real documents in another language. The examples end up contradicting reality. Replace mock English examples with real values from an actual document as soon as you have one.

Numbers, dates, and currency

Documents from outside the English-speaking world often use different date and number conventions. In each case, the recommendation is to pick one target format and keep the description and examples consistent with it.

Dates

Default recommendation: always output dates in ISO 8601 format (YYYY-MM-DD), regardless of how the document prints them. ISO is unambiguous, sorts correctly as a string, and is what LLMs handle most reliably - there's no guesswork about whether 03/04/2024 means March 4th or April 3rd.

When you want ISO output, also set "format": "date" on the field. It's a standard JSON Schema format value that some downstream tools respect, and it gives the model a second signal that ISO is the target.

{
  "invoiceDate": {
    "type": "string",
    "format": "date",
    "description": "The invoice issue date. Labeled 'Rechnungsdatum' on German invoices. Output as YYYY-MM-DD regardless of how the document prints it (source is usually DD.MM.YYYY).",
    "examples": ["2024-11-13", "2024-06-30"]
  }
}

You don't have to use ISO - if you have a reason to keep the source format (downstream systems that expect DD.MM.YYYY, documents shown back to end users in local convention, etc.), that's fine. In that case be explicit in the description about what format you want, and make sure the examples match. The thing to avoid is ambiguity - leaving the format unspecified lets the model decide on its own, and different runs may decide differently.
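If you do keep the source format, the field might look like the sketch below. It assumes German-style DD.MM.YYYY dates and reuses the invoiceDate field from the earlier sample. "format": "date" is left out here, since that hint signals ISO output.

```json
{
  "invoiceDate": {
    "type": "string",
    "description": "The invoice issue date. Labeled 'Rechnungsdatum' on German invoices. Output as DD.MM.YYYY, exactly as the document prints it. Do not convert to another date format.",
    "examples": ["13.11.2024", "30.06.2024"]
  }
}
```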

Numbers

The format 1.234,56 (dot as thousands separator, comma as decimal separator) is common in many non-English-speaking countries. If you want normalized numeric output, say so in the description, and use a numeric type ("type": "number" or "type": "integer") rather than a string field.
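A minimal sketch of normalized numeric fields for a German invoice follows. The field names quantity and unitPrice, the labels 'Menge', 'Anzahl', and 'Einzelpreis', and the example values are illustrative assumptions, not fields DocuPipe requires.

```json
{
  "quantity": {
    "type": "integer",
    "description": "The line-item quantity. Labeled 'Menge' or 'Anzahl'. The source uses a dot as the thousands separator, so '1.500' means 1500. Output as a plain integer.",
    "examples": [1500, 12]
  },
  "unitPrice": {
    "type": "number",
    "description": "The net price per unit. Labeled 'Einzelpreis'. The source prints a comma as the decimal separator (e.g. '1.234,56'). Output as a plain number with a dot decimal and no thousands separator.",
    "examples": [1234.56, 0.85]
  }
}
```

Spelling out how to read '1.500' matters most for integer fields, where the same string would be a decimal under English conventions.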

Currency

If a document shows € 1.234,56, decide whether you want 1234.56 as a number (drops the currency symbol and normalizes to a dot decimal separator) or "€ 1.234,56" as a preserved string. State the choice explicitly in the description and make the examples match.
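The two options could be written as follows. This is a sketch only: both variants appear side by side for comparison, a real schema would normally pick one, and the field names are hypothetical.

```json
{
  "totalAmount": {
    "type": "number",
    "description": "The invoice total as a plain number. Drop the currency symbol and normalize separators (the source prints e.g. '€ 1.234,56').",
    "examples": [1234.56, 980.0]
  },
  "totalAmountAsPrinted": {
    "type": "string",
    "description": "The invoice total exactly as printed on the document, including the currency symbol and the original separators.",
    "examples": ["€ 1.234,56", "€ 980,00"]
  }
}
```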

Guidelines in English too

Schema-level Guidelines - the free-text instructions that apply across all fields - should also be written in English. Reference source-language labels verbatim when you need to.

"These are B2B invoices in German. Dates on the source documents print as DD.MM.YYYY - always output normalized to YYYY-MM-DD. When a line item is labeled 'MwSt. 0%', mark it as VAT-exempt. Prices are printed with comma as the decimal separator (e.g. '1.234,56') - output as plain numbers (1234.56)."

The guideline is in English, but the label cues (MwSt. 0%) and format conventions (DD.MM.YYYY, 1.234,56) are stated literally so the model can apply them to the actual document.

Mixed-language documents

Some documents mix languages on a single page - for example, a local-language invoice with English line-item descriptions from an international supplier, or a bilingual contract with paragraphs in both languages side by side.

The schema doesn't need to change. Write descriptions in English as usual, and if you care specifically about which language section a value comes from, say so in the description:

"The product description in English. If the document has both local-language and English descriptions side by side, take the English one. If only the local language is present, take that text as-is."
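Placed in a full field definition, that description might look like the sketch below. The field name productDescription and the example values are invented for illustration. The examples deliberately include one local-language value, because the description allows the original text as a fallback.

```json
{
  "productDescription": {
    "description": "The product description in English. If the document has both local-language and English descriptions side by side, take the English one. If only the local language is present, take that text as-is.",
    "examples": ["Stainless steel hex bolt M8 x 40", "Office chair, adjustable, black", "Edelstahl-Sechskantschraube M8 x 40"]
  }
}
```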

Summary

Write field names, descriptions, and guidelines in English - the model reasons most reliably in English.
Output values come back in the source language - DocuPipe does not translate.
Include source-language labels verbatim in descriptions when they're the cue the model should look for on the page.
The examples field should contain realistic values in the source language, matching what the model will actually extract from your documents.

Related articles

Designing a Good Schema - the general playbook for field descriptions, examples, and guidelines.
Using One Schema Across Many Documents - how to make one schema robust across layout and wording variation.
