What is a DocuPipe Schema

Overview

A Schema in DocuPipe defines the shape of the data you want pulled out of a document. Under the hood it is a JSON Schema (Draft 7) document, but DocuPipe only supports a constrained subset of the full JSON Schema spec. This article is the technical reference for that subset: what's in, what's out, and what happens to the parts of JSON Schema that we don't honor.

If you're looking for guidance on writing high-quality schemas (descriptions, examples, structure, iteration), see Designing a Good Schema. This article covers the spec; that one covers the craft.

What's in a schema

Two things travel together as the unit of a "schema" in DocuPipe:

The JSON Schema object - field names, types, and per-field metadata.
The Guidelines - free-text instructions that apply across the whole schema. Stored alongside the schema, not inside it. Use guidelines for rules that span fields ("treat 'company number' as VAT ID", "always prefer the handwritten value over the printed one") rather than rules tied to a single field (those go in the field's description).

When you create or edit a schema in the dashboard or via the API, both pieces are submitted together. Both are sent to the extraction model on every Standardization run.
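
As a rough illustration of the two pieces travelling together, the sketch below keeps a small schema and its guidelines side by side. The wrapper keys `schema` and `guidelines`, and the field names, are hypothetical and chosen only for this example; they are not DocuPipe's documented API payload.

```python
import json

# The JSON Schema object: field names, types, and per-field metadata.
invoice_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Key fields from a supplier invoice.",
    "type": "object",
    "properties": {
        "vat_id": {"type": "string", "description": "Supplier VAT ID."},
        "total": {"type": "number", "description": "Invoice total including tax."},
    },
}

# The guidelines: free text stored alongside the schema, not inside it.
guidelines = (
    "Treat 'company number' as VAT ID. "
    "Always prefer the handwritten value over the printed one."
)

# Hypothetical wrapper, for illustration only.
schema_unit = {"schema": invoice_schema, "guidelines": guidelines}
print(json.dumps(schema_unit, indent=2))
```

Note that the cross-field rules live in `guidelines`, while anything specific to one field stays in that field's `description`.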

Supported field types

DocuPipe supports six types:

string
number
integer
boolean
object - a nested group of fields
array - a list of items, where the item shape is defined by the items subschema

That's the complete list. Anything else (a custom type, null as a standalone type) is not supported.
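
To make the list concrete, here is a small example schema that uses each of the six types once, plus a helper that collects the types it finds. The field names are invented for illustration, and `collect_types` is a hypothetical helper, not part of DocuPipe.

```python
SUPPORTED_TYPES = {"string", "number", "integer", "boolean", "object", "array"}

purchase_order = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Header and line items of a purchase order.",
    "type": "object",
    "properties": {
        "po_number": {"type": "string", "description": "Purchase order number."},
        "total_amount": {"type": "number", "description": "Order total."},
        "line_count": {"type": "integer", "description": "Number of line items."},
        "is_paid": {"type": "boolean", "description": "Whether the order is marked paid."},
        "supplier": {
            "type": "object",
            "description": "The supplier block.",
            "properties": {
                "name": {"type": "string", "description": "Supplier name."},
            },
        },
        "line_items": {
            "type": "array",
            "description": "One entry per ordered item.",
            "items": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string", "description": "Item code."},
                    "quantity": {"type": "integer", "description": "Units ordered."},
                },
            },
        },
    },
}


def collect_types(node):
    """Return the set of every "type" value used anywhere in the schema."""
    found = {node["type"]}
    for child in node.get("properties", {}).values():
        found |= collect_types(child)
    if "items" in node:
        found |= collect_types(node["items"])
    return found


print(sorted(collect_types(purchase_order)))
```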

A field can have only one type. JSON Schema technically allows a field's type to be an array (e.g. ["integer", "null"] to express a nullable integer), but DocuPipe collapses such arrays to a single type during schema save: null is dropped, and of the remaining types the most general one wins (string > number > integer > boolean). You don't need multi-type arrays for nullability anyway - see Field optionality below.
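
The collapse rule can be sketched as follows. This is a minimal reading of the rule as described here, not DocuPipe's actual code: `collapse_type` is a hypothetical name, and the handling of lists that contain `object` or `array` is an assumption, since the ranking above only covers the four primitive types.

```python
# Most general first, as stated in the text.
GENERALITY_ORDER = ("string", "number", "integer", "boolean")


def collapse_type(type_value):
    """Collapse a JSON Schema "type" value to a single type name."""
    if isinstance(type_value, str):
        return type_value  # already a single type

    # Drop "null": nullability is implicit for every field.
    candidates = [t for t in type_value if t != "null"]

    # Pick the most general primitive that is present.
    for name in GENERALITY_ORDER:
        if name in candidates:
            return name

    # Assumption: a lone non-primitive such as ["object", "null"] keeps that type.
    if len(candidates) == 1:
        return candidates[0]

    raise ValueError(f"cannot collapse type list: {type_value!r}")


print(collapse_type(["integer", "null"]))
print(collapse_type(["integer", "string"]))
```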

Supported per-field metadata

Each primitive field can carry:

description - a short text explaining what the field is. The single biggest lever on extraction quality.
examples - an array of sample values. Helps the model understand format and content variations. Capped at 10 examples per field and 100 characters per example - anything beyond is silently truncated on save (extra examples dropped, long strings cut and suffixed with ...).
default - a default value. Only meaningful for primitive types. If the extraction comes back without the field, the default is filled in during post-processing.
enum - an array of allowed values. Use this when the field can only take one of a fixed set. Don't combine enum with examples - the enum values already define the answer space.
format - supported only for date strings, where it is set to "date". Examples for such a field should then follow YYYY-MM-DD.
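
The truncation rule for examples can be approximated like this. It is a sketch under assumptions: `truncate_examples` is a hypothetical helper, and whether the "..." suffix counts toward the 100-character cap is not stated here, so the sketch appends it after the first 100 characters.

```python
MAX_EXAMPLES = 10        # examples kept per field
MAX_EXAMPLE_CHARS = 100  # characters kept per string example


def truncate_examples(examples):
    """Drop examples beyond the cap and shorten over-long strings."""
    result = []
    for example in examples[:MAX_EXAMPLES]:
        if isinstance(example, str) and len(example) > MAX_EXAMPLE_CHARS:
            # Assumption: the suffix is added on top of the 100 kept characters.
            example = example[:MAX_EXAMPLE_CHARS] + "..."
        result.append(example)
    return result


invoice_date = {
    "type": "string",
    "format": "date",
    "description": "Date the invoice was issued.",
    "examples": truncate_examples(["2024-01-31", "2023-11-05"]),
}
print(invoice_date["examples"])
```

Running your own examples through a check like this before saving makes it obvious when something would be cut, rather than discovering it later in the saved schema.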

For object fields the supported subfields are type, description, examples, and properties. For array fields they are type, description, examples, and items. Anything else on an object or array is stripped on save.
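
One way to picture the stripping is as a per-type allow-list applied recursively. The sketch below is illustrative only: `strip_field` is a hypothetical function, and treating the primitive list above as an allow-list is an assumption based on the keys this article names.

```python
PRIMITIVE_KEYS = {"type", "description", "examples", "default", "enum", "format"}
OBJECT_KEYS = {"type", "description", "examples", "properties"}
ARRAY_KEYS = {"type", "description", "examples", "items"}


def strip_field(field):
    """Return a copy of `field` keeping only the keys listed for its type."""
    kind = field.get("type")
    if kind == "object":
        kept = {k: v for k, v in field.items() if k in OBJECT_KEYS}
        # Recurse into each nested field definition.
        kept["properties"] = {
            name: strip_field(sub)
            for name, sub in field.get("properties", {}).items()
        }
        return kept
    if kind == "array":
        kept = {k: v for k, v in field.items() if k in ARRAY_KEYS}
        if "items" in field:
            kept["items"] = strip_field(field["items"])
        return kept
    return {k: v for k, v in field.items() if k in PRIMITIVE_KEYS}


supplier = {
    "type": "object",
    "description": "The supplier block.",
    "required": ["name"],
    "properties": {
        "name": {"type": "string", "description": "Supplier name.", "pattern": "^[A-Z]"},
    },
}
print(strip_field(supplier))
```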

Root schema fields

The root of a DocuPipe schema is itself an object with exactly four keys:

$schema - the JSON Schema draft URI. Defaults to http://json-schema.org/draft-07/schema# if omitted.
description - a short text describing what this schema extracts. One or two sentences.
type - always "object".
properties - the top-level field definitions.

Anything else at the root (e.g. title, required, additionalProperties) is removed. The schema must end up with exactly these four root keys to be valid.
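
The root rules can be summarized in a few lines. This is a sketch of the behavior described above, with a hypothetical function name (`normalize_root`); it is not the processor DocuPipe runs.

```python
DEFAULT_DRAFT = "http://json-schema.org/draft-07/schema#"
ROOT_KEYS = ("$schema", "description", "type", "properties")


def normalize_root(schema):
    """Keep only the four root keys, defaulting $schema, and check the result."""
    root = {key: schema[key] for key in ROOT_KEYS if key in schema}
    root.setdefault("$schema", DEFAULT_DRAFT)

    missing = [key for key in ROOT_KEYS if key not in root]
    if missing:
        raise ValueError(f"root is missing required keys: {missing}")
    if root["type"] != "object":
        raise ValueError('root "type" must be "object"')
    return root


draft = {
    "title": "Invoice",
    "description": "Key fields from a supplier invoice.",
    "type": "object",
    "required": ["total"],
    "additionalProperties": True,
    "properties": {"total": {"type": "number", "description": "Invoice total."}},
}
print(sorted(normalize_root(draft)))
```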

What's NOT supported

DocuPipe's schema processor handles unsupported JSON Schema constructs in one of two ways: most are silently stripped on save, and a few cause the whole schema to be rejected. Knowing the difference matters - if you assume a stripped construct is doing something for you, you'll be surprised.

Silently stripped

These are removed from your schema during save without an error. If you include them and rely on them for behavior, the saved schema simply won't have them.

required - the JSON Schema array listing which fields are mandatory. Not honored. Every field in a DocuPipe schema is automatically nullable, and the standardization output always includes every field defined in the schema (with null where the value isn't present). See Field optionality below.
additionalProperties - whether unknown fields are permitted. Always treated as if false by our pipeline; you can't loosen it.
patternProperties, pattern - regex-based property matching or value validation.
$ref - schema references for reusing definitions. If you need the same shape in two places, inline it at both.
minimum, maximum - numeric bounds on number and integer fields. Bound expectations belong in the field description ("amount in USD, must be positive") so the model can act on them.
title at the root - the cosmetic schema title. Removed in favor of description.
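
Because numeric bounds are dropped on save, one option is to move them into the description yourself before saving, so the information still reaches the model. The helper below is a hypothetical pre-save convenience, not a DocuPipe feature, and the wording it appends is only one possible phrasing.

```python
def fold_bounds_into_description(field):
    """Copy minimum/maximum into the description text, then drop the keywords."""
    updated = {k: v for k, v in field.items() if k not in ("minimum", "maximum")}

    notes = []
    if "minimum" in field:
        notes.append(f"minimum {field['minimum']}")
    if "maximum" in field:
        notes.append(f"maximum {field['maximum']}")

    if notes:
        base = field.get("description", "").rstrip()
        updated["description"] = f"{base} (expected range: {', '.join(notes)})".strip()
    return updated


amount = {"type": "number", "description": "Amount in USD.", "minimum": 0}
print(fold_bounds_into_description(amount))
```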

Schema rejected

These cause schema save to fail entirely:

oneOf, anyOf, allOf - schema composition / union types. Not supported. If a field could be one of several shapes, model it as separate fields whose descriptions say when each applies, or as a single object with all possible members defined as nullable.
A field whose type isn't one of the six supported types (after multi-type collapse).
A schema that doesn't conform to the basic JSON Schema draft structure (caught by an upfront syntactic check).
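
A local pre-flight check can catch the first two rejection causes before you attempt a save. The sketch below is an assumption-laden approximation: `find_rejections` is a hypothetical helper, it skips multi-type lists (those are collapsed first), and it does not reproduce the upfront syntactic check.

```python
SUPPORTED_TYPES = {"string", "number", "integer", "boolean", "object", "array"}
COMPOSITION_KEYWORDS = ("oneOf", "anyOf", "allOf")


def find_rejections(node, path="$"):
    """List constructs in `node` that this article says cause a rejected save."""
    problems = []
    for keyword in COMPOSITION_KEYWORDS:
        if keyword in node:
            problems.append(f"{path}: uses {keyword}")

    node_type = node.get("type")
    # Only single type names are checked; type lists are collapsed on save.
    if isinstance(node_type, str) and node_type not in SUPPORTED_TYPES:
        problems.append(f"{path}: unsupported type {node_type!r}")

    for name, child in node.get("properties", {}).items():
        problems += find_rejections(child, f"{path}.{name}")
    if isinstance(node.get("items"), dict):
        problems += find_rejections(node["items"], f"{path}[]")
    return problems


candidate = {
    "type": "object",
    "properties": {
        "payer": {"anyOf": [{"type": "string"}, {"type": "object"}]},
        "issued": {"type": "date"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
for problem in find_rejections(candidate):
    print(problem)
```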

Recursion / self-reference

Recursive or self-referential schemas (a field that contains itself) are not supported. Inline the shape explicitly at each level to whatever depth you actually need. In practice three to four levels of nesting is the comfortable ceiling - schemas deeper than that tend to extract less reliably regardless of the spec.
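
If you generate schemas programmatically, the inlining can be done by a small builder that unrolls the shape to a fixed depth. The category tree, its field names, and `nested_category` are all invented for this example.

```python
def nested_category(depth):
    """Build a category object nested `depth` levels deep, with no $ref."""
    if depth < 1:
        raise ValueError("depth must be at least 1")

    node = {
        "type": "object",
        "description": "A product category.",
        "properties": {
            "name": {"type": "string", "description": "Category name."},
        },
    }
    if depth > 1:
        # Inline the child shape explicitly instead of referring back to the parent.
        node["properties"]["subcategories"] = {
            "type": "array",
            "description": "Child categories one level down.",
            "items": nested_category(depth - 1),
        }
    return node


three_levels = nested_category(3)
```

The recursion lives in the builder, not in the schema: the resulting dictionary is a plain, finite structure that stops at the depth you asked for.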

Field optionality and nullability

This is the part of the spec that surprises the most people, so it gets its own section.

Every field in a DocuPipe schema is implicitly optional and nullable. Whether or not the value is present in the document, the field will appear in the standardization output:

Value present → the field carries the extracted value.
Value absent or blank → the field carries null (JSON null, not the string "null").

This is enforced in two places. First, DocuPipe asks the extraction model to return null for fields it can't find in the document. Second, our post-processing step adds back any field defined in the schema that the model omitted, with a null value. The net effect: your output shape is stable across runs and across documents. Every schema field is always there, every blank cell is null.
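
The second step, adding back omitted fields, can be sketched as below. This is an illustration of the described effect rather than DocuPipe's post-processing code: `fill_missing_with_null` is a hypothetical name, defaults are not modeled, and recursing only into nested objects that are already present is an assumption.

```python
def fill_missing_with_null(schema_node, output):
    """Return a copy of `output` where every schema-defined field is present."""
    result = dict(output)
    for name, field in schema_node.get("properties", {}).items():
        if name not in result:
            result[name] = None  # omitted by the model: add back as null
        elif field.get("type") == "object" and isinstance(result[name], dict):
            # Assumption: nested objects that are present get the same treatment.
            result[name] = fill_missing_with_null(field, result[name])
    return result


schema = {
    "type": "object",
    "properties": {
        "po_number": {"type": "string"},
        "total": {"type": "number"},
        "supplier": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "vat_id": {"type": "string"},
            },
        },
    },
}
model_output = {"po_number": "PO-1", "supplier": {"name": "Acme"}}
print(fill_missing_with_null(schema, model_output))
```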

🚧
Don't try to add a required array to make this happen. It's already happening. The required array is stripped on save - it's not a lever the extraction engine reads - and asking the schema improvement assistant to add one will lead nowhere because the next saved version still won't have it. If you want extra certainty in writing, add a line to the schema's Guidelines: "Any field defined in the schema that is not answered or is left blank in the document must be present in the output with a value of null and must not be omitted from the final JSON."

The corollary: there is no way to mark a field as strictly required in the JSON Schema sense (i.e. fail the extraction if the value is missing from the document). All fields are optional. If a field is business-critical, the right move is to make that explicit in the field's description ("Purchase order number - this should appear on every PO document") and then validate downstream in your own code.
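
Downstream validation can be as simple as checking the fields you care about for null. The function and field names below are hypothetical, and the input is assumed to be the standardization output already parsed into a Python dictionary.

```python
def missing_critical_fields(result, critical_fields):
    """Return the names of critical fields whose value is null in `result`."""
    return [name for name in critical_fields if result.get(name) is None]


standardized = {"po_number": None, "total": 12.5, "currency": "USD"}
missing = missing_critical_fields(standardized, ["po_number", "total"])

if missing:
    # Decide here what "required" means for your workflow:
    # reject, retry, or route the document to manual review.
    print(f"Needs review, missing: {missing}")
```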

📘
The same nullability rule means you don't need ["string", "null"] types or any other JSON Schema trick to "make a field optional". Just declare the type you want - string, integer, etc. - and the field is automatically nullable. Multi-type arrays are collapsed to a single type on save anyway.

Related articles

Designing a Good Schema - the craft side: writing strong descriptions, structuring fields, iterating on real documents.
Schemas Across Multiple Documents - keeping one schema robust across documents from different sources.
Schemas for Non-English Documents - which language to write the schema itself in when documents aren't in English.
