V3 Extraction Engine

DocuPipe's next-generation agentic extraction engine - higher quality, fewer knobs, and smarter document handling

What's New

V3 is a ground-up rebuild of how DocuPipe extracts data from your documents. Instead of a fixed pipeline that requires you to pick the right configuration, V3 uses an agentic AI that reads your document page by page - deciding its own strategy, self-correcting mistakes, and adapting to whatever it finds.

The result: better extraction quality with less configuration on your end.

V3 vs V2 at a Glance

| | V2 | V3 |
|---|---|---|
| How it works | Fixed pipeline with manual knobs | Agentic AI that reads and reasons page by page |
| Configuration | You choose display mode, split mode, effort level | Just pick standard or high effort - everything else is automatic |
| Quality | ~89% accuracy on our eval suite | ~95% accuracy on the same suite |
| Page attribution | Not available | Know which page each extracted field came from |
| Schema required | Optional | Required (schemaless extraction stays on V2) |

No More Manual Knobs

In V2, you had to guess the right combination of display mode (spatial, sections, image), split mode (auto, never, all), and effort level (standard, high, extended) for your documents. Pick wrong, and extraction quality suffered.

V3 removes all of that. The agent inspects each page and automatically decides how to process it. You just tell it what to extract (your Schema) and it figures out the rest.

Effort Levels

V3 offers two effort levels that control which AI models power the extraction:

| Effort | Credits per Page | Best For |
|---|---|---|
| Standard (default) | 2 | Most documents - clean invoices, forms, reports |
| High | 4 | Complex or dense documents where maximum accuracy matters |

Both use the same agentic architecture. The difference is that high uses more capable (and more expensive) models in the extraction loop.
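
As a rough illustration of the pricing above, the sketch below estimates the credits an extraction would consume. It assumes billing is strictly pages multiplied by the per-page rate, with no minimums or rounding; the function name `estimate_credits` is hypothetical and not part of any DocuPipe SDK.

```python
# Credits per page for each V3 effort level, taken from the table above.
CREDITS_PER_PAGE = {"standard": 2, "high": 4}


def estimate_credits(page_count: int, effort_level: str = "standard") -> int:
    """Estimate credits for one extraction (assumes simple per-page billing)."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if effort_level not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown effort level: {effort_level!r}")
    return page_count * CREDITS_PER_PAGE[effort_level]


print(estimate_credits(12))          # 24
print(estimate_credits(12, "high"))  # 48
```

Under these assumptions, a 12-page document costs 24 credits on standard and 48 on high, so re-running only the documents that need it keeps the difference small.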

Note: V2's extended effort level (5 credits/page) is replaced by V3's agentic approach. V3 standard already outperforms V2 extended on most documents, at lower cost.

How V3 Processes Your Document

  1. Receives your Schema and understands what fields to look for
  2. Reads through the document page by page, extracting fields as it goes
  3. Decides when to move on - staying on dense pages longer and advancing past simple ones
  4. Validates and post-processes the result against your schema

Page Attribution

V3 tracks which page each extracted field came from. This is available as a pageMap on the Standardization result - a mapping from field paths to 1-indexed page numbers.

For example, if your schema extracts vendor.name from page 1 and lineItems.0.description from page 3, the pageMap maps vendor.name to 1 and lineItems.0.description to 3. This is useful for building UIs that jump to the source page when a user clicks on a field.
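
A minimal sketch of how a client might use such a mapping, assuming the pageMap arrives as a flat object keyed by dotted field path. The sample data and the helper names `source_page` and `fields_on_page` are illustrative, not part of the DocuPipe API.

```python
from typing import Dict, List, Optional

# Hypothetical pageMap shaped as described above:
# dotted field path -> 1-indexed page number.
page_map: Dict[str, int] = {
    "vendor.name": 1,
    "lineItems.0.description": 3,
}


def source_page(page_map: Dict[str, int], field_path: str) -> Optional[int]:
    """Return the 1-indexed page a field came from, or None if unknown."""
    return page_map.get(field_path)


def fields_on_page(page_map: Dict[str, int], page_number: int) -> List[str]:
    """List every field path attributed to the given 1-indexed page."""
    return sorted(path for path, page in page_map.items() if page == page_number)


print(source_page(page_map, "lineItems.0.description"))  # 3
print(fields_on_page(page_map, 1))                       # ['vendor.name']
```

A UI could call something like `source_page` when a user clicks a field and scroll its viewer to that page, or use `fields_on_page` to highlight everything extracted from the page currently in view.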

What Stays the Same

  • Schemas work exactly as before. No changes needed to your existing schemas.
  • Guidelines still apply and are passed to the V3 agent.
  • Output format is the same JSON structure you're used to.
  • Webhooks fire the same standardization.processed.success and standardization.processed.error events.
  • Downloads (JSON, Excel, XML, CSV) all work the same way.
  • Credit pricing for standard effort is unchanged at 2 credits per page.

Using V3

From the Dashboard

V3 is now the default when you run a Standardization from the dashboard. Select your documents, click Standardize, choose your schema, and optionally set the effort level to high for complex documents.

From the API

Use the POST /v3/standardize endpoint:

{
  "documentId": "your-document-id",
  "schemaId": "your-schema-id",
  "effortLevel": "standard"
}

Optional parameters:

  • guidelines - additional extraction instructions
  • useMetadata - include document metadata in extraction context
  • pages - extract only specific pages (0-indexed, unlike the 1-indexed page numbers in pageMap)

The response includes a jobId and standardizationId for tracking progress.
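
The request body above, including the optional parameters, could be assembled along these lines. This is a sketch under assumptions: the value types for guidelines (string), useMetadata (boolean), and pages (list of integers) are guesses, since only the parameter names are documented here, and `build_standardize_body` is a hypothetical helper. The base URL and authentication are intentionally left out.

```python
import json
from typing import Any, Dict, List, Optional

VALID_EFFORT_LEVELS = ("standard", "high")


def build_standardize_body(
    document_id: str,
    schema_id: str,
    effort_level: str = "standard",
    guidelines: Optional[str] = None,      # assumed to be free-form text
    use_metadata: Optional[bool] = None,   # assumed to be a boolean flag
    pages: Optional[List[int]] = None,     # 0-indexed page numbers
) -> Dict[str, Any]:
    """Build a JSON-serializable body for POST /v3/standardize."""
    if effort_level not in VALID_EFFORT_LEVELS:
        raise ValueError(f"effortLevel must be one of {VALID_EFFORT_LEVELS}")
    body: Dict[str, Any] = {
        "documentId": document_id,
        "schemaId": schema_id,
        "effortLevel": effort_level,
    }
    # Only send optional parameters the caller actually set.
    if guidelines is not None:
        body["guidelines"] = guidelines
    if use_metadata is not None:
        body["useMetadata"] = use_metadata
    if pages is not None:
        if any(p < 0 for p in pages):
            raise ValueError("pages are 0-indexed and cannot be negative")
        body["pages"] = list(pages)
    return body


body = build_standardize_body(
    "your-document-id", "your-schema-id", effort_level="high", pages=[0, 2]
)
print(json.dumps(body, indent=2))
```

Validating the effort level and page numbers on the client side catches the most likely mistakes, such as passing V2's extended level or 1-indexed page numbers, before any credits are spent.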

Note: V2 endpoints remain available and are not being removed. If you have integrations using POST /standardize or POST /standardize/batch, they will continue to work.

When to Use High Effort

Standard effort handles most documents well. Consider switching to high when:

  • Documents are long (10+ pages) with dense tables spanning multiple pages
  • You're seeing missed fields on complex layouts
  • The document has unusual formatting that requires deeper reasoning
  • Maximum accuracy is more important than cost

You can always start with standard and selectively re-run problem documents on high.
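
One way to implement that "standard first, high on demand" approach is to check each standard-effort result for missing required fields and queue only those documents for a second pass. The sketch below assumes results are plain nested JSON objects keyed by document ID; the helpers `get_path` and `documents_to_rerun`, the sample data, and the choice of "missing required field" as the trigger are all illustrative.

```python
from typing import Any, Dict, List


def get_path(data: Any, path: str) -> Any:
    """Follow a dotted path such as 'lineItems.0.description'; None if absent."""
    current = data
    for part in path.split("."):
        if isinstance(current, list):
            if not part.isdigit() or int(part) >= len(current):
                return None
            current = current[int(part)]
        elif isinstance(current, dict):
            if part not in current:
                return None
            current = current[part]
        else:
            return None
    return current


def is_empty(value: Any) -> bool:
    """Treat None, empty strings, and empty lists as missing values."""
    return value is None or value == "" or value == []


def documents_to_rerun(
    results: Dict[str, dict], required_paths: List[str]
) -> List[str]:
    """Return IDs of documents whose result lacks any required field."""
    return [
        doc_id
        for doc_id, data in results.items()
        if any(is_empty(get_path(data, p)) for p in required_paths)
    ]


# Hypothetical standard-effort results keyed by document ID.
results = {
    "doc-a": {"vendor": {"name": "Acme"}, "total": 120.5},
    "doc-b": {"vendor": {"name": ""}, "total": None},
}
print(documents_to_rerun(results, ["vendor.name", "total"]))  # ['doc-b']
```

Only the documents returned here would be resubmitted with effortLevel set to high, so the 4 credits per page rate applies to the problem cases rather than the whole batch.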

Supported Languages

V3 supports the same 100+ languages as V2, including print and handwriting recognition for English, Spanish, French, Hebrew, German, Italian, Portuguese, Chinese, Japanese, Korean, Russian, Arabic, and Thai.

FAQ

Do I need to change my schemas for V3? No. Existing schemas work as-is with V3.

Can I still use V2? Yes. V2 endpoints are not being removed. Schemaless extractions automatically use V2.

Is V3 slower than V2? V3 may take slightly longer on multi-page documents because it processes pages sequentially. Single-page documents are comparable in speed. For most workloads, the accuracy gain (~95% versus ~89% on our eval suite) outweighs the extra processing time.

Does V3 cost more? Standard effort is the same price as V2 standard (2 credits/page). High effort is 4 credits/page, the same as V2 high. V2's extended tier (5 credits/page) has no V3 equivalent because V3 standard already exceeds V2 extended quality for most documents.