Missing pages vs. blank fields

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0

Distinguishing MISSING pages from BLANK fields

Many forms processed by GenAI IDP are sparsely populated — entire sub-sections, worksheets, or supplemental pages may be legitimately omitted by the submitter. With the default extraction behavior, the system has no way to tell whether a field is empty because:

  • the page was submitted but the field was unfilled (BLANK), or
  • the page wasn’t submitted at all (MISSING)

This guide explains the opt-in feature that lets you distinguish these two cases.

Why prompt-engineering doesn’t solve this

A natural first instinct is to instruct the LLM in extraction.task_prompt to “omit fields when the page is missing.” This does not work reliably because the LLM sees only the OCR text it was given — it has no signal about which pages should exist for the document class. The fix has to live in the system, not the prompt.

The feature is opt-in and requires three coordinated config changes:

1. Declare page types on the document class

Use x-aws-idp-page-types on the document class to enumerate the named page sub-types it can include. Each entry has a name, an optional description, and an x-aws-idp-document-page-content-regex used to detect the page type from per-page OCR text:

classes:
  - $id: BankStatement
    x-aws-idp-document-type: BankStatement
    x-aws-idp-page-types:
      - name: AccountSummary
        description: First page with account holder + summary
        x-aws-idp-document-page-content-regex: "(?i)account summary|statement period"
      - name: TransactionsWorksheet
        description: Transactions ledger
        x-aws-idp-document-page-content-regex: "(?i)transactions? (history|detail)"
      - name: InternationalTransfers
        description: International transfers supplement (often omitted)
        x-aws-idp-document-page-content-regex: "(?i)international (transfers|wires)"
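
To make the detection step concrete, here is a minimal Python sketch of how a regex-based page-type pass over per-page OCR text could work. It is an illustration only, not the accelerator's actual code: the function name `resolve_page_types`, the `PAGE_TYPES` list, and the sample page texts are all hypothetical. It assumes patterns are tried in declaration order and that a pattern that fails to compile is skipped.

```python
import re

# Hypothetical in-memory form of the x-aws-idp-page-types entries shown above.
PAGE_TYPES = [
    {"name": "AccountSummary", "regex": r"(?i)account summary|statement period"},
    {"name": "TransactionsWorksheet", "regex": r"(?i)transactions? (history|detail)"},
    {"name": "InternationalTransfers", "regex": r"(?i)international (transfers|wires)"},
]


def resolve_page_types(page_texts, page_types=PAGE_TYPES):
    """Map each page id to the first page type whose regex matches its OCR text."""
    compiled = []
    for entry in page_types:
        try:
            compiled.append((entry["name"], re.compile(entry["regex"])))
        except re.error:
            # A malformed pattern is skipped; the remaining entries still apply.
            continue

    resolved = {}
    for page_id, text in page_texts.items():
        for name, pattern in compiled:
            if pattern.search(text):
                resolved[page_id] = name
                break  # first match wins, in declaration order
    return resolved


# Made-up OCR text for three pages; page 3 matches no declared page type.
pages = {
    "1": "Account Summary\nStatement Period: 01/01 - 01/31",
    "2": "Transaction History\nDate | Description | Amount",
    "3": "Thank you for banking with us.",
}
print(resolve_page_types(pages))
# {'1': 'AccountSummary', '2': 'TransactionsWorksheet'}
```

In this sketch a page that matches no pattern is simply left out of the mapping, so InternationalTransfers would count as a page type that is not present.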

2. Annotate properties with their source page-types

Use x-aws-idp-source-page-types on each property to declare which page sub-types contain its source data. Properties without this annotation are never treated as MISSING:

properties:
  AccountHolderAddress:
    type: object
    x-aws-idp-source-page-types: ["AccountSummary"]
  Transactions:
    type: array
    x-aws-idp-source-page-types: ["TransactionsWorksheet"]
  InternationalTransfers:
    type: array
    x-aws-idp-source-page-types: ["InternationalTransfers"]

3. Enable missing-field handling in the extraction config

Set extraction.missing_field_handling.enabled to true and choose how MISSING fields are represented in the output:

extraction:
  missing_field_handling:
    enabled: true
    representation: omit  # or "null_with_metadata"

  • omit: removes the key from inference_result entirely. Best when downstream systems treat missing keys as null and benefit from a smaller payload.
  • null_with_metadata: keeps the key as null and adds the field name to a sibling missing_fields array in the output JSON. Best when downstream systems require a stable, predictable key set.
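
The difference between the two representations can be sketched in a few lines of Python. This is a hedged illustration, not the shipped implementation: `apply_representation` is a hypothetical helper, the sample values are made up, and it assumes the missing_fields array sits alongside inference_result in the output.

```python
import copy


def apply_representation(result, missing_fields, representation="omit"):
    """Return a copy of `result` with MISSING fields rendered per the chosen mode."""
    out = copy.deepcopy(result)
    inference = out.setdefault("inference_result", {})
    if representation == "omit":
        for field in missing_fields:
            inference.pop(field, None)  # drop the key entirely
    elif representation == "null_with_metadata":
        for field in missing_fields:
            inference[field] = None  # keep a stable key set
        # Assumption: the flat list is a sibling of inference_result.
        out["missing_fields"] = list(missing_fields)
    else:
        raise ValueError(f"unknown representation: {representation!r}")
    return out


# Illustrative raw extraction output (values are made up).
raw = {
    "inference_result": {
        "AccountHolderAddress": {"City": "Example City"},
        "Transactions": [],
        "InternationalTransfers": [],
    }
}
missing = ["Transactions", "InternationalTransfers"]

omitted = apply_representation(raw, missing, "omit")
nulled = apply_representation(raw, missing, "null_with_metadata")
print(sorted(omitted["inference_result"]))  # ['AccountHolderAddress']
print(nulled["inference_result"]["Transactions"], nulled["missing_fields"])
```

With omit, consumers have to handle absent keys; with null_with_metadata, every declared key is present and the missing_fields list tells them which nulls mean "page not submitted".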

The Document Schema editor in View / Edit Configuration → Document Schema has first-class widgets for both extensions:

  • The class inspector (right pane when a class is selected) has a “Page Types” section with an “Add Page Type” button that opens an expandable editor for each entry’s name, description, and regex.
  • The property inspector (right pane when a property is selected) shows a “Source Page Types” multi-select under a “Missing-Page Handling” header. The dropdown is populated from the page-type names declared on the parent class — so you cannot reference a page-type that doesn’t exist.

When no page types are declared on the parent class, the property-level widget is hidden so the inspector stays uncluttered.
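
If you edit the configuration by hand rather than through the editor, the same referential check is easy to approximate. The following is a small sketch under the assumption that the class schema has been loaded into a Python dict with the keys shown earlier; `undeclared_page_type_refs` is a hypothetical helper, not part of the product.

```python
def undeclared_page_type_refs(class_schema):
    """Return {property: [page-type names]} for references to undeclared page types."""
    declared = {pt["name"] for pt in class_schema.get("x-aws-idp-page-types", [])}
    problems = {}
    for prop, schema in class_schema.get("properties", {}).items():
        unknown = [
            name
            for name in schema.get("x-aws-idp-source-page-types", [])
            if name not in declared
        ]
        if unknown:
            problems[prop] = unknown
    return problems


# Illustrative class schema with one misspelled reference.
bank_statement = {
    "x-aws-idp-page-types": [
        {"name": "AccountSummary"},
        {"name": "TransactionsWorksheet"},
    ],
    "properties": {
        "AccountHolderAddress": {"x-aws-idp-source-page-types": ["AccountSummary"]},
        "Transactions": {"x-aws-idp-source-page-types": ["TransactionWorksheet"]},
        "StatementDate": {"type": "string"},
    },
}
print(undeclared_page_type_refs(bank_statement))
# {'Transactions': ['TransactionWorksheet']}
```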

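Once the page-type resolver has run (the regex pass that assigns each page one of the declared page types), the per-page text passed to the extraction prompt is annotated with the resolved page type. Compare the default form: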

--- PAGE 1 ---
Account Holder: ...
--- PAGE 2 ---
Date | Description ...

with the annotated form:

--- PAGE 1 [AccountSummary] ---
Account Holder: ...
--- PAGE 2 [TransactionsWorksheet] ---
Date | Description ...
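
The annotated headers can be produced with a simple formatting step. This sketch assumes a page-id-to-page-type mapping like the one the resolver produces; `render_pages` is a hypothetical name, and leaving unresolved pages with the plain header is an assumption of this example.

```python
def render_pages(page_texts, page_id_to_page_type):
    """Join per-page OCR text, tagging each page header with its resolved page type."""
    blocks = []
    for page_id, text in page_texts.items():
        page_type = page_id_to_page_type.get(page_id)
        # Pages with no resolved type keep the plain header (assumption).
        label = f" [{page_type}]" if page_type else ""
        blocks.append(f"--- PAGE {page_id}{label} ---\n{text}")
    return "\n".join(blocks)


page_texts = {"1": "Account Holder: ...", "2": "Date | Description ..."}
resolved = {"1": "AccountSummary", "2": "TransactionsWorksheet"}
print(render_pages(page_texts, resolved))
```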

The annotation is a soft hint — it can help the LLM scope its attention to the correct page when extracting. The actual enforcement of MISSING semantics happens after the LLM responds, in the post-processing pass.

When the feature is enabled and the class declares page types, the section’s result.json gains two new sibling keys to inference_result:

{
  "document_class": {"type": "BankStatement"},
  "inference_result": {
    "AccountHolderAddress": { ... }
  },
  "page_type_resolution": {
    "declared": true,
    "present_page_types": ["AccountSummary"],
    "missing_page_types": ["TransactionsWorksheet", "InternationalTransfers"],
    "page_id_to_page_type": {"1": "AccountSummary"}
  },
  "missing_fields_report": [
    {
      "field": "Transactions",
      "reason": "page types not present",
      "expected_page_types": ["TransactionsWorksheet"]
    },
    {
      "field": "InternationalTransfers",
      "reason": "page types not present",
      "expected_page_types": ["InternationalTransfers"]
    }
  ]
}

When representation: null_with_metadata is set, an additional missing_fields array lists the affected field names for downstream consumers that prefer a flat list.
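
A downstream consumer can use these keys to tell the two cases apart. The sketch below is one possible reading of the output shape shown above, not an official client: `field_status` is a hypothetical helper, the sample JSON is made up, and treating empty strings, lists, and objects as BLANK is an assumption of this example.

```python
import json


def field_status(result, field):
    """Classify a field as MISSING, BLANK, or PRESENT from a parsed result.json."""
    reported = {entry["field"] for entry in result.get("missing_fields_report", [])}
    if field in reported:
        return "MISSING"  # none of the field's source pages were submitted
    value = result.get("inference_result", {}).get(field)
    if value in (None, "", [], {}):
        return "BLANK"  # page was present (or field unannotated) but nothing extracted
    return "PRESENT"


# Illustrative result.json content (values are made up).
sample = json.loads("""
{
  "inference_result": {
    "AccountHolderAddress": {"City": "Example City"},
    "AccountNickname": null
  },
  "missing_fields_report": [
    {"field": "Transactions",
     "reason": "page types not present",
     "expected_page_types": ["TransactionsWorksheet"]}
  ]
}
""")

for name in ("AccountHolderAddress", "AccountNickname", "Transactions"):
    print(name, field_status(sample, name))
```

Checking missing_fields_report first means the helper behaves the same under both representations, since the report does not depend on whether the key was omitted or set to null.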

  • Partial presence is BLANK. If a property declares multiple source page-types and any of them are detected, the field is treated as BLANK (left as-is), not MISSING. The intent is conservative — only mark MISSING when no declared source page is present.
  • First-match-wins per page. Page-type entries are evaluated in declaration order. Place more specific patterns before more general ones.
  • Malformed regex is skipped. A bad regex on one page-type entry is logged as a warning and ignored; other entries continue to work.
  • Properties without x-aws-idp-source-page-types are never affected by this feature, regardless of whether the feature is enabled.
  • Nested properties. v1 supports top-level properties only. To mark nested fields as missing, declare the annotation on the containing object/array property at the top level.
  • BDA mode. This feature is Pipeline-mode-only. BDA owns extraction end-to-end and would need separate design work to support page-level absence semantics.
  • Detection cost. Detection is a deterministic regex pass over per-page OCR text — no extra LLM calls, no measurable runtime overhead.
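
The partial-presence and unannotated-property rules above can be summarized as a short decision function. This is a sketch under the assumptions stated in the list, with a hypothetical `find_missing_fields` helper and an invented `Fees` property that draws on two page types; it only looks at top-level properties.

```python
def find_missing_fields(properties, present_page_types):
    """Build report entries for top-level properties whose source pages are all absent."""
    present = set(present_page_types)
    report = []
    for name, schema in properties.items():
        sources = schema.get("x-aws-idp-source-page-types")
        if not sources:
            continue  # unannotated properties are never treated as MISSING
        if present.intersection(sources):
            continue  # at least one source page present: BLANK at worst, left as-is
        report.append(
            {
                "field": name,
                "reason": "page types not present",
                "expected_page_types": list(sources),
            }
        )
    return report


# Hypothetical schema: Fees may come from either of two page types.
properties = {
    "AccountHolderAddress": {"x-aws-idp-source-page-types": ["AccountSummary"]},
    "Fees": {"x-aws-idp-source-page-types": ["AccountSummary", "TransactionsWorksheet"]},
    "Transactions": {"x-aws-idp-source-page-types": ["TransactionsWorksheet"]},
    "StatementDate": {"type": "string"},
}
report = find_missing_fields(properties, present_page_types=["AccountSummary"])
print([entry["field"] for entry in report])  # ['Transactions']
```

Fees is not reported because one of its two source page types was detected, which matches the conservative intent: a field is MISSING only when none of its declared source pages is present.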

The feature is fully additive. With extraction.missing_field_handling.enabled: false (the default) or with no x-aws-idp-page-types declared on the class, behavior is byte-for-byte identical to earlier releases.