Distinguishing MISSING pages from BLANK fields

Many forms processed by GenAI IDP are sparsely populated — entire sub-sections, worksheets, or supplemental pages may be legitimately omitted by the submitter. With the default extraction behavior, the system has no way to tell whether a field is empty because:

the page was submitted but the field was unfilled (BLANK), or
the page wasn’t submitted at all (MISSING)

This guide explains the opt-in feature that lets you distinguish these two cases.

Why prompt-engineering doesn’t solve this

A natural first instinct is to instruct the LLM in extraction.task_prompt to “omit fields when the page is missing.” This does not work reliably because the LLM sees only the OCR text it was given — it has no signal about which pages should exist for the document class. The fix has to live in the system, not the prompt.

Configuration

The feature is opt-in and requires three coordinated config changes:

1. Declare page sub-types on the class

Use x-aws-idp-page-types on the document class to enumerate the named page sub-types it can include. Each entry has a name, an optional description, and an x-aws-idp-document-page-content-regex that is matched against per-page OCR text to detect the page type:

classes:
  - $id: BankStatement
    x-aws-idp-document-type: BankStatement
    x-aws-idp-page-types:
      - name: AccountSummary
        description: First page with account holder + summary
        x-aws-idp-document-page-content-regex: "(?i)account summary|statement period"
      - name: TransactionsWorksheet
        description: Transactions ledger
        x-aws-idp-document-page-content-regex: "(?i)transactions? (history|detail)"
      - name: InternationalTransfers
        description: International transfers supplement (often omitted)
        x-aws-idp-document-page-content-regex: "(?i)international (transfers|wires)"
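
To make the detection step concrete, here is a minimal Python sketch of how the declared patterns could be applied to one page of OCR text. The names PAGE_TYPES and detect_page_type are hypothetical and simply mirror the YAML above; the actual resolver inside GenAI IDP may be structured differently.

```python
import re
from typing import Optional

# Illustrative copy of the x-aws-idp-page-types entries declared above.
# The list order matters: entries are tried in declaration order.
PAGE_TYPES = [
    ("AccountSummary", r"(?i)account summary|statement period"),
    ("TransactionsWorksheet", r"(?i)transactions? (history|detail)"),
    ("InternationalTransfers", r"(?i)international (transfers|wires)"),
]


def detect_page_type(page_text: str) -> Optional[str]:
    """Return the name of the first page type whose regex matches, else None."""
    for name, pattern in PAGE_TYPES:
        # re.search scans the whole page text; (?i) makes the match case-insensitive.
        if re.search(pattern, page_text):
            return name
    return None
```

Under these assumptions, a page whose OCR text contains "Statement Period" resolves to AccountSummary, and a page that matches none of the patterns has no page type.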

2. Annotate properties with their source page-types

Use x-aws-idp-source-page-types on each property to declare which page sub-types contain its source data. Properties without this annotation are never treated as MISSING:

    properties:
      AccountHolderAddress:
        type: object
        x-aws-idp-source-page-types: ["AccountSummary"]
      Transactions:
        type: array
        x-aws-idp-source-page-types: ["TransactionsWorksheet"]
      InternationalTransfers:
        type: array
        x-aws-idp-source-page-types: ["InternationalTransfers"]
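
The sketch below shows one way the annotations could be turned into a list of MISSING fields once the present page types are known. It is an illustration only: SOURCE_PAGE_TYPES, find_missing_fields, and the unannotated StatementNotes property are hypothetical, and the report entries copy the shape shown later in this guide.

```python
from typing import Dict, List, Set

# Hypothetical in-memory view of the property annotations shown above.
SOURCE_PAGE_TYPES: Dict[str, List[str]] = {
    "AccountHolderAddress": ["AccountSummary"],
    "Transactions": ["TransactionsWorksheet"],
    "InternationalTransfers": ["InternationalTransfers"],
    "StatementNotes": [],  # hypothetical property with no annotation
}


def find_missing_fields(present_page_types: Set[str]) -> List[dict]:
    """List annotated fields for which none of the source page types is present."""
    report = []
    for field, expected in SOURCE_PAGE_TYPES.items():
        if not expected:
            continue  # unannotated properties are never treated as MISSING
        if present_page_types.isdisjoint(expected):
            report.append({
                "field": field,
                "reason": "page types not present",
                "expected_page_types": list(expected),
            })
    return report
```

Calling find_missing_fields({"AccountSummary"}) reports Transactions and InternationalTransfers, while AccountHolderAddress and the unannotated property are left alone.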

3. Turn on the missing-field handler

extraction:
  missing_field_handling:
    enabled: true
    representation: omit  # or "null_with_metadata"

omit — removes the key from inference_result entirely. Best when downstream systems treat missing keys as null and benefit from a smaller payload.
null_with_metadata — keeps the key as null and adds the field name to a sibling missing_fields array in the output JSON. Best when downstream systems require a stable, predictable key set.
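
The difference between the two representations can be sketched in a few lines of Python. This is a hedged illustration, not the shipped post-processor: apply_representation is a hypothetical helper, and placing missing_fields at the top level next to inference_result is an assumption based on the description above.

```python
import copy
from typing import Any, Dict, List


def apply_representation(
    result: Dict[str, Any], missing: List[str], representation: str
) -> Dict[str, Any]:
    """Return a copy of result with MISSING fields rendered per the chosen mode."""
    out = copy.deepcopy(result)  # leave the caller's dict untouched
    inference = out.setdefault("inference_result", {})
    if representation == "omit":
        for field in missing:
            inference.pop(field, None)  # drop the key entirely
    elif representation == "null_with_metadata":
        for field in missing:
            inference[field] = None  # keep the key so the key set stays stable
        out["missing_fields"] = list(missing)  # assumed location of the flat list
    else:
        raise ValueError(f"unknown representation: {representation!r}")
    return out
```

With omit, a consumer must treat an absent key as MISSING; with null_with_metadata, it can rely on every key being present and consult missing_fields to tell MISSING from BLANK.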

Editing in the Web UI

The Document Schema editor in View / Edit Configuration → Document Schema has first-class widgets for both extensions:

The class inspector (right pane when a class is selected) has a “Page Types” section with an “Add Page Type” button that opens an expandable editor for each entry’s name, description, and regex.
The property inspector (right pane when a property is selected) shows a “Source Page Types” multi-select under a “Missing-Page Handling” header. The dropdown is populated from the page-type names declared on the parent class — so you cannot reference a page-type that doesn’t exist.

When no page types are declared on the parent class, the property-level widget is hidden so the inspector stays uncluttered.

What the LLM sees

Once the resolver runs, the per-page text passed to the extraction prompt is annotated with the resolved page type. Compare:

--- PAGE 1 ---
Account Holder: ...

--- PAGE 2 ---
Date  | Description ...

with the annotated form:

--- PAGE 1 [AccountSummary] ---
Account Holder: ...

--- PAGE 2 [TransactionsWorksheet] ---
Date  | Description ...

The annotation is a soft hint — it can help the LLM scope its attention to the correct page when extracting. The actual enforcement of MISSING semantics happens after the LLM responds, in the post-processing pass.
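
As a rough sketch of how the annotated page headers could be produced, the function below joins per-page text and adds the resolved page type in brackets. The name render_pages is hypothetical, and rendering an unresolved page with the plain header is an assumption rather than documented behavior.

```python
from typing import Dict, Optional


def render_pages(pages: Dict[str, str], page_types: Dict[str, Optional[str]]) -> str:
    """Join per-page OCR text, labelling each page header with its resolved type."""
    blocks = []
    for page_id, text in pages.items():
        page_type = page_types.get(page_id)
        # Assumption: pages without a resolved type keep the unannotated header.
        label = f" [{page_type}]" if page_type else ""
        blocks.append(f"--- PAGE {page_id}{label} ---\n{text}")
    return "\n\n".join(blocks)
```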

What appears in the output JSON

When the feature is enabled and the class declares page types, the section’s result.json gains two new keys alongside inference_result, page_type_resolution and missing_fields_report:

{
  "document_class": {"type": "BankStatement"},
  "inference_result": {
    "AccountHolderAddress": { ... }
  },
  "page_type_resolution": {
    "declared": true,
    "present_page_types": ["AccountSummary"],
    "missing_page_types": ["TransactionsWorksheet", "InternationalTransfers"],
    "page_id_to_page_type": {"1": "AccountSummary"}
  },
  "missing_fields_report": [
    {
      "field": "Transactions",
      "reason": "page types not present",
      "expected_page_types": ["TransactionsWorksheet"]
    },
    {
      "field": "InternationalTransfers",
      "reason": "page types not present",
      "expected_page_types": ["InternationalTransfers"]
    }
  ]
}

When representation: null_with_metadata is set, an additional missing_fields array lists the affected field names for downstream consumers that prefer a flat list.
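
A downstream consumer can use missing_fields_report to separate the two cases. The following is a minimal sketch under stated assumptions: field_status is a hypothetical helper, the sample payload is made up for illustration (including the AccountNickname field and the address value), and treating null or empty values as BLANK is a choice made for this example.

```python
import json

# Made-up result.json content in the shape shown above.
SAMPLE_RESULT = json.loads("""
{
  "inference_result": {
    "AccountHolderAddress": {"City": "Example City"},
    "AccountNickname": null
  },
  "missing_fields_report": [
    {"field": "Transactions", "reason": "page types not present",
     "expected_page_types": ["TransactionsWorksheet"]}
  ]
}
""")


def field_status(result: dict, field: str) -> str:
    """Classify a field as MISSING, BLANK, or PRESENT from a parsed result.json."""
    missing = {entry["field"] for entry in result.get("missing_fields_report", [])}
    if field in missing:
        return "MISSING"  # the source page was never submitted
    value = result.get("inference_result", {}).get(field)
    if value in (None, "", [], {}):
        return "BLANK"  # the page was there, but nothing was extracted
    return "PRESENT"
```

Because the check reads missing_fields_report rather than the key set of inference_result, it behaves the same under both omit and null_with_metadata.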

Edge cases and tuning

Partial presence is BLANK. If a property declares multiple source page-types and at least one of them is detected, the field is treated as BLANK (left as-is), not MISSING. The intent is conservative: a field is marked MISSING only when none of its declared source page types is present.
First-match-wins per page. Page-type entries are evaluated in declaration order. Place more specific patterns before more general ones.
Malformed regex is skipped. A bad regex on one page-type entry is logged as a warning and ignored; other entries continue to work.
Properties without x-aws-idp-source-page-types are never affected by this feature, regardless of whether the feature is enabled.
Nested properties. v1 supports top-level properties only. To mark nested fields as missing, declare the annotation on the containing object/array property at the top level.
BDA mode. This feature is Pipeline-mode-only. BDA owns extraction end-to-end and would need separate design work to support page-level absence semantics.
Detection cost. Detection is a deterministic regex pass over per-page OCR text — no extra LLM calls, no measurable runtime overhead.
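
Two of the rules above, skipping malformed regexes and first-match-wins ordering, can be illustrated together. This is a sketch, not the accelerator's code: compile_page_types, resolve_page_type, and the Broken and GenericTransfers entries are hypothetical.

```python
import logging
import re
from typing import List, Optional, Pattern, Tuple

logger = logging.getLogger(__name__)


def compile_page_types(entries: List[Tuple[str, str]]) -> List[Tuple[str, Pattern[str]]]:
    """Compile each pattern, logging and skipping entries with an invalid regex."""
    compiled = []
    for name, pattern in entries:
        try:
            compiled.append((name, re.compile(pattern)))
        except re.error as exc:
            # A bad pattern disables only its own entry; the others keep working.
            logger.warning("Skipping page type %r: invalid regex (%s)", name, exc)
    return compiled


def resolve_page_type(text: str, compiled: List[Tuple[str, Pattern[str]]]) -> Optional[str]:
    """Return the first matching page type in declaration order, else None."""
    for name, regex in compiled:
        if regex.search(text):
            return name
    return None


# Hypothetical declarations: the specific pattern is listed before the general one.
ENTRIES = [
    ("Broken", r"(unclosed"),
    ("InternationalTransfers", r"(?i)international (transfers|wires)"),
    ("GenericTransfers", r"(?i)transfers"),
]
```

If GenericTransfers were declared first, a page headed "International Transfers" would resolve to the general type, which is why more specific patterns belong earlier in the list.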

Backwards compatibility

The feature is fully additive. With extraction.missing_field_handling.enabled: false (the default) or with no x-aws-idp-page-types declared on the class, behavior is byte-for-byte identical to earlier releases.
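
The two conditions can be summarized as a single gate. The helper below is a hypothetical sketch that assumes the configuration and class schema are available as plain dictionaries using the key names from this guide.

```python
from typing import Any, Mapping


def missing_page_handling_active(
    config: Mapping[str, Any], class_schema: Mapping[str, Any]
) -> bool:
    """True only when the handler is enabled AND the class declares page types."""
    handling = config.get("extraction", {}).get("missing_field_handling", {})
    enabled = bool(handling.get("enabled", False))  # the default is disabled
    has_page_types = bool(class_schema.get("x-aws-idp-page-types"))
    return enabled and has_page_types
```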