
Customizing Classification

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0

Document classification is a key component of the GenAIIDP solution that categorizes each document or page into predefined classes. This guide explains how to customize classification to best suit your document processing needs.

The solution supports multiple classification approaches that vary by pattern:

Pattern 1: BDA-Based Classification

  • Classification is performed by the BDA (Bedrock Data Automation) project configuration
  • Uses BDA blueprints to define classification rules
  • Not configurable inside the GenAIIDP solution itself
  • Configuration happens at the BDA project level

Pattern 2: Bedrock LLM-Based Classification


Pattern 2 offers two main classification approaches, configured through different templates:

MultiModal Page-Level Classification with Sequence Segmentation (default)

  • Classifies each page independently using both text and image data
  • Uses sequence segmentation with BIO-like tagging for document boundary detection
  • Each page receives both a document type and a boundary indicator (“start” or “continue”)
  • Automatically segments multi-document packets where multiple documents may be combined
  • Works exceptionally well for complex document packets containing multiple documents of the same or different types
  • Supports optional few-shot examples to improve classification accuracy
  • Deployed when you select ‘few_shot_example_with_multimodal_page_classification’ during stack deployment
  • See the few-shot-examples.md documentation for details on configuring examples

The multimodal page-level classification implements a sophisticated sequence segmentation approach similar to BIO (Begin-Inside-Outside) tagging commonly used in NLP. This enables accurate segmentation of multi-document packets where a single file may contain multiple distinct documents.

How It Works:

Each page receives two pieces of information during classification:

  1. Document Type: The classification label (e.g., “invoice”, “letter”, “financial_statement”)
  2. Document Boundary: A boundary indicator that signals document transitions:
    • "start": Indicates the beginning of a new document (similar to “Begin” in BIO)
    • "continue": Indicates continuation of the current document (similar to “Inside” in BIO)

Benefits of Sequence Segmentation:

  • Multi-Document Packet Support: Accurately segments packets containing multiple documents
  • Type-Aware Boundaries: Detects when a new document of the same type begins
  • Automatic Section Creation: Pages are grouped into sections based on both type and boundaries
  • Improved Accuracy: Context-aware classification that considers document flow
  • No Manual Splitting Required: Eliminates the need to manually separate documents before processing

Example Segmentation:

Consider a packet with 6 pages containing two invoices and one letter:

Page 1: type="invoice", boundary="start" → Section 1 (Invoice #1)
Page 2: type="invoice", boundary="continue" → Section 1 (Invoice #1)
Page 3: type="letter", boundary="start" → Section 2 (Letter)
Page 4: type="letter", boundary="continue" → Section 2 (Letter)
Page 5: type="invoice", boundary="start" → Section 3 (Invoice #2)
Page 6: type="invoice", boundary="continue" → Section 3 (Invoice #2)

The system automatically creates three sections, properly separating the two invoices despite them having the same document type.
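
As an illustrative sketch (not the solution's actual code), the grouping rule above can be expressed as a single pass over (type, boundary) pairs: a new section starts whenever the boundary is "start" or the document type changes.

```python
# Illustrative sketch of BIO-like section grouping (not the solution's
# actual implementation): a new section starts on boundary == "start"
# or when the document type changes.
def group_into_sections(pages):
    """pages: list of (doc_type, boundary) tuples, in page order."""
    sections = []
    for page_num, (doc_type, boundary) in enumerate(pages, start=1):
        new_section = (
            not sections
            or boundary == "start"
            or sections[-1]["type"] != doc_type
        )
        if new_section:
            sections.append({"type": doc_type, "pages": [page_num]})
        else:
            sections[-1]["pages"].append(page_num)
    return sections

packet = [
    ("invoice", "start"), ("invoice", "continue"),
    ("letter", "start"), ("letter", "continue"),
    ("invoice", "start"), ("invoice", "continue"),
]
print(group_into_sections(packet))
# → three sections: invoice pages 1-2, letter pages 3-4, invoice pages 5-6
```

Note how the "start" indicator on page 5 is what separates the second invoice from the first, even though both have the same type.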

The multimodal page-level classification supports including surrounding pages as context to improve classification accuracy. This is particularly useful when a single page doesn’t contain enough information to determine its document type or boundary status.

Configuration:

classification:
  classificationMethod: multimodalPageLevelClassification
  contextPagesCount: 1 # Include 1 page before and 1 page after as context
  # contextPagesCount: 0 # Default: no additional context (current behavior)
  # contextPagesCount: 2 # Include 2 pages before and 2 pages after

How It Works:

When contextPagesCount is set to a value greater than 0, the classification prompt includes surrounding pages as additional context:

  • contextPagesCount: 1: Includes 1 page before and 1 page after the target page
  • contextPagesCount: 2: Includes 2 pages before and 2 pages after the target page
  • Edge handling: At document boundaries, only available pages are included (e.g., first page has no “before” pages)
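
A minimal sketch of the window arithmetic, assuming 0-based page indexes (the function name is illustrative, not part of the solution's API):

```python
# Illustrative sketch (hypothetical helper, not the solution's API):
# select context pages around a target page, clamped to the document.
def context_window(page_index, total_pages, context_pages_count):
    """Return (before, after) lists of 0-based page indexes."""
    before = list(range(max(0, page_index - context_pages_count), page_index))
    after = list(range(page_index + 1,
                       min(total_pages, page_index + 1 + context_pages_count)))
    return before, after

print(context_window(0, 6, 1))  # first page: no "before" pages → ([], [1])
print(context_window(3, 6, 2))  # middle page → ([1, 2], [4, 5])
```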

Enhanced Prompt Structure:

The system replaces the standard {DOCUMENT_TEXT} and {DOCUMENT_IMAGE} placeholders with context-aware versions that clearly separate context pages from the page being classified:

Text Context Structure:

For context, here is the OCR text for the page(s) immediately prior to the page you should classify:
<context-pages-before>
[OCR text from all context pages before - combined if multiple pages]
</context-pages-before>
Here is the OCR text for the page to classify:
<current-page>
[OCR text for the page being classified]
</current-page>
For context, here is the OCR text for the page(s) immediately after the page you should classify:
<context-pages-after>
[OCR text from all context pages after - combined if multiple pages]
</context-pages-after>

Image Context Structure:

For context, here are the image(s) for the page(s) immediately prior to the page you should classify:
[Image 1 - context page before]
[Image 2 - context page before (if contextPagesCount >= 2)]
Here is the image for the page to classify:
[Image - current page being classified]
For context, here are the image(s) for the page(s) immediately after the page you should classify:
[Image 1 - context page after]
[Image 2 - context page after (if contextPagesCount >= 2)]

Note: Context pages are combined within their respective sections (before or after). The structure uses descriptive text labels and XML tags (<context-pages-before>, <current-page>, <context-pages-after>) to clearly indicate which content is for context versus which content should be classified.
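
Assembling that structure can be sketched as follows; the tag names and label text come from the documentation above, while the function itself and its assembly logic are illustrative assumptions:

```python
# Illustrative assembly of the context-aware text structure (an assumption,
# not the solution's code): context sections are emitted only when the
# corresponding pages exist, so edge pages get a shorter prompt.
def build_text_context(before_texts, current_text, after_texts):
    parts = []
    if before_texts:
        parts.append(
            "For context, here is the OCR text for the page(s) immediately "
            "prior to the page you should classify:\n"
            "<context-pages-before>\n"
            + "\n".join(before_texts)
            + "\n</context-pages-before>"
        )
    parts.append(
        "Here is the OCR text for the page to classify:\n"
        "<current-page>\n" + current_text + "\n</current-page>"
    )
    if after_texts:
        parts.append(
            "For context, here is the OCR text for the page(s) immediately "
            "after the page you should classify:\n"
            "<context-pages-after>\n"
            + "\n".join(after_texts)
            + "\n</context-pages-after>"
        )
    return "\n".join(parts)

# First page of a document: no "before" context is emitted
prompt = build_text_context([], "Page 1 OCR text", ["Page 2 OCR text"])
```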

Benefits:

  • Improved Boundary Detection: Context helps the LLM identify document transitions
  • Better Classification Accuracy: Surrounding pages provide additional clues
  • Handles Ambiguous Pages: Pages that look similar can be distinguished by context
  • Flexible Configuration: Adjust context size based on document complexity

Use Cases:

  • Documents where headers/footers span multiple pages
  • Multi-page forms where individual pages look similar
  • Document packages with varying page layouts
  • Cases where LLM boundary detection has been unreliable

Considerations:

  • Increases token usage proportionally to the number of context pages
  • May increase latency due to larger prompts
  • Works best when surrounding pages provide meaningful classification hints

Configuration for Boundary Detection:

The boundary detection is automatically included in the classification results. No special configuration is needed - the system will populate the document_boundary field in the metadata for each page:

{
  "page_id": "1",
  "classification": {
    "doc_type": "invoice",
    "confidence": 0.95,
    "metadata": {
      "document_boundary": "start" // New document begins
    }
  }
}
Text-Based Holistic Classification

  • Analyzes entire document packets to identify logical boundaries
  • Identifies distinct document segments within multi-page documents
  • Determines document type for each segment
  • Better suited for multi-document packets where context spans multiple pages
  • Deployed when you select the default pipeline mode configuration during stack deployment or update

The default configuration in config_library/unified/default/config.yaml implements this approach with a task prompt that instructs the model to:

  1. Read through the entire document package to understand its contents
  2. Identify page ranges that form complete, distinct documents
  3. Match each document segment to one of the defined document types
  4. Record the start and end pages for each identified segment

Example configuration:

classification:
  classificationMethod: textbasedHolisticClassification
  model: us.amazon.nova-pro-v1:0
  task_prompt: >-
    <task-description>
    You are a document classification system. Your task is to analyze a document package
    containing multiple pages and identify distinct document segments, classifying each
    segment according to the predefined document types provided below.
    </task-description>

    <document-types>
    {CLASS_NAMES_AND_DESCRIPTIONS}
    </document-types>

    <document-boundary-rules>
    Rules for determining document boundaries:
    - Content continuity: Pages with continuing paragraphs, numbered sections, or ongoing narratives belong to the same document
    - Visual consistency: Similar layouts, headers, footers, and styling indicate pages belong together
    - Logical structure: Documents typically have clear beginning, middle, and end sections
    - New document indicators: Title pages, cover sheets, or significantly different subject matter signal a new document
    </document-boundary-rules>

    <<CACHEPOINT>>

    <document-text>
    {DOCUMENT_TEXT}
    </document-text>

Limitations of Text-Based Holistic Classification

Despite its strengths in handling full-document context, this method has several limitations:

Context & Model Constraints:

  • Long documents can exceed the context window of smaller models, resulting in request failure.
  • Lengthy inputs may dilute the model’s focus, leading to inaccurate or inconsistent classifications.
  • Requires high-context models such as Amazon Nova Premier, which supports up to 1 million tokens. Smaller models are not suitable for this method.
  • For more details on supported models and their context limits, refer to the Amazon Bedrock Supported Models documentation.

Scalability Challenges: Not ideal for very large or visually complex document sets. In such cases, the Multi-Modal Page-Level Classification method is more appropriate.

Pattern 3: UDOP-Based Classification

  • Classification is performed by a pre-trained UDOP (Unified Document Processing) model
  • Model is deployed on Amazon SageMaker
  • Performs multi-modal page-level classification (classifies each page based on OCR data and page image)
  • Not configurable inside the GenAIIDP solution

The sectionSplitting configuration controls how classified pages are grouped into document sections. This setting works with both classification methods and provides three strategies:

1. disabled - No Splitting (Entire Document = One Section)


Behavior:

  • All pages are assigned to a single section
  • Uses majority voting to determine the document class (most common classification wins)
  • Excludes unclassifiable/blank pages from voting to prevent them from affecting the result
  • If there’s a tie, uses the first page’s classification for determinism
  • Ignores any page-level classification boundaries

Use Cases:

  • Documents known to be single-type with no internal divisions
  • Simplified processing where granular section splitting isn’t needed
  • When you want to force all pages to be treated as one cohesive document
  • Documents with occasional blank or unclassifiable pages (these won’t affect the final classification)

Configuration Example:

classification:
  sectionSplitting: disabled
  classificationMethod: multimodalPageLevelClassification

Result:

  • Document with 10 pages → 1 section containing all 10 pages
  • All pages assigned the most common (voted) class

Voting Behavior:

The disabled strategy uses majority voting to determine the document classification, which provides robust handling of edge cases:

  1. Config-Driven Voting: Only pages whose classification matches a valid document type defined in your configuration are eligible to vote. This automatically excludes:

    • Blank pages (unclassifiable_blank_page, blank, etc.)
    • Error states (error (backoff/retry), unclassified)
    • LLM hallucinations or typos that don’t match any defined class
  2. Majority Wins: The classification that appears most frequently among votable pages becomes the document classification.

  3. Tie-Breaking: If multiple classifications have the same count, the classification from the earliest page (by page number) is used for determinism.

  4. Fallback: If no pages have valid classifications (all are unclassifiable types), the first page’s classification is used.

Example:

6-page document with classifications:
- Page 1: DRILLING_PLAN_GEOLOGIC
- Page 2: DRILLING_PLAN_GEOLOGIC
- Page 3: DRILLING_PLAN_GEOLOGIC
- Page 4: DRILLING_PLAN_GEOLOGIC
- Page 5: DRILLING_PLAN_GEOLOGIC
- Page 6: unclassifiable_blank_page (excluded from voting)
Voting result: DRILLING_PLAN_GEOLOGIC (5 votes)
→ Entire document classified as DRILLING_PLAN_GEOLOGIC
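
The voting rules above can be sketched in a few lines; this is an illustrative model of the behavior, not the solution's actual implementation:

```python
from collections import Counter

# Illustrative sketch of config-driven majority voting (not the solution's
# actual code): only classifications matching a configured class may vote;
# ties break on the earliest page; fallback is the first page's class.
def vote_document_class(page_classes, valid_classes):
    votable = [c for c in page_classes if c in valid_classes]
    if not votable:
        return page_classes[0]  # fallback: first page's classification
    counts = Counter(votable)
    best = max(counts.values())
    # earliest page among the tied top classifications wins (determinism)
    for c in page_classes:
        if counts.get(c) == best:
            return c

pages = ["DRILLING_PLAN_GEOLOGIC"] * 5 + ["unclassifiable_blank_page"]
print(vote_document_class(pages, {"DRILLING_PLAN_GEOLOGIC", "invoice"}))
# → DRILLING_PLAN_GEOLOGIC (the blank page is excluded from voting)
```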

GitHub Issue Reference: This voting behavior addresses Issue #167 where documents with blank last pages were incorrectly classified as the blank page type.

2. page - Per-Page Splitting (Each Page = Own Section)


Behavior:

  • Every page becomes an independent section
  • Each page keeps its individually classified document type
  • Prevents automatic joining of same-type documents

Use Cases:

  • Critical for long documents with multiple same-type forms (e.g., multiple W-2 forms, multiple invoices)
  • When LLM boundary detection is unreliable or fails frequently
  • Government form processing where each form must be processed independently
  • Scenarios where deterministic splitting is required

Configuration Example:

classification:
  sectionSplitting: page
  classificationMethod: multimodalPageLevelClassification

Result:

  • Document with 10 pages → 10 sections (one per page)
  • Each page maintains its individual classification

GitHub Issue Reference: This strategy directly addresses Issue #146 where long documents with multiple same-type forms were being incorrectly joined together.

3. llm_determined - LLM Boundary Detection (Default)


Behavior:

  • Uses “Start”/“Continue” boundary indicators from LLM responses
  • Automatically groups related pages into logical sections
  • Implements BIO-like tagging for sophisticated document segmentation

Use Cases:

  • Complex multi-document packets requiring intelligent boundary detection
  • When LLM boundary detection works reliably
  • Default behavior that works well for most use cases

Configuration Example:

classification:
  sectionSplitting: llm_determined # This is the default
  classificationMethod: multimodalPageLevelClassification

Result:

  • Document with 10 pages → Variable number of sections based on LLM boundary detection
  • Pages grouped according to document boundaries and type changes

Strategy       | Sections Created     | Boundary Detection | Same-Type Handling | Deterministic | Performance
---------------|----------------------|--------------------|--------------------|---------------|------------
disabled       | 1 section always     | None               | All joined         | Yes           | Fastest
page           | N sections (N pages) | Per-page           | Never joined       | Yes           | Fast
llm_determined | Variable             | LLM boundaries     | May join           | No            | Standard

The sectionSplitting setting is placed in the classification configuration section:

classification:
  model: us.amazon.nova-pro-v1:0
  classificationMethod: multimodalPageLevelClassification
  sectionSplitting: page # Options: disabled, page, llm_determined
  maxPagesForClassification: "ALL"
  temperature: "0.0"
  # ... other classification settings

The sectionSplitting setting works with both classification methods:

With multimodalPageLevelClassification:

  • disabled: The majority-voted class applies to all pages in one section
  • page: Each page’s individual classification preserved in separate sections
  • llm_determined: Pages grouped by class + boundary metadata

With textbasedHolisticClassification:

  • disabled: First segment’s class applies to all pages in one section
  • page: Each page gets its own section with the class assigned by holistic method
  • llm_determined: LLM-determined segments used as sections (default behavior)

Consider a 6-page document containing three W-2 forms (2 pages each):

With sectionSplitting: llm_determined (may work or may fail):

Result depends on LLM boundary detection accuracy
Best case: 3 sections (one per W-2)
Worst case: 1 section (all W-2s incorrectly joined)

With sectionSplitting: page (deterministic solution):

Page 1 → Section 1 (W-2)
Page 2 → Section 2 (W-2)
Page 3 → Section 3 (W-2)
Page 4 → Section 4 (W-2)
Page 5 → Section 5 (W-2)
Page 6 → Section 6 (W-2)
Result: 6 independent sections
Each W-2 page processed separately
No risk of incorrect joining

With sectionSplitting: disabled (simplest case):

All 6 pages → Section 1 (W-2)
Result: Single section
Entire document treated as one unit

When deciding between Text-Based Holistic Classification and MultiModal Page-Level Classification with Sequence Segmentation, consider these factors:

Use Text-Based Holistic Classification When:

  • Documents have clear logical boundaries based on content
  • Text context spans multiple pages and requires understanding the full document
  • You have access to high-context models (e.g., Amazon Nova Premier)
  • Document packets are relatively small (within model context limits)
  • Visual elements are less important than textual continuity

Use MultiModal Page-Level Classification with Sequence Segmentation When:

  • Document packets contain multiple documents of the same type (e.g., multiple invoices)
  • Visual layout and image content are important for classification
  • You need to process very large document packets that might exceed context limits
  • Documents have clear visual boundaries (headers, footers, different layouts)
  • You want to leverage both text and image information for better accuracy
  • Processing speed is important (parallel page processing is possible)

Feature                | Text-Based Holistic   | MultiModal Page-Level with Sequence Segmentation
-----------------------|-----------------------|-------------------------------------------------
Context Awareness      | Full document context | Page-level with boundary detection
Multi-document Packets | Good                  | Excellent (handles same-type documents)
Visual Processing      | Text only             | Text + Images
Model Requirements     | High-context models   | Standard models
Processing Speed       | Sequential            | Can be parallelized
Boundary Detection     | Content-based         | BIO-like tagging
Large Documents        | Limited by context    | No practical limit

Control how many pages are used for classification:

classification:
  maxPagesForClassification: "ALL" # Default: use all pages
  # Or: "1", "2", "3", etc. - use only first N pages

Important: When set to a number (e.g., "3"), only the first N pages are classified, but the result is applied to ALL pages in the document. This forces the entire document to be assigned a single class with one section.
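
A sketch of that behavior, assuming the winning class among the sampled pages is applied document-wide (the helper names are hypothetical, and the use of a majority over the sampled pages is an assumption for this sketch):

```python
# Illustrative sketch (hypothetical names, not the solution's code) of
# maxPagesForClassification: only the first N pages are classified, and
# the winning class is then applied to every page in the document.
def classify_with_page_limit(pages, max_pages, classify_fn):
    limit = len(pages) if max_pages == "ALL" else int(max_pages)
    sampled = [classify_fn(page) for page in pages[:limit]]
    doc_class = max(set(sampled), key=sampled.count)  # most common sampled class
    return {page: doc_class for page in pages}        # one class, one section

doc = ["page1", "page2", "page3", "page4"]
print(classify_with_page_limit(doc, "2", lambda p: "invoice"))
```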

In Pattern 2, you can customize classification behavior through various prompt components:

Define overall model behavior and constraints:

system_prompt: |
  You are an expert document classifier specializing in financial and business documents.
  Your task is to analyze document images and classify them into predefined categories.
  Focus on visual layout, textual content, and common patterns found in each document type.
  When in doubt, analyze the most prominent features like headers, logos, and form fields.

Specify classification instructions and formatting:

task_prompt: |
  Analyze the following document page and classify it into one of these categories:
  {{document_classes}}
  Return ONLY the document class name without additional explanations.
  If the document doesn't fit any of the provided classes, classify it as "other".

Provide detailed descriptions for each document category:

document_classes:
  invoice:
    description: "A commercial document issued by a seller to a buyer, related to a sale transaction and indicating the products, quantities, and agreed prices for products or services."
  receipt:
    description: "A document acknowledging that something of value has been received, often as proof of payment."
  bank_statement:
    description: "A document issued by a bank showing transactions and balances for a specific account over a defined period."

The solution integrates with Amazon Bedrock CachePoint for improved performance:

  • Caches frequently used prompts and responses
  • Reduces latency for similar classification requests
  • Optimizes costs through response reuse
  • Automatic cache management and expiration

CachePoint is particularly beneficial with few-shot examples, as these can add significant token count to prompts. The <<CACHEPOINT>> delimiter in prompt templates separates:

  • Static portion (before CACHEPOINT): Class definitions, few-shot examples, instructions
  • Dynamic portion (after CACHEPOINT): The specific document being processed

This approach allows the static portion to be cached and reused across multiple document processing requests, while only the dynamic portion varies per document, significantly reducing costs and improving performance.
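
The split can be modeled as a simple partition on the delimiter (an illustrative sketch of the caching contract implied by `<<CACHEPOINT>>`, not Bedrock's actual API):

```python
# Illustrative sketch: everything before <<CACHEPOINT>> is a static,
# cacheable prefix; everything after varies per document. (An assumption
# about behavior, not Bedrock's API.)
def split_at_cachepoint(template):
    static, sep, dynamic = template.partition("<<CACHEPOINT>>")
    return static, (dynamic if sep else "")

template = (
    "Classify into: {CLASS_NAMES_AND_DESCRIPTIONS}\n"
    "<<CACHEPOINT>>\n"
    "<document-text>{DOCUMENT_TEXT}</document-text>"
)
static, dynamic = split_at_cachepoint(template)
# static holds class definitions and examples (reused across requests);
# dynamic holds the per-document text (changes on every request)
```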

Example task prompt with CachePoint for few-shot examples:

classification:
  task_prompt: |
    Classify this document into exactly one of these categories:
    {CLASS_NAMES_AND_DESCRIPTIONS}

    <few_shot_examples>
    {FEW_SHOT_EXAMPLES}
    </few_shot_examples>

    <<CACHEPOINT>>

    <document_content>
    {DOCUMENT_TEXT}
    </document_content>

The solution includes standard document classes based on the RVL-CDIP dataset:

  • letter: Formal written correspondence
  • form: Structured documents with fields
  • email: Digital messages with headers
  • handwritten: Documents with handwritten content
  • advertisement: Marketing materials
  • scientific_report: Research documents
  • scientific_publication: Academic papers
  • specification: Technical specifications
  • file_folder: Organizational documents
  • news_article: Journalistic content
  • budget: Financial planning documents
  • invoice: Commercial billing documents
  • presentation: Slide-based documents
  • questionnaire: Survey forms
  • resume: Employment documents
  • memo: Internal communications

You can define custom document classes through the Web UI configuration:

  1. Navigate to the Configuration section
  2. Select the Document Classes tab
  3. Click “Add New Class”
  4. Provide:
    • Class name (machine-readable identifier)
    • Display name (human-readable name)
    • Detailed description (to guide the classification model)
  5. Save changes

Image Placement with {DOCUMENT_IMAGE} Placeholder


Pattern 2 supports precise control over where document images are positioned within your classification prompts using the {DOCUMENT_IMAGE} placeholder. This feature allows you to specify exactly where images should appear in your prompt template, rather than having them automatically appended at the end.

Without Placeholder (Default Behavior):

classification:
  task_prompt: |
    Analyze this document:
    {DOCUMENT_TEXT}
    Classify it as one of: {CLASS_NAMES_AND_DESCRIPTIONS}

Images are automatically appended after the text content.

With Placeholder (Controlled Placement):

classification:
  task_prompt: |
    Analyze this document:
    {DOCUMENT_IMAGE}
    Text content: {DOCUMENT_TEXT}
    Classify it as one of: {CLASS_NAMES_AND_DESCRIPTIONS}

Images are inserted exactly where {DOCUMENT_IMAGE} appears in the prompt.
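
The placement rule can be sketched as follows, using a simplified content-part list (the part structure is illustrative, not the exact Bedrock message format, and the function is a hypothetical helper):

```python
# Illustrative sketch (assumed behavior, not the solution's code) of
# {DOCUMENT_IMAGE} placement: split the prompt at the placeholder and
# insert image parts between the text parts; without it, append images.
def build_content(task_prompt, text, images):
    prompt = task_prompt.replace("{DOCUMENT_TEXT}", text)
    if "{DOCUMENT_IMAGE}" in prompt:
        before, _, after = prompt.partition("{DOCUMENT_IMAGE}")
        parts = [{"text": before}]
        parts += [{"image": img} for img in images]
        parts.append({"text": after})
    else:
        # default behavior: images appended after the text content
        parts = [{"text": prompt}] + [{"image": img} for img in images]
    return parts

print(build_content("A {DOCUMENT_IMAGE} B {DOCUMENT_TEXT}", "txt", ["img1"]))
```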

Image Before Text Analysis:

task_prompt: |
  Look at this document image first:
  {DOCUMENT_IMAGE}
  Now read the extracted text:
  {DOCUMENT_TEXT}
  Based on both the visual layout and text content, classify this document as one of:
  {CLASS_NAMES_AND_DESCRIPTIONS}

Image in the Middle for Context:

task_prompt: |
  You are classifying business documents. Here are the possible types:
  {CLASS_NAMES_AND_DESCRIPTIONS}
  Examine this document image:
  {DOCUMENT_IMAGE}
  Additional text content extracted from the document:
  {DOCUMENT_TEXT}
  Classification:

The {DOCUMENT_IMAGE} placeholder works seamlessly with few-shot examples:

classification:
  task_prompt: |
    Here are examples of each document type:
    {FEW_SHOT_EXAMPLES}
    Now classify this new document:
    {DOCUMENT_IMAGE}
    Text: {DOCUMENT_TEXT}
    Classification: {CLASS_NAMES_AND_DESCRIPTIONS}

  • 🎯 Contextual Placement: Position images where they provide maximum context
  • 📱 Better Multimodal Understanding: Help models correlate visual and textual information
  • 🔄 Flexible Prompt Design: Create prompts that flow naturally between different content types
  • ⚡ Improved Performance: Strategic image placement can improve classification accuracy
  • 🔒 Backward Compatible: Existing prompts without the placeholder continue to work unchanged

For documents with multiple pages, the system provides comprehensive image support:

  • No Image Limits: All document pages are processed, following the Bedrock API's removal of image count restrictions
  • Info Logging: System logs image counts for monitoring and debugging purposes
  • Automatic Pagination: Images are processed in page order for all pages

Pattern 2’s multimodal page-level classification supports few-shot example prompting, which can significantly improve classification accuracy by providing concrete document examples. This feature is available when you select the ‘few_shot_example_with_multimodal_page_classification’ configuration.

  • 🎯 Improved Accuracy: Models understand document patterns better through concrete examples
  • 📏 Consistent Output: Examples establish exact structure and formatting standards
  • 🚫 Reduced Hallucination: Examples reduce likelihood of made-up classifications
  • 🔧 Domain Adaptation: Examples help models understand domain-specific terminology
  • 💰 Cost Effectiveness with Caching: Using prompt caching with few-shot examples significantly reduces costs

In Pattern 2, few-shot examples are configured within document class definitions using JSON Schema format:

classes:
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: Letter
    x-aws-idp-document-type: Letter
    type: object
    description: "A formal written correspondence..."
    properties:
      SenderName:
        type: string
        description: "The name of the person who wrote the letter..."
    x-aws-idp-examples:
      - x-aws-idp-class-prompt: "This is an example of the class 'Letter'"
        name: "Letter1"
        x-aws-idp-image-path: "config_library/unified/your_config/example-images/letter1.jpg"
      - x-aws-idp-class-prompt: "This is an example of the class 'Letter'"
        name: "Letter2"
        x-aws-idp-image-path: "config_library/unified/your_config/example-images/letter2.png"

The x-aws-idp-image-path field supports multiple formats:

  • Single Image File: "config_library/unified/examples/letter1.jpg"
  • Local Directory with Multiple Images: "config_library/unified/examples/letters/"
  • S3 Prefix with Multiple Images: "s3://my-config-bucket/examples/letter/"
  • Direct S3 Image URI: "s3://my-config-bucket/examples/letter1.jpg"

For comprehensive details on configuring few-shot examples, including multimodal vs. text-only approaches, example management, and advanced features, refer to the few-shot-examples.md documentation.

The classification service supports configurable image dimensions for optimal performance and quality:

New Default Behavior (Preserves Original Resolution)


Important Change: Empty strings or unspecified image dimensions now preserve the original document resolution for maximum classification accuracy:

classification:
  model: us.amazon.nova-pro-v1:0
  # Image processing settings - preserves original resolution
  image:
    target_width: "" # Empty string = no resizing (recommended)
    target_height: "" # Empty string = no resizing (recommended)

Configure specific dimensions when performance optimization is needed:

# For high-accuracy classification with controlled dimensions
classification:
  image:
    target_width: "1200" # Resize to 1200 pixels wide
    target_height: "1600" # Resize to 1600 pixels tall

# For fast processing with lower resolution
classification:
  image:
    target_width: "600" # Smaller for faster processing
    target_height: "800" # Maintains reasonable quality

  • Original Resolution Preservation: Empty strings preserve full document resolution for maximum accuracy
  • Aspect Ratio Preservation: Images are resized proportionally without distortion when dimensions are specified
  • Smart Scaling: Only downsizes images when necessary (scale factor < 1.0)
  • High-Quality Resampling: Better visual quality after resizing
  • Performance Optimization: Configurable dimensions allow balancing accuracy vs. speed
  • Maximum Classification Accuracy: Empty strings preserve full document resolution for best results
  • Service-Specific Tuning: Each service can use optimal image dimensions
  • Runtime Configuration: No code changes needed to adjust image processing
  • Backward Compatibility: Existing numeric values continue to work as before
  • Memory Optimization: Configurable dimensions allow resource optimization
  • Better Resource Utilization: Choose between accuracy (original resolution) and performance (smaller dimensions)
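
The resizing rules above (aspect-ratio preservation, downscale-only, empty strings pass through) can be sketched as pure dimension arithmetic; this is an assumed model of the behavior, not the solution's code:

```python
# Illustrative sketch of aspect-ratio-preserving, downscale-only resizing
# (assumed behavior, not the solution's code). Empty target values mean
# "no resizing": the original resolution is preserved.
def resize_dimensions(width, height, target_width, target_height):
    if not target_width or not target_height:
        return width, height  # preserve original resolution
    scale = min(int(target_width) / width, int(target_height) / height)
    if scale >= 1.0:
        return width, height  # smart scaling: never upscale
    return round(width * scale), round(height * scale)

print(resize_dimensions(2550, 3300, "1200", "1600"))  # downscaled proportionally
print(resize_dimensions(800, 600, "", ""))            # untouched: (800, 600)
```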

Previous Behavior: Empty strings defaulted to 951x1268 pixel resizing
New Behavior: Empty strings preserve original image resolution

If you were relying on the previous default resizing behavior, explicitly set dimensions:

# To maintain previous default behavior
classification:
  image:
    target_width: "951"
    target_height: "1268"

  1. Use Empty Strings for High Accuracy: For critical document classification, use empty strings to preserve original resolution
  2. Consider Document Types: Complex layouts benefit from higher resolution, simple text documents may work well with smaller dimensions
  3. Test Performance Impact: Higher resolution images provide better accuracy but consume more resources
  4. Monitor Processing Time: Balance classification accuracy with processing speed based on your requirements

The classification service supports both JSON and YAML output formats from LLM responses, with automatic format detection and parsing:

The system automatically detects whether the LLM response is in JSON or YAML format:

# JSON response (traditional)
classification:
  task_prompt: |
    Classify this document and respond with JSON:
    {"class": "invoice", "confidence": 0.95}

# YAML response (more token-efficient)
classification:
  task_prompt: |
    Classify this document and respond with YAML:
    class: invoice
    confidence: 0.95

YAML format provides significant token savings:

  • 10-30% fewer tokens than equivalent JSON
  • No quotes required around keys
  • More compact syntax for nested structures
  • Natural support for multiline content

JSON-focused prompt:

classification:
  system_prompt: |
    You are a document classifier. Respond only with JSON format.
  task_prompt: |
    Classify this document and return a JSON object with the class name and confidence score.

YAML-focused prompt:

classification:
  system_prompt: |
    You are a document classifier. Respond only with YAML format.
  task_prompt: |
    Classify this document and return YAML with the class name and confidence score.

  • All existing JSON-based prompts continue to work unchanged
  • The system automatically detects and parses both formats
  • No configuration changes required for existing deployments
  • Intelligent fallback between formats if parsing fails

The classification service uses the new extract_structured_data_from_text() function which:

  • Automatically detects JSON vs YAML format
  • Provides robust parsing with multiple extraction strategies
  • Handles malformed content gracefully
  • Returns both parsed data and detected format for logging
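
A simplified sketch of such detection logic follows; the real extract_structured_data_from_text() may differ, and the YAML branch here only handles flat "key: value" lines to stay dependency-free:

```python
import json

# Hypothetical sketch of JSON-vs-YAML detection with fallback (illustrative,
# not the solution's implementation): try strict JSON first, then fall back
# to parsing flat "key: value" lines as minimal YAML.
def extract_structured(text):
    text = text.strip()
    try:
        return json.loads(text), "json"  # JSON parses strictly
    except ValueError:
        pass
    data = {}
    for line in text.splitlines():
        if ":" not in line:
            return None, None            # not parseable as flat YAML
        key, _, value = line.partition(":")
        data[key.strip()] = value.strip()
    return (data, "yaml") if data else (None, None)

print(extract_structured('{"class": "invoice", "confidence": 0.95}'))
print(extract_structured("class: invoice\nconfidence: 0.95"))
```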

Regex-Based Classification for Performance Optimization


Pattern 2 now supports optional regex-based classification that can provide significant performance improvements and cost savings by bypassing LLM calls when document patterns are recognized.

Document Name Regex (All Pages Same Class)

When you want all pages of a document to be classified as the same class, you can use document name regex to instantly classify entire documents based on their filename or ID:

```yaml
classes:
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: Payslip
    x-aws-idp-document-type: Payslip
    type: object
    description: "Employee wage statement showing earnings and deductions"
    x-aws-idp-document-name-regex: "(?i).*(payslip|paystub|salary|wage).*"
    properties:
      EmployeeName:
        type: string
        description: "Name of the employee"
```

Benefits:

  • Instant Classification: Entire document classified without any LLM calls
  • Massive Performance Gains: ~100-1000x faster than LLM classification
  • Zero Token Usage: Complete elimination of API costs for matched documents
  • Deterministic Results: Consistent classification for known patterns

When document ID matches the pattern:

  • All pages are immediately classified as the matching class
  • Single section is created containing all pages
  • No backend service calls are made
  • Info logging confirms regex match
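
The matching logic amounts to a single pattern check against the document ID. A minimal sketch (the function name and signature here are illustrative, not the solution's API):

```python
import re

# Pattern from the Payslip class definition above.
DOC_NAME_REGEX = r"(?i).*(payslip|paystub|salary|wage).*"

def classify_by_name(document_id, pattern=DOC_NAME_REGEX, doc_class="Payslip"):
    """Return the class if the document ID matches, else None (LLM fallback)."""
    if re.match(pattern, document_id):
        return doc_class  # every page gets this class; no LLM call is made
    return None

print(classify_by_name("2024-03_employee_payslip.pdf"))  # matches -> Payslip
print(classify_by_name("quarterly_report.pdf"))          # no match -> LLM path
```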

Page Content Regex (Multi-Modal Page-Level Classification)

For multi-class configurations using page-level classification, you can use page content regex to classify individual pages based on text patterns:

```yaml
classification:
  classificationMethod: multimodalPageLevelClassification

classes:
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: Invoice
    x-aws-idp-document-type: Invoice
    type: object
    description: "Business invoice document"
    x-aws-idp-document-page-content-regex: "(?i)(invoice\\s+number|bill\\s+to|amount\\s+due)"
    properties:
      InvoiceNumber:
        type: string
        description: "Invoice number"
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: Payslip
    x-aws-idp-document-type: Payslip
    type: object
    description: "Employee wage statement"
    x-aws-idp-document-page-content-regex: "(?i)(gross\\s+pay|net\\s+pay|employee\\s+id)"
    properties:
      EmployeeName:
        type: string
        description: "Employee name"
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: Other
    x-aws-idp-document-type: Other
    type: object
    description: "Documents that don't match specific patterns"
    # No regex - will always use LLM
    properties: {}
```

Benefits:

  • Selective Performance Gains: Pages matching patterns are classified instantly
  • Mixed Processing: Some pages use regex, others fall back to LLM
  • Cost Optimization: Reduced token usage proportional to regex matches
  • Maintained Accuracy: LLM fallback ensures all pages are properly classified

How it works:

  • Each page’s text content is checked against all class regex patterns
  • First matching pattern wins and classifies the page instantly
  • Pages with no matches use standard LLM classification
  • Results are seamlessly integrated into document sections
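
The per-page flow above can be sketched as a first-match-wins scan over the configured patterns (class names and sample text mirror the configuration example; the `"llm"` branch stands in for the real backend call):

```python
import re

# (class name, page-content regex) in configuration order; None = LLM only.
CLASS_PATTERNS = [
    ("Invoice", r"(?i)(invoice\s+number|bill\s+to|amount\s+due)"),
    ("Payslip", r"(?i)(gross\s+pay|net\s+pay|employee\s+id)"),
    ("Other", None),
]

def classify_page(page_text):
    """First matching pattern wins; unmatched pages fall back to the LLM."""
    for doc_class, pattern in CLASS_PATTERNS:
        if pattern and re.search(pattern, page_text):
            return doc_class, "regex"
    return None, "llm"  # placeholder for the LLM classification path

print(classify_page("Invoice Number: 1234\nBill To: ACME Corp"))
print(classify_page("Gross Pay: $5,000  Net Pay: $3,800"))
print(classify_page("Meeting notes from Tuesday"))
```

Note that pattern order matters: a page containing text that matches several classes is assigned to the first one in the list.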
Regex pattern tips:

  1. Case-Insensitive Matching: Always use the (?i) flag

     (?i).*(invoice|bill).* # Matches any case variation

  2. Flexible Whitespace: Use \\s+ for varying spaces/tabs

     (?i)(gross\\s+pay|net\\s+pay) # Handles "gross pay", "gross  pay", and tabs

  3. Multiple Alternatives: Use | for different terms

     (?i).*(payslip|paystub|salary|wage).* # Any of these terms

  4. Balanced Specificity: Make patterns specific enough to avoid false matches

     # Good: Specific to W2 forms
     (?i)(form\\s+w-?2|wage\\s+and\\s+tax|employer\\s+identification)
     # Too broad: Could match many documents
     (?i)(form|wage|tax)
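
The specificity trade-off can be checked directly. In this sketch (sample page text is illustrative), the overly broad pattern fires on an unrelated bank-letter page while the W2-specific pattern does not:

```python
import re

SPECIFIC = r"(?i)(form\s+w-?2|wage\s+and\s+tax|employer\s+identification)"
TOO_BROAD = r"(?i)(form|wage|tax)"

bank_page = "Complete this form to dispute a tax charge on your statement."
w2_page = "Form W-2 Wage and Tax Statement"

print(bool(re.search(SPECIFIC, bank_page)))   # no false positive
print(bool(re.search(TOO_BROAD, bank_page)))  # would misclassify this page
print(bool(re.search(SPECIFIC, w2_page)))     # still catches real W2 pages
```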

Use notebooks/examples/step2_classification_with_regex.ipynb to:

  • Test regex patterns against your documents
  • Compare processing speeds (regex vs LLM)
  • Analyze cost savings through token usage reduction
  • Validate classification accuracy
  • Debug pattern matching behavior

The regex system includes robust error handling:

  • Invalid Patterns: Compilation errors are logged, system falls back to LLM
  • Runtime Failures: Pattern matching errors default to LLM classification
  • Graceful Degradation: Service continues working with invalid regex
  • Comprehensive Logging: Detailed logs for debugging pattern issues
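
The fallback behavior can be approximated as a guarded compile step (a sketch of the idea, not the solution's actual code): an invalid configured pattern is logged and returns `None`, so the caller routes those documents to LLM classification instead of crashing.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("classification")

def compile_or_none(pattern):
    """Compile a configured regex; log and return None on failure so the
    caller falls back to LLM classification instead of raising."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        log.warning("Invalid regex %r (%s); falling back to LLM", pattern, exc)
        return None

ok = compile_or_none(r"(?i)(invoice\s+number)")
bad = compile_or_none(r"(?i)(unclosed")  # invalid: unbalanced parenthesis
print(ok is not None, bad is None)
```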

Common Document Types:

```yaml
classes:
  # W2 Tax Forms
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: W2
    x-aws-idp-document-type: W2
    type: object
    description: "W2 Tax Form"
    x-aws-idp-document-page-content-regex: "(?i)(form\\s+w-?2|wage\\s+and\\s+tax|social\\s+security)"
    properties: {}
  # Bank Statements
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: Bank-Statement
    x-aws-idp-document-type: Bank-Statement
    type: object
    description: "Bank Statement"
    x-aws-idp-document-page-content-regex: "(?i)(account\\s+number|statement\\s+period|beginning\\s+balance)"
    properties: {}
  # Driver Licenses
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: US-drivers-licenses
    x-aws-idp-document-type: US-drivers-licenses
    type: object
    description: "US Driver's License"
    x-aws-idp-document-page-content-regex: "(?i)(driver\\s+license|state\\s+id|date\\s+of\\s+birth)"
    properties: {}
  # Invoices
  - $schema: "https://json-schema.org/draft/2020-12/schema"
    $id: Invoice
    x-aws-idp-document-type: Invoice
    type: object
    description: "Invoice"
    x-aws-idp-document-page-content-regex: "(?i)(invoice\\s+number|bill\\s+to|remit\\s+payment)"
    properties: {}
```
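
Before deploying a configuration like the one above, it is worth verifying that every configured pattern actually compiles, so a typo fails fast at deploy time rather than silently falling back to the LLM at classification time. A minimal validation sketch (the `classes` list here mirrors a subset of the configuration keys above):

```python
import re

# Mirrors the class definitions above (regex keys only, for validation).
classes = [
    {"$id": "W2", "x-aws-idp-document-page-content-regex":
        r"(?i)(form\s+w-?2|wage\s+and\s+tax|social\s+security)"},
    {"$id": "Bank-Statement", "x-aws-idp-document-page-content-regex":
        r"(?i)(account\s+number|statement\s+period|beginning\s+balance)"},
    {"$id": "Invoice", "x-aws-idp-document-page-content-regex":
        r"(?i)(invoice\s+number|bill\s+to|remit\s+payment)"},
]

for cls in classes:
    pattern = cls.get("x-aws-idp-document-page-content-regex")
    if pattern:
        re.compile(pattern)  # raises re.error on an invalid pattern
        print(f"{cls['$id']}: pattern OK")
```
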
Best practices:

  1. Provide Clear Class Descriptions: Include distinctive features and common elements
  2. Use Few Shot Examples: Include 2-3 diverse examples per class
  3. Choose the Right Method: Use page-level with sequence segmentation for multi-document packets, holistic for context-dependent documents
  4. Balance Class Coverage: Ensure all expected document types have classes
  5. Monitor and Refine: Use the evaluation framework to track classification accuracy
  6. Consider Visual Elements: Describe visual layout and design patterns in class descriptions
  7. Test with Real Documents: Validate classification against actual document samples
  8. Optimize Image Dimensions: Configure appropriate image sizes based on document complexity and processing requirements
  9. Balance Quality vs Performance: Higher resolution images provide better accuracy but consume more resources
  10. Consider Output Format: Use YAML prompts for token efficiency, especially with complex nested responses
  11. Leverage Format Flexibility: Take advantage of automatic format detection to optimize prompts for different use cases
  12. Understand Boundary Indicators: Review the document_boundary metadata to understand how documents are being segmented
  13. Handle Multi-Document Packets: Use sequence segmentation when processing files containing multiple documents of the same type
  14. Test Segmentation Logic: Verify that documents are correctly separated by reviewing section boundaries in the results
  15. Consider Document Flow: Ensure your document classes account for typical document structures (headers, body, footers)
  16. Leverage BIO-like Tagging: Take advantage of the automatic boundary detection to eliminate manual document splitting
  17. Use Regex for Known Patterns: Add regex patterns for document types with predictable content or naming conventions
  18. Test Regex Thoroughly: Validate regex patterns against diverse document samples before production use
  19. Balance Regex Specificity: Make patterns specific enough to avoid false matches but flexible enough to catch variations
  20. Monitor Regex Performance: Track how often regex patterns match vs fall back to LLM classification