Discovery Module
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
The Discovery module is a document analysis system that automatically identifies document structures, field types, and organizational patterns to generate document processing configurations. Discovery works identically in both processing modes of the unified pattern: BDA mode (use_bda: true) and Pipeline mode (use_bda: false). When BDA mode is active, discovery also automates BDA blueprint creation and management.
Demo video (Single Doc Discovery) (4m)
Demo video (Multiple Doc Discovery) (4m)
Table of Contents
- Overview
- Architecture
- Discovery Methods
- Configuration
- Using the Discovery Module
- BDA Integration
- Best Practices
- Troubleshooting
- Limitations
Overview
What is Discovery
The Discovery module analyzes document samples to automatically identify:
- Document Structure: Logical groupings of fields and sections
- Field Types: Data types (string, number, date, etc.) for each field
- Field Descriptions: Contextual information about field purpose and location
- Document Classes: Categorization and naming of document types
- Organizational Patterns: How fields are grouped and related
This analysis produces structured configuration templates that can be used to configure document processing workflows. The discovery process is the same regardless of whether you run in BDA mode or Pipeline mode — the only difference is that BDA mode adds automatic blueprint creation as a downstream step.
Key Features
- 🤖 Automated Analysis: Uses LLMs on Amazon Bedrock to analyze document structure without manual intervention
- 📋 Configuration Generation: Creates ready-to-use configuration templates for document processing
- 🎯 Ground Truth Support: Leverages existing labeled data to improve discovery accuracy
- 📄 Multi-Section Discovery: Discover multiple document classes from a single multi-page PDF package by defining page ranges
- ✨ AI Auto-Detect Sections: Automatically identify document section boundaries using LLM analysis
- 🏷️ Class Name Hints: Pre-label discovered classes from auto-detect or manual labels
- 🔧 Configurable Models: Supports multiple Bedrock models with customizable parameters
- 📝 Custom Prompts: Allows fine-tuning of analysis behavior through prompt engineering
- 🔄 Iterative Refinement: Enables progressive improvement of document understanding
- 🌐 Multi-Format Support: Handles PDF documents and various image formats
- ⚡ Real-Time Processing: Provides immediate feedback through the web interface
- 📊 PDF Page Thumbnails: Visual page preview with color-coded range highlighting in the browser
- 🔗 BDA Blueprint Automation: Automatic BDA blueprint creation when running in BDA mode
- 📦 Multi-Document Collection Discovery: Discover document classes from a collection of documents using embedding-based clustering (local or S3)
Use Cases
New Document Type Onboarding:
- Analyze sample documents to understand structure
- Generate initial processing configuration
- Reduce time-to-production for new document types
Configuration Optimization:
- Improve existing document processing accuracy
- Identify missing fields or incorrect field types
- Refine field descriptions and groupings
Document Understanding:
- Explore unknown document formats
- Understand complex document structures
- Document field relationships and dependencies
Rapid Prototyping:
- Quickly bootstrap new document processing workflows
- Test processing approaches with minimal setup
- Validate document processing concepts
Architecture
Core Components
Discovery Processor Lambda (src/lambda/discovery_processor/index.py):
- Handles discovery jobs from the SQS queue
- Orchestrates document analysis workflow using common services
- Consistent job status management and error reporting
- Triggers configuration updates (including BDA blueprint automation in BDA mode)
Classes Discovery Service (lib/idp_common_pkg/idp_common/discovery/classes_discovery.py):
- Core discovery engine for document analysis and structure identification
- LLM-powered document understanding with configurable Bedrock models
- Generates standardized document class definitions
- Supports both guided (ground truth) and unguided discovery methods
Discovery Panel UI (src/ui/src/components/discovery/DiscoveryPanel.jsx):
- Unified web interface for all discovery operations
- Real-time job status tracking via GraphQL subscriptions
- PDF page thumbnail rendering with color-coded range highlighting
- Configuration export and integration
Discovery Tracking Table (DynamoDB):
- Job status tracking and progress monitoring
- Metadata storage for job information
- Enables real-time UI updates via event coordination
Configuration Table (DynamoDB):
- Discovered classes are stored as “custom” configuration classes
- Shared across both BDA and Pipeline processing modes
BDA Blueprint Automation (BDA mode only):
- BDA Discovery Function (patterns/unified/src/bda_discovery_function/index.py): Processes configuration update events and manages BDA blueprints
- BDA Blueprint Service (lib/idp_common_pkg/idp_common/bda/bda_blueprint_service.py): Blueprint lifecycle management, schema conversion, and project synchronization
- Schema Converter (lib/idp_common_pkg/idp_common/bda/schema_converter.py): Transforms discovery results to BDA-compatible schemas
Processing Flow
    graph TD
        A[Document Upload] --> B[Discovery Job Creation]
        B --> C[SQS Queue]
        C --> D[Discovery Processor Lambda]
        D --> E{Ground Truth Available?}
        E -->|Yes| F[Discovery with Ground Truth]
        E -->|No| G[Discovery without Ground Truth]
        F --> H[Bedrock LLM Analysis]
        G --> H
        H --> I[Structure Extraction]
        I --> J[Configuration Generation]
        J --> K[Configuration Table Update]
        K --> L{BDA Mode?}
        L -->|Yes| M[BDA Blueprint Automation]
        M --> N[Create/Update Blueprints]
        N --> O[Job Completion]
        L -->|No| O
        O --> P[UI Notification]
Integration Points
S3 Integration:
- Document storage and retrieval
- Ground truth file processing
- Result artifact storage
DynamoDB Integration:
- Job tracking and status management
- Configuration storage and retrieval
- Metadata persistence
Bedrock Integration:
- LLM-powered document analysis
- Configurable model selection
- Prompt-based structure extraction
GraphQL/AppSync Integration:
- Real-time job status updates
- UI synchronization and notifications
- Configuration management APIs
Discovery Methods
Discovery Without Ground Truth
This method analyzes documents from scratch without any prior knowledge or labeled examples.
How it Works:
- Document is processed through OCR or direct text extraction
- LLM analyzes the document structure and content
- Fields are identified based on visual layout and text patterns
- Document class and description are generated automatically
- Field groupings and relationships are determined
Best For:
- Completely new document types
- Exploratory analysis of unknown formats
- Initial document understanding
- Rapid prototyping scenarios
Configuration Example:
    discovery:
      without_ground_truth:
        model_id: "us.amazon.nova-pro-v1:0"
        temperature: 1.0
        top_p: 0.1
        max_tokens: 10000
        system_prompt: >-
          You are an expert in processing forms. Analyze forms line by line to identify
          field names, data types, and organizational structure. Focus on creating
          comprehensive blueprints for document processing without extracting actual values.
        user_prompt: >-
          This image contains forms data. Analyze the form line by line...
Output Structure:
    {
      "document_class": "W4-Form",
      "document_description": "Employee withholding allowance certificate",
      "groups": [
        {
          "name": "EmployeeInformation",
          "description": "Personal details of the employee",
          "attributeType": "group",
          "groupType": "normal",
          "groupAttributes": [
            {
              "name": "FirstName",
              "dataType": "string",
              "description": "Employee's first name from line 1"
            }
          ]
        }
      ]
    }
Discovery With Ground Truth
This method uses existing labeled data or known field definitions to optimize the discovery process.
How it Works:
- Ground truth data is loaded from S3 (JSON format)
- Document is analyzed with reference to expected fields
- LLM matches document structure to ground truth patterns
- Field descriptions and types are refined based on known data
- Missing or additional fields are identified and documented
Best For:
- Improving existing configurations
- Leveraging known document structures
- Ensuring consistency with established patterns
- Optimizing field extraction accuracy
Ground Truth Format:
    {
      "document_class": "W4Form",
      "employee_name": "John Smith",
      "ssn": "123-45-6789",
      "address": "123 Main Street",
      "filing_status": "Single",
      "dependents": 0
    }
Configuration Example:
    discovery:
      with_ground_truth:
        model_id: "us.amazon.nova-pro-v1:0"
        temperature: 1.0
        top_p: 0.1
        max_tokens: 10000
        system_prompt: >-
          You are an expert in processing forms. Use provided ground truth data as
          reference to optimize field extraction and ensure consistency with expected
          document structure and field definitions.
        user_prompt: >-
          This image contains unstructured data. Analyze the data line by line using
          the provided ground truth as reference...
          <GROUND_TRUTH_REFERENCE>
          {ground_truth_json}
          </GROUND_TRUTH_REFERENCE>
Multi-Section Package Discovery
For multi-page document packages (e.g., lending packages, insurance packets, healthcare bundles) that contain multiple different document types, the Discovery module supports discovering multiple classes from a single PDF by defining page ranges.
Discovery Modes
When a PDF file is selected, the UI presents two discovery modes:
- Single Section Document: Discovers one class from the entire document (with optional ground truth). This is the original behavior.
- Multi-Section Package: Define page ranges to discover multiple classes from different sections of the document. Each range creates a separate, independent discovery job.
Page Range Selection
In Multi-Section Package mode, the UI displays:
- PDF Page Thumbnails: rendered in the browser using pdfjs-dist, showing a visual grid of all pages with color-coded highlighting for each defined range
- Page Range Inputs: editable start/end page numbers for each range
- Document Type Labels: optional text field per range for labeling the document type (e.g., “W2 Form”, “Invoice”). When provided, the label is used as a class name hint for the discovery LLM.
AI Auto-Detect Sections
The “✨ Auto-detect sections” button uses an LLM to automatically identify document section boundaries:
- The PDF is uploaded to S3
- The autoDetectSections GraphQL mutation calls a Lambda that sends the full PDF to Bedrock
- The LLM returns a JSON array of sections: [{"start": 1, "end": 2, "type": "Letter"}, ...]
- Page ranges are auto-populated with the LLM’s boundary detection, including type labels
- User can review, adjust ranges, edit labels, then click “Start Discovery”
The auto-detect prompt is fully configurable via the Discovery Configuration in View/Edit Configuration (discovery.auto_split section).
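Because the section boundaries come from an LLM, it can be worth sanity-checking the returned array before using it to create jobs. The following is a minimal sketch only: `validate_sections` is a hypothetical helper, not part of the documented API, and it assumes page numbers are 1-based and inclusive, as in the example response above.

```python
import json


def validate_sections(raw_json: str, total_pages: int) -> list[dict]:
    """Parse an auto-detect response and check its page ranges (hypothetical helper)."""
    sections = json.loads(raw_json)
    if not isinstance(sections, list):
        raise ValueError("expected a JSON array of sections")

    validated = []
    previous_end = 0
    # Walk the sections in page order so overlaps are easy to spot.
    for section in sorted(sections, key=lambda s: s["start"]):
        start, end = int(section["start"]), int(section["end"])
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"page range {start}-{end} is outside 1-{total_pages}")
        if start <= previous_end:
            raise ValueError(f"page range {start}-{end} overlaps an earlier section")
        validated.append({"start": start, "end": end, "type": section.get("type", "")})
        previous_end = end
    return validated


raw = '[{"start": 3, "end": 5, "type": "W2 Form"}, {"start": 1, "end": 2, "type": "Letter"}]'
sections = validate_sections(raw, total_pages=5)
```

Gaps between ranges are allowed in this sketch, since a package may contain pages that belong to no section of interest.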
Configuration
    discovery:
      auto_split:
        model_id: "us.amazon.nova-pro-v1:0"
        temperature: 0.0  # Low temperature for consistent boundary detection
        top_p: 0.1
        max_tokens: 4096
        system_prompt: >-
          You are an expert document analyst. Your task is to identify distinct
          document sections within a multi-page document package. Return only valid JSON.
        user_prompt: >-
          Analyze this multi-page document package. Identify the page boundaries where
          different document types or sections begin and end...
API Usage
    from idp_common.discovery.classes_discovery import ClassesDiscovery

    discovery = ClassesDiscovery(
        input_bucket="my-bucket",
        input_prefix="lending_package.pdf",
        version="my-config-version",
    )

    # Auto-detect section boundaries
    sections = discovery.auto_detect_sections(
        input_bucket="my-bucket",
        input_prefix="lending_package.pdf",
    )
    # Returns: [{"start": 1, "end": 2, "type": "Letter"}, {"start": 3, "end": 5, "type": "W2 Form"}, ...]

    # Discover a specific page range with class name hint
    result = discovery.discovery_classes_with_document(
        input_bucket="my-bucket",
        input_prefix="lending_package.pdf",
        page_range="3-5",
        class_name_hint="W2 Form",
    )
How Page Extraction Works
When a page_range is specified for a PDF, the system uses pypdfium2 to extract only the specified pages into a new sub-PDF before sending to the Bedrock LLM. This means:
- The LLM only sees the relevant pages, improving accuracy
- Each page range job is independent and can run in parallel
- The original document is never modified
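To make the page-range handling concrete, here is a minimal sketch of turning a string such as "3-5" into page indices. `parse_page_range` is a hypothetical helper; the sketch assumes the range is 1-based and inclusive (matching the UI's start/end page numbers) and that the PDF library indexes pages from zero. The module's actual parsing may differ.

```python
def parse_page_range(page_range: str, total_pages: int) -> list[int]:
    """Convert a 1-based inclusive range like "3-5" into zero-based page indices."""
    start_text, _, end_text = page_range.partition("-")
    start = int(start_text)
    # A bare page number such as "4" is treated as a single-page range.
    end = int(end_text) if end_text else start
    if not 1 <= start <= end <= total_pages:
        raise ValueError(f"invalid page range {page_range!r} for {total_pages} pages")
    return list(range(start - 1, end))


indices = parse_page_range("3-5", total_pages=10)
```

The resulting indices identify the pages that would be copied into the sub-PDF, leaving the original file untouched.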
Multi-Document Collection Discovery
While the other discovery methods analyze a single document at a time, Multi-Document Collection Discovery discovers document classes from a collection of documents using embedding-based clustering. Given a folder of mixed documents (e.g., invoices, W-2s, bank statements), it automatically groups similar documents together and generates a JSON Schema and classification for each group.
Requires extra dependencies:
pip install "idp_common[multi_document_discovery]" or make setup from the project root. This installs scikit-learn, scipy, numpy, strands-agents, and pypdfium2.
Minimum 2 documents per class: Clusters with fewer than 2 documents are filtered as noise. Ensure you provide at least 2 documents for each expected document type.
How It Works
    graph LR
        A[Document Collection] --> B[Embed]
        B --> C[Cluster]
        C --> D[Analyze]
        D --> E[Reflect]
        E --> F[Discovered Classes + Schemas]
- Embed: Each document’s first page is rendered to an image and embedded as a vector using Amazon Bedrock (us.cohere.embed-v4:0 by default)
- Cluster: Embeddings are clustered using KMeans with automatic K selection via silhouette analysis (scikit-learn). Clusters with fewer than min_cluster_size (default: 2) documents are filtered as noise.
- Analyze: For each cluster, a Strands agent with Claude (us.anthropic.claude-sonnet-4-6) examines sample document images and generates a classification name + JSON Schema definition
- Reflect: The agent produces a Markdown reflection report reviewing all discovered classes, their relationships, and potential overlaps
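The noise-filtering rule in the Cluster step can be illustrated with a small sketch. It assumes cluster labels are already available as one integer per document (for example, from a KMeans fit); `filter_small_clusters` is a hypothetical name and not the module's actual function.

```python
from collections import Counter


def filter_small_clusters(labels: list[int], min_cluster_size: int = 2) -> dict[int, list[int]]:
    """Group document indices by cluster label, dropping clusters that are too small."""
    sizes = Counter(labels)
    kept: dict[int, list[int]] = {}
    for doc_index, label in enumerate(labels):
        # Clusters below the threshold are treated as noise and skipped.
        if sizes[label] >= min_cluster_size:
            kept.setdefault(label, []).append(doc_index)
    return kept


# Six documents: cluster 2 contains a single document and is filtered out.
labels = [0, 1, 0, 2, 1, 0]
clusters = filter_small_clusters(labels)
```

This is why each expected document type needs at least two samples: a lone document forms a cluster of size one and never reaches the Analyze step.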
Supported File Types
.pdf, .png, .jpg, .jpeg, .tiff, .tif, .webp
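When preparing a local folder, it can help to check in advance which files would qualify. The sketch below filters a directory by the extensions listed above. `find_supported_documents` is a hypothetical helper; case-insensitive matching and a non-recursive scan are assumptions of this sketch, not documented behavior.

```python
from pathlib import Path

# Extensions taken from the supported file types list above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".webp"}


def find_supported_documents(document_dir: str) -> list[Path]:
    """Return the files directly inside document_dir that have a supported extension."""
    root = Path(document_dir)
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_supported_documents("/path/to/documents/")` returns a sorted list of paths that could then be passed on as explicit files.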
Two Execution Modes
| Mode | Documents Source | Use Case | Entry Point |
|---|---|---|---|
| Local | Local filesystem | CLI/SDK development, no AWS infra needed | run_local_pipeline() |
| S3 | Amazon S3 bucket | Lambda/Step Functions, production workloads | run_full_pipeline() |
Both modes produce identical MultiDocDiscoveryResult output and support the same pipeline steps.
Usage — IDP CLI
The simplest way to run multi-document discovery:
    # Discover classes from a directory of documents
    idp-cli discover-multidoc --dir /path/to/documents/

    # With explicit files
    idp-cli discover-multidoc -d invoice1.pdf -d invoice2.pdf -d w2_form.pdf -d w2_form2.pdf

    # Save results to a configuration version
    idp-cli discover-multidoc --dir /path/to/documents/ --save-to-config --config-version v1
See IDP CLI Reference: discover-multidoc for all options.
Usage — IDP SDK
    from idp_sdk import IDPClient

    client = IDPClient()
    result = client.discovery.run_multi_doc(
        document_dir="/path/to/documents/",
        progress_callback=lambda step, data: print(f"{step}: {data}"),
    )

    print(f"Status: {result.status}")
    print(f"Found {result.total_clusters} clusters from {result.total_documents} documents")

    for cls in result.discovered_classes:
        print(f"  {cls.classification} - {cls.document_count} docs")
        if cls.json_schema:
            print(f"    Fields: {list(cls.json_schema.get('properties', {}).keys())}")

    print(result.reflection_report)
See IDP SDK Reference: discovery.run_multi_doc() for all parameters.
Usage — idp_common Directly
    from idp_common.discovery.multi_document_discovery import MultiDocumentDiscovery

    discovery = MultiDocumentDiscovery(
        region="us-east-1",
        config={
            "embedding_model_id": "us.cohere.embed-v4:0",
            "analysis_model_id": "us.anthropic.claude-sonnet-4-6",
            "min_cluster_size": 2,
        },
    )

    # Local pipeline
    result = discovery.run_local_pipeline(
        document_dir="/path/to/documents/",
        config_version="v1",  # Optional: save to DynamoDB config
    )

    # Or S3 pipeline
    result = discovery.run_full_pipeline(
        bucket="my-bucket",
        prefix="documents/batch-001/",
    )
See idp_common API Reference: MultiDocumentDiscovery for full method-level documentation.
Output
The pipeline returns a MultiDocDiscoveryResult containing:
| Field | Description |
|---|---|
| discovered_classes | List of discovered classes, each with classification, json_schema, document_count, sample_doc_ids |
| reflection_report | Markdown report analyzing all discovered classes |
| total_documents | Total documents processed |
| num_clusters | Number of clusters found |
| num_failed_embeddings | Documents that failed embedding |
| num_successful_schemas / num_failed_schemas | Schema generation success/failure counts |
Web UI
The Web UI includes a Multi-Doc Discovery tab in the Discovery panel. This tab allows you to:
- Select a directory of documents or upload multiple files
- Monitor pipeline progress with step-by-step status updates
- View discovered classes and their generated schemas
Note: The Web UI multi-doc discovery feature requires the IDP stack to be deployed with the multi-document discovery nested stack enabled.
Best For
- Bulk onboarding: You have a folder of hundreds of mixed documents and want to automatically discover all document types
- No prior knowledge: You don’t know what types of documents are in the collection
- Classification + Schema in one step: Generates both the document class names and extraction schemas simultaneously
- Local development: Run from your workstation with just Bedrock model access — no AWS deployment needed
Choosing the Right Method
| Factor | Without Ground Truth | With Ground Truth | Multi-Section Package | Multi-Document Collection |
|---|---|---|---|---|
| Input | Single document | Single document + ground truth | Single multi-page PDF | Collection of documents |
| Use Case | New document exploration | Configuration optimization | Mixed document packages | Bulk class discovery |
| Accuracy | Good for structure discovery | Higher for known patterns | Good per-section | Good for clustering |
| Speed | Fast, single-pass | Optimized with reference | Parallel per range | Minutes for 100+ docs |
| Setup Effort | Minimal | Requires ground truth | Define page ranges | Just point at a folder |
| Output | 1 class + schema | 1 class + schema | N classes + schemas | N classes + schemas |
| Best For | Unknown document types | Improving existing workflows | Known multi-doc packages | Unknown mixed collections |
| Extra Deps | None | None | None | multi_document_discovery pip extra |
Configuration
The Discovery module is configured through the deployment template and configuration files. All settings can be customized through the web UI’s View/Edit Configuration panel or by editing the configuration files directly.
Model Configuration
Supported Models:
- us.amazon.nova-lite-v1:0 - Fast, cost-effective for simple documents
- us.amazon.nova-pro-v1:0 - Balanced performance and accuracy (recommended)
- us.amazon.nova-premier-v1:0 - Highest accuracy for complex documents
- us.anthropic.claude-3-haiku-20240307-v1:0 - Fast processing
- us.anthropic.claude-3-5-sonnet-20241022-v2:0 - High accuracy
- us.anthropic.claude-3-7-sonnet-20250219-v1:0 - Latest capabilities
Model Parameters:
    discovery:
      without_ground_truth:
        model_id: "us.amazon.nova-pro-v1:0"
        temperature: 1.0   # Creativity level (0.0-1.0)
        top_p: 0.1         # Nucleus sampling (0.0-1.0)
        max_tokens: 10000  # Maximum response length
Parameter Guidelines:
- Temperature: Use 1.0 for creative structure discovery, 0.0 for consistent results
- Top P: Lower values (0.1) for focused analysis, higher for diverse interpretations
- Max Tokens: 10000+ recommended for complex documents with many fields
Prompt Customization
System Prompt Configuration:
    discovery:
      without_ground_truth:
        system_prompt: >-
          You are an expert in processing forms. Extracting data from images and documents.
          Analyze forms line by line to identify field names, data types, and organizational
          structure. Focus on creating comprehensive blueprints for document processing
          without extracting actual values.
User Prompt Configuration:
    discovery:
      without_ground_truth:
        user_prompt: >-
          This image contains forms data. Analyze the form line by line.
          Image may contains multiple pages, process all the pages.
          Form may contain multiple name value pair in one line.
          Extract all the names in the form including the name value pair which doesn't have value.
          Organize them into groups, extract field_name, data_type and field description.
          Field_name should be less than 60 characters, should not have space use '-' instead of space.
          field_description is a brief description of the field and the location of the field
          like box number or line number in the form and section of the form.
          Field_name should be unique within the group.
          Add two fields document_class and document_description.
          For document_class generate a short name based on the document content like W4, I-9, Paystub.
          For document_description generate a description about the document in less than 50 words.
          Group the fields based on the section they are grouped in the form.
          Group should have attributeType as "group".
          If the group repeats, add an additional field groupType and set the value as "Table".
          Do not extract the values.
          Return the extracted data in JSON format.
Ground Truth Prompt Features:
- Placeholder Support: Use {ground_truth_json} for dynamic ground truth injection
- Reference Integration: Automatically includes ground truth data in analysis context
- Consistency Enforcement: Ensures field names and types match reference patterns
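The placeholder mechanism can be sketched as a plain string substitution. This is an illustration under assumptions, not the module's implementation: `build_user_prompt` is a hypothetical helper, and the template text is abbreviated from the configuration example shown earlier.

```python
import json


def build_user_prompt(template: str, ground_truth: dict) -> str:
    """Inject ground truth JSON into a prompt template (hypothetical helper)."""
    placeholder = "{ground_truth_json}"
    if placeholder not in template:
        raise ValueError("template has no {ground_truth_json} placeholder")
    # str.replace is used instead of str.format because prompts often contain
    # literal JSON braces that str.format would try to interpret as fields.
    return template.replace(placeholder, json.dumps(ground_truth, indent=2))


template = (
    "Analyze the data line by line using the provided ground truth as reference.\n"
    "<GROUND_TRUTH_REFERENCE>\n{ground_truth_json}\n</GROUND_TRUTH_REFERENCE>"
)
prompt = build_user_prompt(template, {"document_class": "W4Form", "dependents": 0})
```

Failing loudly when the placeholder is missing helps catch a customized prompt that would otherwise silently ignore the ground truth file.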
Output Format Configuration
Sample JSON Structure:
    discovery:
      output_format:
        sample_json: >-
          {
            "document_class": "Form-1040",
            "document_description": "Brief summary of the document",
            "groups": [
              {
                "name": "PersonalInformation",
                "description": "Personal information of Tax payer",
                "attributeType": "group",
                "groupType": "normal",
                "groupAttributes": [
                  {
                    "name": "FirstName",
                    "dataType": "string",
                    "description": "First Name of Taxpayer"
                  },
                  {
                    "name": "Age",
                    "dataType": "number",
                    "description": "Age of Taxpayer"
                  }
                ]
              }
            ]
          }
Field Types Supported:
- string - Text fields, names, addresses
- number - Numeric values, amounts, quantities
- date - Date fields in various formats
- boolean - Yes/no, checkbox fields
- array - Lists or repeated elements
Group Types:
- normal - Standard field groupings
- List - Repeating tabular data structures
Configuration Management
Schema Definition:
The Discovery configuration is defined in the CloudFormation template with comprehensive UI schema support:
    UpdateSchemaConfig:
      Type: AWS::CloudFormation::CustomResource
      Properties:
        ServiceToken: !Ref UpdateConfigurationFunctionArn
        Schema:
          type: object
          properties:
            discovery:
              order: 5
              type: object
              sectionLabel: Discovery Configuration
              description: Configuration for document class discovery functionality
              properties:
                without_ground_truth:
                  order: 0
                  type: object
                  sectionLabel: Discovery Without Ground Truth
                  # ... detailed field definitions
UI Integration Features:
- Dropdown Model Selection: Predefined list of supported Bedrock models
- Range Validation: Temperature, top_p with proper min/max values
- Textarea Prompts: Multi-line editing for system and user prompts
- Real-time Validation: Immediate feedback on configuration changes
- Help Text: Contextual descriptions for each configuration option
Default Settings:
    discovery:
      without_ground_truth:
        model_id: "us.amazon.nova-pro-v1:0"
        temperature: 1.0
        top_p: 0.1
        max_tokens: 10000
        system_prompt: "You are an expert in processing forms..."
        user_prompt: "This image contains forms data..."
      with_ground_truth:
        model_id: "us.amazon.nova-pro-v1:0"
        temperature: 1.0
        top_p: 0.1
        max_tokens: 10000
        system_prompt: "You are an expert in processing forms..."
        user_prompt: "This image contains unstructured data..."
      output_format:
        sample_json: "{...}"
Configuration Loading Priority:
- Custom configuration from DynamoDB (if available)
- Mode-specific system defaults (pattern-1.yaml for BDA mode, pattern-2.yaml for Pipeline mode)
- Built-in default configuration
- Environment variable fallbacks
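The priority order above can be pictured as a layered lookup. The source does not say whether the layers are merged key by key or whether the first available source is used as a whole; the sketch below assumes per-key fallback, and `resolve_discovery_settings` is a hypothetical name.

```python
from collections import ChainMap


def resolve_discovery_settings(custom, mode_defaults, builtin_defaults, env_fallbacks):
    """Resolve settings in priority order; earlier layers win (assumed per-key merge)."""
    ordered = (custom, mode_defaults, builtin_defaults, env_fallbacks)
    # Skip layers that are missing or empty, keeping the priority order intact.
    layers = [layer for layer in ordered if layer]
    return dict(ChainMap(*layers))


settings = resolve_discovery_settings(
    custom={"temperature": 0.0},
    mode_defaults={"model_id": "us.amazon.nova-pro-v1:0", "temperature": 1.0},
    builtin_defaults={"top_p": 0.1, "max_tokens": 10000},
    env_fallbacks=None,
)
```

In this example the custom temperature overrides the mode default, while the remaining keys fall through to lower-priority layers.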
Customization Options:
- Model Selection: Choose models based on document complexity and processing requirements. Balance accuracy vs. cost vs. speed. Consider context window limits for large documents.
- Prompt Engineering: Customize system prompts for domain-specific terminology. Adjust user prompts for specific document layouts. Include examples or constraints in prompts.
- Parameter Tuning: Adjust temperature for consistency vs. creativity. Modify top_p for focused vs. diverse analysis. Set appropriate max_tokens for document complexity.
- Output Customization: Define custom field naming conventions. Specify required field types and formats. Configure grouping and organizational patterns.
Using the Discovery Module
Web UI Interface
Accessing Discovery:
- Navigate to the main application dashboard
- Click on the “Discovery” tab or panel
- Select a Configuration Version to save discovered classes to
- Upload a document file (PDF, PNG, JPG, TIFF)
- For PDFs, choose a Discovery Mode:
- Single Section Document — discovers one class from the whole document; optionally upload a ground truth JSON file
- Multi-Section Package — define page ranges (manually or via ✨ Auto-detect) to discover multiple classes
- Click “Start Discovery” (or “Start Discovery (N sections)” for multi-section)
- Monitor progress in real-time in the Discovery Jobs table below
Monitoring Progress:
- Real-time progress messages via GraphQL subscriptions (e.g., “Analyzing document structure with AI…”, “Saving to configuration…”)
- Live elapsed time counter for active jobs
- Discovered document class name shown as a green badge on success (e.g., W4-Form)
- Failure root cause displayed in expandable error details with user-friendly messages
- Search/filter bar to find jobs by document name, config version, status, or class name
- Time range selector (Last hour, 24 hours, 2 days, 7 days, All time)
- Pagination with configurable page size
- Resizable columns and column visibility preferences (settings gear icon)
- Multi-select with delete capability to clean up old jobs
Reviewing Results:
- Discovered class name prominently displayed as a badge in the Result column
- Config Version hyperlinked to the configuration editor
- Original document filename displayed (timestamp prefix stripped)
- Duration column showing total processing time
- Export options for configuration integration
- Comparison with existing configurations
API Integration
GraphQL Mutations:
    mutation StartDiscoveryJob($input: DiscoveryJobInput!) {
      startDiscoveryJob(input: $input) {
        jobId
        status
        message
      }
    }
Job Status Subscription:
    subscription OnDiscoveryJobStatusChange($jobId: ID!) {
      onDiscoveryJobStatusChange(jobId: $jobId) {
        jobId
        status
        progress
        errorMessage
        result
      }
    }
Direct API Usage:
    from idp_common.discovery.classes_discovery import ClassesDiscovery

    # Initialize with configuration
    discovery = ClassesDiscovery(
        input_bucket="my-documents",
        input_prefix="sample-form.pdf",
        config=my_discovery_config,
        region="us-west-2",
    )

    # Run discovery without ground truth
    result = discovery.discovery_classes_with_document(
        input_bucket="my-documents",
        input_prefix="sample-form.pdf",
    )

    # Run discovery with ground truth
    result = discovery.discovery_classes_with_document_and_ground_truth(
        input_bucket="my-documents",
        input_prefix="sample-form.pdf",
        ground_truth_key="ground-truth.json",
    )
Processing Results
Result Structure:
    {
      "status": "SUCCESS",
      "jobId": "discovery-job-12345",
      "message": "Discovery completed successfully",
      "configuration": {
        "document_class": "W4Form",
        "document_description": "Employee withholding certificate",
        "groups": [...],
        "metadata": {
          "processing_time": "45.2s",
          "model_used": "us.amazon.nova-pro-v1:0",
          "confidence_score": 0.92
        }
      }
    }
Integration Options:
- Direct Configuration Update: Automatically update existing configuration
- Export for Review: Download configuration for manual review and editing
- Merge with Existing: Combine with current document class definitions
- Create New Class: Add as new document type to existing configuration
BDA Integration
When running in BDA mode (use_bda: true), discovery provides additional automation for BDA blueprint management. This section covers BDA-specific features that are not active in Pipeline mode.
Automated Blueprint Creation
When discovery completes in BDA mode, the system automatically:
- Analyzes Discovery Results: Processes the discovery output
- Converts to BDA Schema: Transforms field definitions to BDA-compatible format
- Creates/Updates Blueprints: Manages blueprint lifecycle in the BDA project
- Versions Blueprints: Automatically creates new versions when changes are detected
- Integrates with Project: Ensures blueprints are available for document processing
Blueprint Naming Convention:
    {StackName}-{DocumentClass}-{UniqueId}
Example: IDP-W4Form-a1b2c3d4
BDA Blueprint Automation Flow:
    graph TD
        A[View/Edit Configuration UI] --> B[Save Changes]
        B --> C[Configuration Update Event]
        C --> D[BDA Discovery Lambda - Blueprint Service]
        D --> E{Blueprint Exists?}
        E -->|Yes| F[Check for Changes]
        E -->|No| G[Create New Blueprint]
        F -->|Changes Found| H[Update Blueprint]
        F -->|No Changes| S[Skip Update]
        G --> I[Schema Converter]
        H --> I
        I --> J[Generate BDA Schema]
        J --> K[Create/Update in BDA]
        K --> L[Create Blueprint Version]
        L --> M[Update Project]
        M --> N[Success Response]
Intelligent Update Detection
The system only updates blueprints when actual changes are detected:
Change Detection Logic:
- Document Class Changes: Name or description modifications
- Field Changes: New fields, modified descriptions, or data type changes
- Group Changes: Structural changes in field organization
- Schema Changes: Any modification that affects BDA blueprint structure
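One common way to implement this kind of comparison is to fingerprint a canonical serialization of each definition. The sketch below is an assumption about how such a check could work, not the logic inside the blueprint service; `schema_fingerprint` and `has_changes` are hypothetical names.

```python
import hashlib
import json


def schema_fingerprint(class_definition: dict) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(class_definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changes(new_definition: dict, existing_definition: dict) -> bool:
    """Report whether two class definitions differ in any name, type, or description."""
    return schema_fingerprint(new_definition) != schema_fingerprint(existing_definition)


existing = {"name": "W4Form", "fields": [{"name": "FirstName", "dataType": "string"}]}
reordered = {"fields": [{"dataType": "string", "name": "FirstName"}], "name": "W4Form"}
retyped = {"name": "W4Form", "fields": [{"name": "FirstName", "dataType": "number"}]}
```

With these inputs, `has_changes(reordered, existing)` is false because only key order differs, while `has_changes(retyped, existing)` is true because a data type changed. Note that reordering list elements would still count as a change in this sketch.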
Update Process:
    # Example of intelligent update detection
    if blueprint_exists:
        if self._check_for_updates(custom_class, existing_blueprint):
            # Update existing blueprint
            self.blueprint_creator.update_blueprint(
                blueprint_arn=blueprint_arn,
                stage="LIVE",
                schema=json.dumps(blueprint_schema),
            )
            # Create new version
            self.blueprint_creator.create_blueprint_version(
                blueprint_arn=blueprint_arn,
                project_arn=self.dataAutomationProjectArn,
            )
        else:
            logger.info("No updates needed - blueprint unchanged")
BDA Schema Conversion
Field Type Mapping: Discovery field types are automatically converted to BDA-compatible formats:
| Discovery Type | BDA Schema Type | Description |
|---|---|---|
| string | string | Text fields, names, addresses |
| number | number | Numeric values, amounts |
| date | string with date format | Date fields with validation |
| boolean | boolean | Yes/no, checkbox fields |
| array | array | Lists or repeated elements |
Group Conversion:
- Normal Groups: Converted to BDA object definitions
- Table Groups: Converted to BDA array structures with item templates
- Nested Groups: Supported through BDA schema references
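As a rough illustration of the type mapping and normal-group conversion described above, here is a minimal sketch. The function names are hypothetical, the kebab-case property naming is inferred from the W4Form example in this document, and table groups, nested groups, and date-format handling are deliberately left out.

```python
import re

# Discovery type -> BDA schema type, following the mapping table above.
# "date" becomes a plain string here; date-format validation is omitted.
TYPE_MAP = {
    "string": "string",
    "number": "number",
    "date": "string",
    "boolean": "boolean",
    "array": "array",
}


def to_kebab_case(name: str) -> str:
    """Convert 'FirstName' to 'first-name' (assumed naming rule)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "-", name).lower()


def convert_to_bda_schema(discovery_result: dict) -> dict:
    """Convert normal (non-table) groups into BDA object definitions."""
    definitions = {}
    for group in discovery_result.get("groups", []):
        properties = {
            to_kebab_case(attr["name"]): {
                "type": TYPE_MAP.get(attr["dataType"], "string"),
                "instruction": attr.get("description", ""),
            }
            for attr in group.get("groupAttributes", [])
        }
        definitions[group["name"]] = {"type": "object", "properties": properties}
    return {
        "class": discovery_result["name"],
        "description": discovery_result.get("description", ""),
        "definitions": definitions,
    }
```

Unknown discovery types fall back to `string` in this sketch; a stricter converter might raise an error instead.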
Example Schema Conversion:
    // Discovery Result
    {
      "name": "W4Form",
      "description": "Employee withholding certificate",
      "groups": [
        {
          "name": "PersonalInfo",
          "groupAttributes": [
            {
              "name": "FirstName",
              "dataType": "string",
              "description": "Employee first name from line 1"
            }
          ]
        }
      ]
    }

    // Generated BDA Schema
    {
      "class": "W4Form",
      "description": "Employee withholding certificate",
      "definitions": {
        "PersonalInfo": {
          "type": "object",
          "properties": {
            "first-name": {
              "type": "string",
              "instruction": "Employee first name from line 1"
            }
          }
        }
      }
    }

Blueprint Lifecycle Management
Creation Workflow:
- Discovery Completion: Discovery results are generated and saved to the configuration table
- Configuration Event: BDA Discovery Function receives a configuration update event
- Blueprint Service: Processes configuration and manages blueprint lifecycle
- Schema Generation: Converts discovery results to BDA schema format
- Blueprint Creation: Creates new blueprint in BDA service
- Project Integration: Associates blueprint with BDA project
- Version Management: Creates initial blueprint version
Update Workflow:
- Change Detection: Compares new discovery results with existing blueprint
- Schema Update: Generates updated BDA schema if changes detected
- Blueprint Update: Updates existing blueprint with new schema
- Version Creation: Creates new blueprint version
- Project Sync: Ensures project references latest version
Project Association:
- Blueprints are automatically associated with the configured BDA project
- Project ARN is specified during stack deployment
- Multiple document classes can share the same BDA project
- Projects always reference the latest blueprint version
Required Permissions:
    BDABlueprintPermissions:
      - bedrock:CreateBlueprint
      - bedrock:UpdateBlueprint
      - bedrock:CreateBlueprintVersion
      - bedrock:ListBlueprints
      - bedrock:GetBlueprint
      - bedrock:DeleteBlueprint
      - bedrock:InvokeBlueprintOptimizationAsync
      - bedrock:GetBlueprintOptimizationStatus

Monitoring:
- Blueprint creation/update activities are logged to CloudWatch
- Schema conversion details are captured
- Error conditions are clearly documented
Blueprint Optimization
The Blueprint Optimization feature uses the BDA InvokeBlueprintOptimizationAsync API to automatically improve extraction accuracy for discovered document classes. When a discovery job includes a ground truth file, the system can optimize the BDA blueprint by comparing extraction results against the ground truth and refining the blueprint schema.
How It Works
- Blueprint Lookup: The optimizer checks whether a blueprint already exists for the discovered class in the BDA project. If found, it reuses the existing blueprint; otherwise, it creates a new one following the standard naming convention ({StackName}-{ClassName}-{hash}).
- S3 Asset Preparation: The sample document (PDF) and ground truth (JSON) S3 URIs are constructed from the discovery bucket.
- Optimization Invocation: The InvokeBlueprintOptimizationAsync API is called with the blueprint ARN, sample document, ground truth, and an output S3 prefix.
- Status Polling: The system polls GetBlueprintOptimizationStatus with exponential backoff (5s initial, 30s max, 15-minute timeout) until a terminal state is reached.
- Results Evaluation: The optimization results (stored at {outputPrefix}/optimization_results.json) contain before/after metrics. The system compares exactMatch and f1 scores.
- Schema Application: If the optimized schema shows improvement, the blueprint is updated with the new schema, a new version is created, and the IDP class configuration is updated.
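The status polling step can be sketched as a small loop with capped exponential backoff. This is a sketch under assumptions: `poll_until_terminal` is a hypothetical helper, the terminal state labels are taken from the optimization flow in this document, and the doubling factor is a guess (the source states only the 5s initial delay, 30s cap, and 15-minute timeout). The status lookup is passed in as a callable, so no BDA API call is shown.

```python
import time
from typing import Callable

# Terminal state labels as named in the optimization flow (assumed spellings)
TERMINAL_STATES = {"Success", "ServiceError", "ClientError"}


def poll_until_terminal(
    get_status: Callable[[], str],
    initial_delay: float = 5.0,
    max_delay: float = 30.0,
    timeout: float = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until a terminal state is seen or the timeout would be exceeded."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() + delay > deadline:
            return "TIMED_OUT"
        sleep(delay)
        delay = min(delay * 2, max_delay)  # 5s, 10s, 20s, 30s, 30s, ...
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time.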
Optimization Flow
    graph TD
        A[Discovery Completes with Ground Truth] --> B[Blueprint Optimization Lambda]
        B --> C{Existing Blueprint?}
        C -->|Yes| D[Reuse Existing Blueprint]
        C -->|No| E[Create New Blueprint]
        D --> F[Invoke Optimization API]
        E --> F
        F --> G[Poll for Completion]
        G --> H{Result?}
        H -->|Success| I[Fetch Results from S3]
        H -->|ServiceError/ClientError| J[Report Failure]
        H -->|Timeout| J
        I --> K{Improved?}
        K -->|Yes| L[Update Blueprint Schema]
        K -->|No| M[Keep Original Schema]
        L --> N[Create Blueprint Version]
        N --> O[Update IDP Config]
        O --> P[Report OPTIMIZATION_COMPLETED]
        M --> P
        J --> Q[Report OPTIMIZATION_FAILED]

UI Status Display
The Discovery Panel shows optimization progress with dedicated status indicators:
| Status | UI Label | Description |
|---|---|---|
| OPTIMIZATION_IN_PROGRESS | Optimizing | Optimization is running (blueprint creation, API invocation, polling) |
| OPTIMIZATION_COMPLETED | Optimized | Optimization finished (improved or no improvement) |
| OPTIMIZATION_FAILED | Optimization Failed | Optimization encountered an error |
The Result column shows additional context:
- Improved: Class name badge + accuracy improvement message (e.g., “exactMatch: 0.78 → 0.91”)
- No improvement: Message indicating original schema was kept
- Failed: Expandable error details
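To make the "Improved" versus "No improvement" outcome concrete, the sketch below compares before/after metrics and builds a result message. Everything here is assumed for illustration: the `before`/`after` layout of the results dictionary, the rule that neither metric may regress and at least one must rise, and the function name `evaluate_optimization`. The actual structure of optimization_results.json may differ.

```python
def evaluate_optimization(results: dict) -> tuple[bool, str]:
    """Return (improved, message) from assumed before/after metric dictionaries."""
    before, after = results["before"], results["after"]
    no_regression = (
        after["exactMatch"] >= before["exactMatch"] and after["f1"] >= before["f1"]
    )
    any_gain = after["exactMatch"] > before["exactMatch"] or after["f1"] > before["f1"]
    if no_regression and any_gain:
        message = f"exactMatch: {before['exactMatch']:.2f} → {after['exactMatch']:.2f}"
        return True, message
    return False, "No improvement: original schema kept"


sample = {
    "before": {"exactMatch": 0.78, "f1": 0.80},
    "after": {"exactMatch": 0.91, "f1": 0.85},
}
print(evaluate_optimization(sample))
```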
Components
- BlueprintOptimizer (lib/idp_common_pkg/idp_common/bda/blueprint_optimizer.py): Core orchestrator that manages the full optimization lifecycle, including blueprint lookup/creation, API invocation, polling, evaluation, and schema application.
- blueprint_optimization Lambda (src/lambda/blueprint_optimization/index.py): Async Lambda handler invoked by the discovery processor. Manages AppSync status updates and error reporting.
- OptimizationResult: Dataclass returned by the optimizer with status, metrics, blueprint ARN, and optionally the optimized schema.
Configuration
Blueprint optimization is disabled by default. To enable it, set both use_bda: true and enable_blueprint_optimization: true in your configuration version via the View/Edit Configuration UI or directly in the config YAML:

    use_bda: true
    enable_blueprint_optimization: true

When enabled, the optimizer uses:
- The same BDA project as the main blueprint service (per configuration version)
- The same blueprint naming convention ({StackName}-{ClassName}-{hash})
- The discovery bucket for S3 input/output URIs
- The bedrock-data-automation client with boto3>=1.42.0 (bundled in the Lambda function's requirements.txt)
IAM Permissions
The Blueprint Optimization Lambda requires these additional Bedrock permissions (configured in template.yaml):

    - bedrock:InvokeBlueprintOptimizationAsync
    - bedrock:GetBlueprintOptimizationStatus

Resource ARN patterns:

    - arn:${AWS::Partition}:bedrock:${AWS::Region}:${AWS::AccountId}:blueprint/*
    - arn:${AWS::Partition}:bedrock:${AWS::Region}:${AWS::AccountId}:blueprint-optimization-invocation/*

Retry and Error Handling
- S3 Eventual Consistency: The optimization results file may not be immediately available after the API reports success. The system retries up to 5 times with 2-second delays.
- Polling Timeout: If optimization doesn't complete within 15 minutes, the result is TIMED_OUT.
- API Errors: ServiceError and ClientError from the BDA API are captured and reported as OPTIMIZATION_FAILED.
- Blueprint Not Found: If the blueprint stage doesn't match (must be LIVE), the API returns ResourceNotFoundException.
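The results-file retry described above (up to 5 attempts, 2 seconds apart) might look like the following sketch. The helper name `fetch_with_retry` is hypothetical, and the S3 read is abstracted as a callable that returns `None` while the object is not yet visible; in a real handler that callable would wrap the S3 client call and translate a missing-key error into `None`.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], Optional[bytes]],
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch until it returns data, waiting a fixed delay between attempts."""
    for attempt in range(1, attempts + 1):
        body = fetch()
        if body is not None:
            return body
        if attempt < attempts:
            sleep(delay)  # no wait after the final attempt
    raise TimeoutError(f"Results not available after {attempts} attempts")
```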
BdaIDP Sync Feature
The BdaIDP Sync feature provides bidirectional synchronization between BDA (Bedrock Data Automation) blueprints and IDP custom classes. This feature enables seamless integration between BDA's blueprint management system and IDP's document class configuration, with support for AWS Standard blueprints, optimized parallel processing, and configurable Replace or Merge sync modes.
How BdaIDP Sync Works
The sync feature operates through the sync_bda_idp_resolver Lambda function, which orchestrates the synchronization process:
- Flexible Sync Directions: Supports three synchronization directions:
  - bidirectional: Syncs both directions (default, backward compatible)
  - bda_to_idp: Syncs from BDA blueprints to IDP classes only
  - idp_to_bda: Syncs from IDP classes to BDA blueprints only
- Configurable Sync Modes: Each direction supports two modes:
  - replace (default): Full replacement. The target is aligned to match the source exactly, and items not in the source are removed.
  - merge: Additive. Source items are added to the target without removing existing items.
- AWS Standard Blueprint Support: Automatically converts AWS-managed blueprints to custom blueprints
- Schema Transformation: Converts between IDP JSON Schema format and BDA blueprint format
- Change Detection: Only updates when actual schema changes are detected
- Cleanup Management: Removes orphaned blueprints that no longer have corresponding IDP classes (replace mode only)
- Parallel Processing: Uses multi-threading for improved performance with configurable worker count
Sync Process Flow
    graph TD
        A[Sync Request with Direction] --> B{Sync Direction?}
        B -->|bda_to_idp or bidirectional| C[Phase 1: BDA to IDP Sync]
        B -->|idp_to_bda or bidirectional| D[Phase 2: IDP to BDA Sync]

        C --> E[Retrieve BDA Blueprints]
        E --> F{AWS Standard Blueprints?}
        F -->|Yes| G[Convert AWS Blueprints in Parallel]
        F -->|No| H[Load Custom Classes]
        G --> I[Normalize AWS Blueprint Schema]
        I --> J[Transform to IDP Format]
        J --> K[Create Custom Blueprints]
        K --> L[Remove AWS Blueprints from Project]
        L --> M[Save New IDP Classes]

        D --> N[Load IDP Custom Classes]
        N --> O[Retrieve Existing BDA Blueprints]
        O --> P[Process Classes in Parallel]
        P --> Q{Blueprint Exists?}
        Q -->|Yes| R[Check for Changes with DeepDiff]
        Q -->|No| S[Sanitize Property Names]
        R -->|Changes Found| T[Sanitize Property Names]
        R -->|No Changes| U[Skip Update]
        S --> V[Transform to BDA Format]
        T --> V
        V --> W[Create/Update Blueprint]
        W --> X[Create Blueprint Version]
        X --> Y[Update Project]
        Y --> Z[Cleanup Orphaned Blueprints]
        Z --> AA[Save Modified Classes]

        M --> AB[Sync Complete]
        AA --> AB
        U --> AB

Key Sync Features
🔄 Flexible Sync Directions
- Bidirectional (default): Full two-way synchronization between BDA and IDP
- BDA to IDP: One-way sync from BDA blueprints to IDP classes
- IDP to BDA: One-way sync from IDP classes to BDA blueprints
- Configurable via the sync_direction parameter in API calls
🎯 Intelligent Change Detection
- Uses DeepDiff library to compare schemas and detect actual changes
- Only triggers updates when meaningful differences are found
- Prevents unnecessary blueprint versions and API calls
- Compares transformed schemas to ensure accurate change detection
🧹 Automatic Cleanup
- Removes BDA blueprints that no longer have corresponding IDP classes
- Maintains clean blueprint inventory in BDA projects
- Prevents accumulation of obsolete blueprints
- Only runs during idp_to_bda or bidirectional sync
📋 Schema Transformation
- Converts IDP JSON Schema (draft 2020-12) to BDA blueprint format (draft-07)
- Handles field type mapping and structural differences
- Preserves semantic meaning across format conversions
- Bidirectional transformation support for both sync directions
🏢 AWS Standard Blueprint Management
- Automatically detects AWS-managed blueprints in BDA projects
- Converts AWS Standard blueprints to custom blueprints
- Normalizes AWS blueprint schemas to fix common issues
- Creates corresponding IDP classes for AWS blueprints
- Removes AWS blueprints from project after conversion
⚡ Parallel Processing
- Multi-threaded processing for improved performance
- Configurable worker count via the BDA_SYNC_MAX_WORKERS environment variable (default: 5)
- Parallel blueprint creation and updates
- Parallel AWS blueprint conversion
- Thread-safe operations with proper locking mechanisms
🔧 Property Name Sanitization
- Automatically removes special characters from property names
- Ensures BDA compatibility by sanitizing field names
- Maintains mapping of original to sanitized names
- Prevents blueprint creation failures due to invalid characters
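A minimal sketch of property name sanitization with an original-to-sanitized mapping follows. The exact character rules used by the sync service are not stated in this document, so the allowed set here (ASCII letters, digits, underscore, hyphen) is an assumption, and `sanitize_property_names` is a hypothetical helper name. Collisions between sanitized names are not handled.

```python
import re

# Assumed rule: anything outside letters, digits, "_" and "-" is removed
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_property_names(names: list[str]) -> dict[str, str]:
    """Return a mapping of each original name to its sanitized form."""
    return {name: _INVALID_CHARS.sub("", name) for name in names}


mapping = sanitize_property_names(["Tax&Fees", "Net/Gross", "invoice_number"])
for original, sanitized in mapping.items():
    if original != sanitized:
        print(f"renamed {original!r} -> {sanitized!r}")
```

Keeping the mapping around is what allows logs to show which original field each sanitized name came from.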
Sync Direction Configuration
The sync direction can be specified when calling the sync operation:
GraphQL API:

    mutation SyncBdaIdp {
      syncBdaIdp(direction: "bidirectional") {
        success
        message
        processedClasses
        direction
      }
    }

Python API:

    from idp_common.bda.bda_blueprint_service import BdaBlueprintService

    # Initialize service
    service = BdaBlueprintService(
        dataAutomationProjectArn="arn:aws:bedrock:us-west-2:123456789012:project/my-project"
    )

    # Bidirectional sync (default)
    result = service.create_blueprints_from_custom_configuration(
        sync_direction="bidirectional"
    )

    # BDA to IDP only (replace mode - removes IDP classes not in BDA)
    result = service.create_blueprints_from_custom_configuration(
        sync_direction="bda_to_idp",
        sync_mode="replace"
    )

    # BDA to IDP only (merge mode - adds BDA classes, keeps existing IDP classes)
    result = service.create_blueprints_from_custom_configuration(
        sync_direction="bda_to_idp",
        sync_mode="merge"
    )

    # IDP to BDA only (replace mode - removes BDA blueprints not in IDP)
    result = service.create_blueprints_from_custom_configuration(
        sync_direction="idp_to_bda",
        sync_mode="replace"
    )

    # IDP to BDA only (merge mode - adds IDP classes to BDA, keeps BDA-only blueprints)
    result = service.create_blueprints_from_custom_configuration(
        sync_direction="idp_to_bda",
        sync_mode="merge"
    )

Sync Mode Behavior:
| Direction | Mode | Behavior |
|---|---|---|
| bda_to_idp | replace (default) | IDP classes are replaced with BDA blueprints. Classes not in BDA are removed. |
| bda_to_idp | merge | BDA blueprints are added to IDP. Existing IDP classes are kept. |
| idp_to_bda | replace (default) | BDA blueprints are replaced with IDP classes. Orphaned blueprints are deleted. |
| idp_to_bda | merge | IDP classes are pushed to BDA. Existing BDA-only blueprints are kept. |
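The difference between the two modes in the table above can be shown with plain dictionaries keyed by class name. This is a conceptual sketch only: `apply_sync` is a hypothetical function, and the choice that a source item overwrites a same-named target item in merge mode is an assumption the document does not confirm.

```python
def apply_sync(source: dict, target: dict, mode: str = "replace") -> dict:
    """Return the target contents after syncing from source in the given mode."""
    if mode == "replace":
        # Target is aligned to the source; target-only items are removed
        return dict(source)
    if mode == "merge":
        # Target-only items are kept; source items are added (source wins on conflict)
        merged = dict(target)
        merged.update(source)
        return merged
    raise ValueError(f"Unknown sync mode: {mode}")


bda = {"Invoice": "v2", "W4Form": "v1"}
idp = {"Invoice": "v1", "Payslip": "v1"}
print(apply_sync(bda, idp, "replace"))  # Payslip is removed
print(apply_sync(bda, idp, "merge"))    # Payslip is kept
```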
Environment Configuration:
    # Configure maximum parallel workers (default: 5)
    BDA_SYNC_MAX_WORKERS=10

AWS Standard Blueprint Conversion
The sync feature includes automatic conversion of AWS Standard blueprints to custom blueprints:
Conversion Process:
- Detection: Identifies AWS-managed blueprints (containing aws:blueprint in ARN)
- Normalization: Fixes common issues in AWS blueprint schemas:
  - Adds missing $schema field (draft-07)
  - Adds missing type fields to root and definitions
  - Adds missing instruction fields to $ref properties
  - Fixes array items with BDA-specific fields
  - Fixes double-escaped quotes in instruction strings
- Transformation: Converts normalized BDA schema to IDP class format
- Blueprint Creation: Creates new custom blueprint from transformed schema
- Project Update: Removes AWS blueprint and adds custom blueprint to project
- Configuration Save: Saves new IDP class to configuration table
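The normalization step in the process above can be approximated with a few dictionary operations. The sketch below is an assumption-laden illustration, not the service's code: `normalize_aws_blueprint` is a hypothetical name, the `"-"` placeholder instruction mirrors the normalization example in this document, and the double-escaped quote fix is omitted.

```python
import copy

DRAFT_07 = "http://json-schema.org/draft-07/schema#"


def normalize_aws_blueprint(schema: dict) -> dict:
    """Return a normalized copy of an AWS blueprint schema (input is not modified)."""
    result = copy.deepcopy(schema)
    result.setdefault("$schema", DRAFT_07)  # add missing $schema
    result.setdefault("type", "object")     # add missing root type
    for definition in result.get("definitions", {}).values():
        definition.setdefault("type", "object")  # add missing definition type
        for prop in definition.get("properties", {}).values():
            if "$ref" in prop:
                prop.setdefault("instruction", "-")  # $ref needs an instruction
            if prop.get("type") == "array":
                items = prop.get("items", {})
                # Move inferenceType to the array level and drop item-level instruction
                if "inferenceType" in items:
                    prop.setdefault("inferenceType", items.pop("inferenceType"))
                items.pop("instruction", None)
                prop.setdefault("instruction", "-")
    return result
```

Working on a deep copy keeps the original blueprint schema available for logging or comparison after normalization.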
Schema Normalization Examples:
    # Before normalization (AWS blueprint)
    {
      "definitions": {
        "Address": {
          "properties": {
            "street": {
              "$ref": "#/definitions/Street"  # Missing instruction field
            },
            "items": {
              "type": "array",
              "items": {
                "type": "string",
                "inferenceType": "explicit",  # Should not be in items
                "instruction": "Item description"
              }
            }
          }
        }
      }
    }

    # After normalization
    {
      "$schema": "http://json-schema.org/draft-07/schema#",  # Added
      "type": "object",  # Added
      "definitions": {
        "Address": {
          "type": "object",  # Added
          "properties": {
            "street": {
              "$ref": "#/definitions/Street",
              "instruction": "-"  # Added
            },
            "items": {
              "type": "array",
              "instruction": "-",  # Added
              "inferenceType": "explicit",  # Moved to array level
              "items": {
                "type": "string"  # Cleaned up
              }
            }
          }
        }
      }
    }

Parallel Conversion:
- AWS blueprints are converted in parallel using ThreadPoolExecutor
- Configurable worker count (default: min(3, BDA_SYNC_MAX_WORKERS))
- Thread-safe operations with proper locking
- Skips blueprints that already have corresponding IDP classes
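The worker-count rule and the thread-safe parallel conversion described above can be sketched with the standard library. The function names are hypothetical and the `convert` callable stands in for the real per-blueprint conversion; only the `BDA_SYNC_MAX_WORKERS` variable, its default of 5, and the `min(3, max_workers)` rule come from this document.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Mapping, Optional


def worker_counts(env: Optional[Mapping[str, str]] = None) -> tuple[int, int]:
    """Return (sync workers, AWS conversion workers) from the environment."""
    env = os.environ if env is None else env
    max_workers = int(env.get("BDA_SYNC_MAX_WORKERS", "5"))
    return max_workers, min(3, max_workers)


def convert_in_parallel(
    blueprints: Iterable[str],
    convert: Callable[[str], str],
    existing_classes: set[str],
    workers: int,
) -> dict[str, str]:
    """Convert blueprints concurrently, skipping those that already have IDP classes."""
    results: dict[str, str] = {}
    lock = threading.Lock()

    def task(name: str) -> None:
        converted = convert(name)
        with lock:  # guard the shared results dictionary
            results[name] = converted

    pending = [name for name in blueprints if name not in existing_classes]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, pending))  # consume the iterator to surface exceptions
    return results
```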
Limitations and Constraints
BDA Schema Limitations
Nested Objects Not Supported: BDA currently has limitations with complex nested structures that affect sync operations:

    // ❌ NOT SUPPORTED: Nested objects within objects
    {
      "employee": {
        "type": "object",
        "properties": {
          "personalInfo": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "address": {"type": "string"}
            }
          }
        }
      }
    }

    // ✅ SUPPORTED: Flat object structure
    {
      "employee": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "address": {"type": "string"},
          "department": {"type": "string"}
        }
      }
    }

Nested Arrays Not Supported: Arrays within object definitions are not supported by BDA:

    // ❌ NOT SUPPORTED: Arrays within object definitions
    {
      "Employee": {
        "type": "object",
        "properties": {
          "shifts": {
            "type": "array",
            "items": {"$ref": "#/$defs/Shift"}
          }
        }
      }
    }

    // ✅ SUPPORTED: Top-level arrays
    {
      "employees": {
        "type": "array",
        "items": {"$ref": "#/$defs/Employee"}
      }
    }

AWS Standard Blueprint Handling
- AWS-provided blueprints (identifiable by aws:blueprint in ARN) are read-only
- During bda_to_idp or bidirectional sync, AWS blueprints are automatically converted to custom blueprints, transformed into IDP classes, and removed from the BDA project after successful conversion
- Conversion only occurs if no corresponding IDP class exists
- Failed conversions are logged but don’t stop the sync process
Sync Performance
Multi-Threading Configuration:

    # Configure in environment
    BDA_SYNC_MAX_WORKERS=10  # Default: 5

    # Processing breakdown:
    # - IDP to BDA sync: Uses max_workers threads
    # - AWS blueprint conversion: Uses min(3, max_workers) threads
    # - Thread-safe operations with locking mechanisms

Performance Characteristics:
| Operation | Processing Mode | Default Workers | Typical Time |
|---|---|---|---|
| IDP to BDA Sync | Parallel | 5 | 2-5s per class |
| AWS Blueprint Conversion | Parallel | 3 | 3-7s per blueprint |
| Change Detection | Sequential | N/A | <1s per class |
| Schema Transformation | Sequential | N/A | <1s per class |
Optimization Tips:
- Increase BDA_SYNC_MAX_WORKERS for large numbers of classes (10-20 recommended)
- Monitor CloudWatch logs for thread execution times
- Consider sync direction to avoid unnecessary operations
- Use idp_to_bda when only updating blueprints from IDP changes
- Use bda_to_idp when only importing AWS blueprints or BDA changes
Best Practices for Sync
1. Choose Appropriate Sync Direction:

    # After modifying IDP classes in UI
    service.create_blueprints_from_custom_configuration(
        sync_direction="idp_to_bda"  # Only update BDA blueprints
    )

    # After adding AWS Standard blueprints to BDA project
    service.create_blueprints_from_custom_configuration(
        sync_direction="bda_to_idp"  # Only import to IDP
    )

    # For complete synchronization
    service.create_blueprints_from_custom_configuration(
        sync_direction="bidirectional"  # Full two-way sync
    )

2. Use Simplified IDP Schemas:
Flatten complex structures: avoid nested objects and place arrays at the top level only:

    {
      "properties": {
        "employees": {
          "type": "array",
          "description": "List of employees",
          "items": {"$ref": "#/$defs/Employee"}
        }
      },
      "$defs": {
        "Employee": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "id": {"type": "string"}
          }
        }
      }
    }

3. Pre-Sync Validation Checklist:
- ✅ No nested objects within object definitions
- ✅ No arrays within object definitions
- ✅ All array properties have description or instruction fields
- ✅ Field names follow BDA naming conventions (no special characters like &, /)
- ✅ Schema uses supported data types (string, number, boolean)
- ✅ Property names are less than 60 characters
- ✅ No double-escaped quotes in instruction strings
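Several items on the checklist above lend themselves to an automated pre-sync check. The sketch below is hypothetical (`validate_for_bda` is not part of the module) and covers only a subset: name length, the two special characters named in the checklist, array descriptions at the root, and nested objects or arrays inside `$defs` entries.

```python
import re

MAX_NAME_LENGTH = 60
# Only the characters named in the checklist; the real rules may be stricter
INVALID_NAME_CHARS = re.compile(r"[&/]")


def validate_for_bda(schema: dict) -> list[str]:
    """Return a list of checklist violations (empty when none are found)."""
    problems: list[str] = []

    def check_names(properties: dict, where: str) -> None:
        for name in properties:
            if len(name) >= MAX_NAME_LENGTH:
                problems.append(f"{where}.{name}: name must be shorter than 60 characters")
            if INVALID_NAME_CHARS.search(name):
                problems.append(f"{where}.{name}: name contains special characters")

    root_props = schema.get("properties", {})
    check_names(root_props, "root")
    for name, prop in root_props.items():
        has_text = prop.get("description") or prop.get("instruction")
        if prop.get("type") == "array" and not has_text:
            problems.append(f"root.{name}: array needs a description or instruction")

    for def_name, definition in schema.get("$defs", {}).items():
        def_props = definition.get("properties", {})
        check_names(def_props, def_name)
        for name, prop in def_props.items():
            if prop.get("type") == "object":
                problems.append(f"{def_name}.{name}: nested object inside a definition")
            elif prop.get("type") == "array":
                problems.append(f"{def_name}.{name}: array inside a definition")
    return problems
```

Running such a check before a sync surfaces problems as a single list instead of one log warning per skipped property.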
4. Schema Design — Recommended Pattern:
    {
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "$id": "SimpleInvoice",
      "type": "object",
      "description": "Simple invoice document",
      "properties": {
        "invoiceNumber": {
          "type": "string",
          "description": "Invoice number"
        },
        "invoiceDate": {
          "type": "string",
          "description": "Invoice date"
        },
        "lineItems": {
          "type": "array",
          "description": "Invoice line items",
          "items": {"$ref": "#/$defs/LineItem"}
        }
      },
      "$defs": {
        "LineItem": {
          "type": "object",
          "properties": {
            "description": {"type": "string", "description": "Item description"},
            "quantity": {"type": "number", "description": "Item quantity"},
            "unitPrice": {"type": "number", "description": "Unit price"},
            "totalPrice": {"type": "number", "description": "Total price"}
          }
        }
      }
    }

Troubleshooting Sync Issues
Common Error Patterns:
| Error | Cause | Solution |
|---|---|---|
| Skipping nested object property 'personalInfo' | Nested objects not supported by BDA | Flatten the object structure into individual fields |
| Array property missing required 'instruction' field | Missing metadata | Handled automatically (defaults to "-") |
| BDA schema validation failed | Invalid schema format | Ensure schema follows BDA draft-07 requirements |
| Property name contains invalid characters | Special characters in names | Automatically sanitized; check logs for name mapping |
| Failed to normalize AWS blueprint schema | Unsupported AWS blueprint structures | Check AWS blueprint format; may need manual intervention |
| Thread execution error during parallel processing | Parallel processing failure | Check CloudWatch logs; consider reducing BDA_SYNC_MAX_WORKERS |
Debugging Steps:
- Check CloudWatch Logs: Review sync_bda_idp_resolver and BdaBlueprintService logs for detailed error messages, thread execution logs, and property name sanitization mappings.
- Validate Schema Structure: Use JSON Schema validators; look for nested objects/arrays in definitions; check for special characters in property names.
- Test with Simplified Schema: Start with a minimal test schema, verify sync works, then gradually add complexity.
- Verify Sync Direction: Confirm the correct direction for your use case; test each direction independently if issues occur.
Best Practices
Document Selection
Choose Representative Samples:
- Select documents that represent typical variations
- Include both simple and complex examples
- Ensure all important sections are represented
- Use high-quality, clear document images
Document Quality Guidelines:
- Resolution: Minimum 150 DPI for text clarity
- Format: PDF preferred, high-quality images acceptable
- Completeness: Include all pages of multi-page documents
- Legibility: Ensure text is readable and not corrupted
Sample Size Recommendations:
- Single Document: Good for initial exploration
- 2-3 Documents: Better for understanding variations
- 5+ Documents: Optimal for comprehensive analysis
- Different Layouts: Include various form versions if available
Ground Truth Preparation
JSON Format Requirements:

    {
      "document_class": "FormName",
      "field_name_1": "expected_value_1",
      "field_name_2": "expected_value_2",
      "nested_object": {
        "sub_field": "sub_value"
      },
      "array_field": ["item1", "item2"]
    }

Best Practices:
- Field Names: Use descriptive, consistent naming conventions
- Data Types: Include examples of all expected data types
- Completeness: Cover all important fields in the document
- Accuracy: Ensure ground truth data is correct and validated
- Structure: Reflect the logical organization of document fields
Ground Truth Sources:
- Existing database schemas or data models
- Manual annotation of sample documents
- Previous extraction results (validated)
- Domain expert knowledge and requirements
Configuration Tuning
Model Selection Guidelines:
- Nova Lite: Simple forms with clear structure
- Nova Pro: Most document types (recommended default)
- Nova Premier: Complex layouts, handwritten content
- Claude Models: Alternative for specific use cases
Parameter Optimization:
    # For consistent, structured output
    discovery:
      without_ground_truth:
        temperature: 0.0  # Low creativity
        top_p: 0.1        # Focused sampling

    # For creative structure discovery
    discovery:
      without_ground_truth:
        temperature: 1.0  # High creativity
        top_p: 0.3        # Diverse sampling

Prompt Engineering Tips:
- Be Specific: Clearly define expected field types and formats
- Include Examples: Show desired output structure in prompts
- Set Constraints: Specify field naming conventions and limitations
- Domain Context: Include relevant domain knowledge and terminology
Iterative Improvement:
- Start with default configuration
- Run discovery on sample documents
- Review and validate results
- Adjust prompts and parameters based on findings
- Re-run discovery to validate improvements
- Repeat until satisfactory results are achieved
Troubleshooting
Common Issues
Issue: Discovery Job Fails to Start
Symptoms: Job status remains "PENDING" or shows "FAILED" immediately
Causes:
- Invalid document format or corrupted file
- Insufficient permissions for S3 access
- Missing or invalid configuration
Solutions:
- Verify document is valid PDF or supported image format
- Check S3 bucket permissions and access policies
- Validate configuration syntax and required fields
- Review CloudWatch logs for specific error messages

Issue: BDA Blueprint Creation Fails (BDA mode only)
Symptoms: Discovery completes but BDA blueprints are not created/updated
Causes:
- Missing BDA permissions
- Invalid BDA project ARN
- Schema conversion errors
- BDA service throttling
Solutions:
- Verify BDA permissions in IAM role
- Check BDA project ARN in stack parameters
- Review BDA Discovery Function logs
- Implement retry logic for throttling
- Validate generated schema format

Issue: Configuration Events Not Processing
Symptoms: Discovery completes but configuration updates don't occur
Causes:
- SQS queue configuration issues
- Lambda function errors
- Event routing problems
- Permission issues
Solutions:
- Check SQS queue visibility and permissions
- Review Lambda function logs
- Verify event source mappings
- Validate IAM permissions for event processing

Issue: Poor Field Detection Quality
Symptoms: Missing fields, incorrect field types, poor grouping
Causes:
- Document quality issues (low resolution, poor scan)
- Inappropriate model selection for document complexity
- Generic prompts not suited for document type
Solutions:
- Use higher resolution documents (minimum 150 DPI)
- Try different models (Nova Premier for complex documents)
- Customize prompts with domain-specific terminology
- Provide ground truth data for better guidance

Issue: Inconsistent Results Between Runs
Symptoms: Different field structures on repeated analysis
Causes:
- High temperature setting causing creative variation
- Ambiguous document structure or layout
- Insufficient prompt constraints
Solutions:
- Reduce temperature to 0.0 for consistent results
- Add more specific constraints in user prompts
- Use ground truth data to establish expected patterns
- Include examples in prompts for guidance

Issue: Timeout or Performance Problems
Symptoms: Jobs taking too long or timing out
Causes:
- Large documents exceeding processing limits
- Complex layouts requiring extensive analysis
- Model capacity or throttling issues
Solutions:
- Split large documents into smaller sections
- Use faster models (Nova Lite) for initial analysis
- Implement retry logic with exponential backoff
- Consider preprocessing to simplify document structure

Error Handling
Configuration Validation:
    # Model IDs accepted by this check; extend as needed
    SUPPORTED_MODELS = [
        'us.amazon.nova-lite-v1:0',
        'us.amazon.nova-pro-v1:0',
        'us.amazon.nova-premier-v1:0',
        # ... other supported models
    ]

    def validate_discovery_config(config):
        """Validate discovery configuration before processing."""
        required_fields = ['model_id', 'system_prompt', 'user_prompt']

        for scenario in ['without_ground_truth', 'with_ground_truth']:
            scenario_config = config.get('discovery', {}).get(scenario, {})
            for field in required_fields:
                if not scenario_config.get(field):
                    raise ValueError(f"Missing required field: {field} in {scenario}")

            # Validate the model ID inside the loop so both scenarios are checked
            model_id = scenario_config.get('model_id')
            if model_id not in SUPPORTED_MODELS:
                raise ValueError(f"Unsupported model: {model_id} in {scenario}")

Graceful Degradation:
    import logging

    logger = logging.getLogger(__name__)

    # ClassesDiscovery and get_fallback_config() are assumed to be
    # imported or defined elsewhere in the calling module.
    def discovery_with_fallback(discovery_service, input_bucket, document_key, ground_truth_key=None):
        """Attempt discovery with fallback strategies."""
        try:
            # Try with ground truth if available
            if ground_truth_key:
                return discovery_service.discovery_classes_with_document_and_ground_truth(
                    input_bucket, document_key, ground_truth_key
                )
            else:
                return discovery_service.discovery_classes_with_document(
                    input_bucket, document_key
                )
        except Exception as e:
            logger.warning(f"Discovery failed with error: {e}")

            # Fallback to simpler model or configuration
            fallback_config = get_fallback_config()
            fallback_service = ClassesDiscovery(
                input_bucket=input_bucket,
                input_prefix=document_key,
                config=fallback_config
            )

            return fallback_service.discovery_classes_with_document(
                input_bucket, document_key
            )

Limitations
Known Limitations
Configuration Table:
- The Discovery feature stores all custom classes as an array in the Configuration table under the "custom" key.
- DynamoDB has a hard limit of 400 KB per item. Resolved: Configuration data is now gzip-compressed before being stored in DynamoDB, achieving 37-95x compression ratios. Configurations with 3,000+ document classes fit comfortably within the 400 KB limit.
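The compress-before-store approach can be sketched with the standard library alone. The helper names below are hypothetical and the DynamoDB write itself is not shown; the size constant approximates the 400 KB item limit and ignores attribute-name overhead.

```python
import gzip
import json

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024  # approximate per-item limit


def compress_config(config: dict) -> bytes:
    """Serialize a configuration to compact JSON and gzip it."""
    raw = json.dumps(config, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decompress_config(blob: bytes) -> dict:
    """Reverse compress_config."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


def fits_in_item(blob: bytes) -> bool:
    """Check the compressed payload against the approximate item limit."""
    return len(blob) < DYNAMODB_ITEM_LIMIT_BYTES


config = {"custom": [{"name": f"Class{i}", "description": "Sample class"} for i in range(500)]}
blob = compress_config(config)
print(len(json.dumps(config)), "bytes raw,", len(blob), "bytes compressed")
```

Class definitions are highly repetitive JSON, which is why gzip tends to shrink them substantially; the exact ratio depends on the configuration.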
Discovery Output Format:
- Output format is configured via View/Edit Configuration. JSON format should follow custom classes format.
- Output in any other format will result in failure.
Production Usage:
- We recommend against using the Discovery module in production, to reduce the risk of hallucination during document discovery.
- We recommend using the Discovery module in a lower environment to discover and construct configurations, then exporting the tested configuration to the production deployment.
The Discovery module provides a powerful foundation for understanding and processing new document types. By following these guidelines and best practices, you can effectively leverage the module to bootstrap document processing workflows and continuously improve their accuracy and coverage.