Evaluation Framework
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
The GenAIIDP solution includes a built-in evaluation framework to assess the accuracy of document processing outputs. This allows you to:
- Compare processing outputs against baseline (ground truth) data
- Generate detailed evaluation reports using configurable methods and thresholds
- Track and improve processing accuracy over time
Enhanced Reporting (stickler-eval v0.1.4+)
NEW in v0.4.9+: The evaluation module now includes enhanced reporting with fine-grained field comparison details. See evaluation-enhanced-reporting.md for:
- 📊 Nested field comparison details - See individual field matches within complex objects and arrays
- 🔍 Interactive report controls - Filter to show only unmatched rows for focused debugging
- 📋 Expandable details - Drill down into nested comparisons with expand/collapse controls
- 🎯 Aggregate score annotations - Clearly distinguish aggregate scores from simple field scores
Key Benefits:
- Quickly identify which specific nested fields cause mismatches
- Compact problem view focusing on failed matches
- Both high-level overview and granular diagnostics in one report
- Full backward compatibility with existing configurations
For complete documentation and examples, see evaluation-enhanced-reporting.md.
Table of Contents
- Evaluation Framework
- Enhanced Reporting (stickler-eval v0.1.4+)
- Stickler Evaluation Engine
- How It Works
- Dynamic Schema Generation
- Evaluation Methods
- Field Weighting for Business Criticality
- Type Coercion and Data Compatibility
- Assessment Confidence Integration
- Configuration
- Attribute Processing and Evaluation
- Document Split Classification Metrics
- Setup and Usage
- Best Practices
- Automatic Field Discovery
- Semantic vs LLM Evaluation
- Metrics and Monitoring
- Aggregate Evaluation Analytics and Reporting
- Migration from Legacy Evaluation
- Troubleshooting Evaluation Issues
Stickler Evaluation Engine
The evaluation framework is powered by Stickler, AWS’s open-source library for structured object comparison. Stickler provides:
- Field-Level Weighting: Prioritize accuracy on business-critical fields
- Optimal List Matching: Hungarian algorithm for comparing arrays of objects
- Extensible Comparators: Support for exact, fuzzy, numeric, semantic, and LLM-based comparison
- Native JSON Schema Support: Direct use of JSON Schema with custom extensions
The IDP solution uses a feature branch of Stickler (commit: de7d0fda) that adds JSON Schema construction support. This will migrate to the main branch once PR #20 merges.
Architecture

    flowchart TD
        IDP[IDP Configuration] --> Mapper[SticklerConfigMapper]
        Mapper --> Schema[JSON Schema with Extensions]
        Schema --> Stickler[Stickler StructuredModel]
        Baseline[Baseline Data] --> Stickler
        Actual[Actual Results] --> Stickler
        Stickler --> Comparison[Field-by-Field Comparison]
        Comparison --> Results[Evaluation Results]

The SticklerConfigMapper translates IDP’s evaluation extensions (x-aws-idp-evaluation-*) to Stickler’s format (x-aws-stickler-*), maintaining independence from any specific evaluation backend.
How It Works
- Baseline Data
  - Store validated baseline data in a dedicated S3 bucket
  - Use an existing bucket or let the solution create one
  - Can use outputs from another GenAIIDP stack to compare different patterns/prompts
- Integrated Evaluation Step
  - Evaluation runs as the final step in the Step Functions workflow (after summarization)
  - Executes before the workflow marks documents as COMPLETE, eliminating race conditions
  - When evaluation.enabled: true in configuration, evaluates against baseline data if available
  - When evaluation.enabled: false in configuration, step executes but skips processing
  - Generates detailed markdown reports using AI analysis
- Evaluation Reports
  - Compare section classification accuracy
  - Analyze extracted field differences
  - Identify patterns in discrepancies
  - Assess severity of differences (cosmetic vs. substantial)
Dynamic Schema Generation
When running evaluation without explicit configuration for a document class, the evaluation service can automatically generate a default document schema based on the structure of the expected (baseline) document. This enables evaluation even when document classes haven’t been formally configured.
How It Works

    flowchart TD
        Start[Start Evaluation] --> CheckConfig{Configuration<br/>Exists?}

        CheckConfig -->|Yes| UseConfig[Use Explicit Config]
        CheckConfig -->|No| LoadExpected[Load Expected Results]

        LoadExpected --> InferSchema[Infer Schema from<br/>Expected Structure]
        InferSchema --> GenerateConfig[Generate Default<br/>Stickler Config]
        GenerateConfig --> LogWarning[Log Warning:<br/>Using Auto-Generated Schema]
        LogWarning --> CacheConfig[Cache for Session]

        UseConfig --> Compare[Compare Results]
        CacheConfig --> Compare

        Compare --> Annotate[Annotate Results with<br/>Auto-Generation Notice]
        Annotate --> Results[Return Evaluation Results]

Type Inference Rules
The auto-generation system infers evaluation methods based on detected data types:
| Data Type | Evaluation Method | Default Threshold | Use Case |
|---|---|---|---|
| string | FUZZY | 0.85 | Text fields, names, addresses |
| integer | NUMERIC_EXACT | 0.01 | Counts, IDs, whole numbers |
| float | NUMERIC_EXACT | 0.01 | Amounts, percentages, decimals |
| boolean | EXACT | N/A | True/false flags |
| object | Nested structure | N/A | Address, contact info (recursive) |
| array[object] | HUNGARIAN | N/A | Transactions, line items (optimal matching) |
| array[primitive] | Simple array | N/A | Tags, categories, lists |
| null | EXACT (string) | N/A | Optional fields, missing values |
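
The inference rules in the table above can be sketched as a small recursive function. This is a minimal illustration, not the solution's actual code: the function name `infer_evaluation_rule`, the returned dictionary shape, and the labels `NESTED` and `SIMPLE_ARRAY` are hypothetical, and treating an empty array as a simple array is an assumption.

```python
from typing import Any


def infer_evaluation_rule(value: Any) -> dict:
    """Map a baseline value to an evaluation rule, following the table above.

    Hypothetical helper: the return shape is chosen for illustration only.
    """
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool) or value is None:
        return {"method": "EXACT", "threshold": None}
    if isinstance(value, (int, float)):
        return {"method": "NUMERIC_EXACT", "threshold": 0.01}
    if isinstance(value, str):
        return {"method": "FUZZY", "threshold": 0.85}
    if isinstance(value, dict):
        # Nested structure: infer a rule for each property recursively.
        return {
            "method": "NESTED",
            "properties": {k: infer_evaluation_rule(v) for k, v in value.items()},
        }
    if isinstance(value, list):
        # Arrays of objects use optimal matching; other arrays are simple lists.
        if value and all(isinstance(item, dict) for item in value):
            return {"method": "HUNGARIAN", "threshold": None}
        return {"method": "SIMPLE_ARRAY", "threshold": None}
    raise TypeError(f"Unsupported baseline value type: {type(value).__name__}")


baseline = {
    "invoice_number": "INV-12345",
    "amount": 1250.50,
    "line_items": [{"description": "Widget", "price": 10.50}],
}
for field, value in baseline.items():
    print(field, infer_evaluation_rule(value)["method"])
```

Checking `bool` before the numeric types matters in Python, since `True` would otherwise be classified as an integer.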
Auto-Generated Schema Example
Input Data:

    {
      "invoice_number": "INV-12345",
      "amount": 1250.50,
      "customer_address": {
        "street": "123 Main St",
        "city": "Seattle"
      },
      "line_items": [
        {"description": "Widget", "price": 10.50}
      ]
    }

Generated Schema:

    {
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "type": "object",
      "x-aws-idp-document-type": "Invoice",
      "x-aws-idp-evaluation-match-threshold": 0.8,
      "properties": {
        "invoice_number": {
          "type": "string",
          "x-aws-idp-evaluation-method": "FUZZY",
          "x-aws-idp-evaluation-threshold": 0.85
        },
        "amount": {
          "type": "number",
          "x-aws-idp-evaluation-method": "NUMERIC_EXACT",
          "x-aws-idp-evaluation-threshold": 0.01
        },
        "customer_address": {
          "type": "object",
          "properties": {
            "street": {
              "type": "string",
              "x-aws-idp-evaluation-method": "FUZZY"
            },
            "city": {
              "type": "string",
              "x-aws-idp-evaluation-method": "FUZZY"
            }
          }
        },
        "line_items": {
          "type": "array",
          "x-aws-idp-evaluation-method": "HUNGARIAN",
          "items": {
            "type": "object",
            "properties": {
              "description": {"type": "string"},
              "price": {"type": "number"}
            }
          }
        }
      }
    }

Result Annotation
When using auto-generated schemas, all evaluation results include a clear annotation in the reason field:
With Auto-Generated Schema:

    "reason": "Exact match. Note: Schema inferred (no config)"

With Explicit Configuration:

    "reason": "Exact match"

This transparency ensures users can immediately identify which evaluations used carefully crafted configurations versus quick defaults.
When to Use Auto-Generation
✅ Recommended For:
- Exploratory Testing: Quick evaluation without config setup
- Development Iteration: Fast feedback during development
- Schema Discovery: Understanding document structure before formalizing config
- Ad-hoc Analysis: One-time evaluations without production setup
⚠️ Not Recommended For:
- Production Workloads: Lacks domain-specific comparison logic
- Complex Requirements: Cannot encode business rules or custom thresholds
- Regulatory Compliance: Explicit configs provide audit trail and governance
Logging and Monitoring
Auto-generation events are logged with WARNING level for visibility:

    WARNING: Auto-generated schema for document class 'Invoice' from expected data structure.
    For production use, please define an explicit configuration. Generated 8 properties.

Implementation Details
The dynamic schema generation uses:
- genson library: Industry-standard, production-ready JSON Schema generator
- Automatic type detection: Handles strings, numbers, booleans, nested objects, arrays
- Union type support: Correctly handles mixed-type fields (e.g., ["string", "integer"])
- Robust edge case handling: Empty arrays, null values, deeply nested structures
Evaluation Methods
The framework supports multiple comparison methods. All method names in evaluation reports use PascalCase formatting (e.g., Fuzzy, NumericExact) to match the UI configuration interface.
Supported Methods and Their Characteristics
The evaluation framework provides different comparison methods optimized for various data types and use cases:
| Method | Use Case | Uses Threshold? | Default Value | Description |
|---|---|---|---|---|
| Exact | IDs, codes, booleans | No | N/A | Character-by-character comparison after normalizing whitespace and punctuation. Returns 1.0 for match, 0.0 otherwise |
| NumericExact | Amounts, quantities | No (uses tolerance) | N/A | Numeric comparison with configurable tolerance for rounding differences. Uses absolute/relative tolerance, not similarity threshold |
| Fuzzy | Names, addresses, general text | Yes | 0.7 | Token-based fuzzy matching allowing minor variations and reordering. Threshold controls minimum similarity score |
| Levenshtein | Text with typos, variations | Yes | 0.7 | Edit distance-based string comparison for detecting character-level differences. Threshold controls minimum similarity score |
| Semantic | Descriptions, free text | Yes | 0.7 | Embedding-based similarity using Bedrock Titan embeddings for meaning comparison. Threshold controls minimum similarity score |
| LLM | Complex semantic equivalence | No (binary) | N/A | AI-powered comparison with detailed reasoning. Returns binary match decision (1.0 or 0.0), not similarity score |
| Hungarian | Arrays of structured objects | Yes (match_threshold) | 0.8 | Optimal bipartite matching algorithm for list comparison. Uses document-level match threshold for item pairing |
| AggregateObject | Nested objects | No | N/A | Recursive field-by-field comparison of nested structures. No top-level threshold |
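
To make the difference between binary, tolerance-based, and threshold-based methods concrete, here is a minimal sketch of three of the comparators in plain Python. It is not Stickler's implementation. The function names are hypothetical, stripping all whitespace and punctuation is one assumed reading of "normalizing", and the numeric comparison assumes an absolute tolerance of 0.01.

```python
import math
import string

# Characters removed before an exact comparison (assumed normalization).
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)


def exact_score(expected: str, actual: str) -> float:
    """Exact: 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if str(expected).translate(_DROP) == str(actual).translate(_DROP) else 0.0


def numeric_exact_score(expected, actual, tolerance: float = 0.01) -> float:
    """NumericExact: 1.0 if the values agree within an absolute tolerance."""
    close = math.isclose(float(expected), float(actual), rel_tol=0.0, abs_tol=tolerance)
    return 1.0 if close else 0.0


def levenshtein_similarity(a: str, b: str) -> float:
    """Levenshtein: edit distance normalized to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def meets_threshold(score: float, threshold: float = 0.7) -> bool:
    """Threshold-based methods pass when the similarity reaches the cutoff."""
    return score >= threshold


print(exact_score("INV-12345", "INV 12345"))
print(numeric_exact_score("1250.50", 1250.5))
print(meets_threshold(levenshtein_similarity("Seattle", "Seatle")))
```

Only the last comparator produces a graded score, which is why only methods of that kind take a similarity threshold.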
Threshold Display in Reports
Evaluation reports display thresholds only for methods that use similarity-based scoring:
Methods WITH Threshold Display:
- Fuzzy (threshold: 0.70) - Using Stickler’s default
- Fuzzy (threshold: 0.85) - Using field-specific configuration
- Levenshtein (threshold: 0.70) - Using default
- Semantic (threshold: 0.70) - Using default
- Hungarian (threshold: 0.80) - Using document-level match threshold
Methods WITHOUT Threshold Display:
- Exact - Binary comparison (no threshold concept)
- NumericExact - Uses tolerance, not threshold
- LLM - Returns binary match decision
- AggregateObject - Recursive comparison
This distinction ensures clarity: thresholds are shown only when they represent a similarity score cutoff, not for binary or tolerance-based comparisons.
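
The display rule can be sketched as a small formatting helper. This is an illustration only: `format_method_label` and `THRESHOLD_METHODS` are hypothetical names, not part of the solution's reporting code.

```python
from typing import Optional

# Methods whose threshold is a similarity cutoff, per the lists above.
THRESHOLD_METHODS = {"Fuzzy", "Levenshtein", "Semantic", "Hungarian"}


def format_method_label(method: str, threshold: Optional[float] = None) -> str:
    """Return the method label, appending the threshold only when it applies."""
    if method in THRESHOLD_METHODS and threshold is not None:
        return f"{method} (threshold: {threshold:.2f})"
    # Binary, tolerance-based, and recursive methods show the name only.
    return method


print(format_method_label("Fuzzy", 0.85))
print(format_method_label("NumericExact", 0.01))
```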
Field Weighting for Business Criticality
Stickler supports field-level weights to prioritize accuracy on business-critical fields. Fields with higher weights contribute more to the overall evaluation score, allowing you to maintain high standards for critical data while being more tolerant of errors in less important fields.
Configuration
Add the x-aws-idp-evaluation-weight extension to any field in your schema:

    classes:
      - $schema: "https://json-schema.org/draft/2020-12/schema"
        x-aws-idp-document-type: invoice
        properties:
          invoice_number:
            type: string
            x-aws-idp-evaluation-method: EXACT
            x-aws-idp-evaluation-weight: 2.0  # Critical field - double weight
          invoice_date:
            type: string
            x-aws-idp-evaluation-method: FUZZY
            x-aws-idp-evaluation-weight: 1.5  # Important field
          vendor_name:
            type: string
            x-aws-idp-evaluation-method: FUZZY
            x-aws-idp-evaluation-weight: 1.0  # Normal weight (default)
          vendor_notes:
            type: string
            x-aws-idp-evaluation-method: SEMANTIC
            x-aws-idp-evaluation-weight: 0.5  # Less critical - half weight

Weighted Score Calculation
The evaluation framework calculates a weighted_overall_score that reflects field importance:
- Section-level: Aggregates weighted scores across all fields in the section
- Document-level: Averages section-level weighted scores across all sections
- Displayed in reports: Shows alongside traditional accuracy metrics
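
A minimal sketch of this aggregation, assuming the section-level score is a weighted average of field scores and the document-level score is the plain mean of section scores. The function names are hypothetical and the code is not taken from the solution.

```python
from typing import List, Tuple


def section_weighted_score(fields: List[Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs for one section."""
    total_weight = sum(weight for _, weight in fields)
    if total_weight == 0:
        return 0.0  # Assumption: a section without weighted fields scores 0.
    return sum(score * weight for score, weight in fields) / total_weight


def document_weighted_score(section_scores: List[float]) -> float:
    """Plain average of section-level weighted scores."""
    if not section_scores:
        return 0.0
    return sum(section_scores) / len(section_scores)


# Three fields as (score, weight): critical, normal, and low-priority.
fields = [(1.0, 2.0), (0.8, 1.0), (0.6, 0.5)]
print(round(section_weighted_score(fields), 3))
```

Dividing by the sum of weights keeps the result in the 0 to 1 range whatever weights are chosen.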
Example Calculation:

    Field A: score=1.0, weight=2.0 → weighted contribution = 2.0
    Field B: score=0.8, weight=1.0 → weighted contribution = 0.8
    Field C: score=0.6, weight=0.5 → weighted contribution = 0.3

    Weighted Overall Score = (2.0 + 0.8 + 0.3) / (2.0 + 1.0 + 0.5) = 3.1 / 3.5 = 0.886

Benefits
- Business Alignment: Evaluation scores reflect business priorities, not just technical accuracy
- Flexible Tolerances: Accept minor errors in low-priority fields while demanding perfection for critical data
- Actionable Metrics: Quickly identify when important fields are failing evaluation
- Cost Optimization: Focus human review efforts on documents with low weighted scores
Best Practices
- Weight Critical Fields Higher: Invoice amounts, dates, IDs should have weights ≥ 1.5
- Use Default Weight for Standard Fields: Most fields should use weight=1.0 (or omit for default)
- Reduce Weight for Optional Fields: Notes, descriptions can use weight=0.5
- Test Different Weightings: Adjust weights based on business feedback and error impact analysis
Type Coercion and Data Compatibility
The evaluation framework automatically coerces data types to match schema expectations, preventing validation errors from type mismatches between baseline and actual data.
Automatic Type Conversion
The system intelligently converts data types when baseline format differs from schema:
Common Scenarios:
- Baseline has float 1250.5 but schema expects string → Converts to "1250.5"
- Baseline has string "123" but schema expects integer → Converts to 123
- Baseline has string "true" but schema expects boolean → Converts to true
- Nested objects and arrays are recursively coerced
When Type Coercion Happens
Type coercion occurs automatically during evaluation:
- Schema Analysis: System examines expected field types from JSON Schema
- Data Loading: Baseline and actual results loaded from S3
- Type Detection: Identifies mismatches between data and schema types
- Automatic Coercion: Converts values to match schema expectations
- Evaluation: Performs comparison using coerced, type-compatible values
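
The coercion step can be sketched as a recursive function driven by the schema's type keyword. This is a simplified illustration under stated assumptions: `coerce_value` is a hypothetical name, only single (non-union) types are handled, and only the string "true" (case-insensitive) is treated as a true boolean.

```python
from typing import Any


def coerce_value(value: Any, schema: dict) -> Any:
    """Coerce a baseline value to the type declared in a JSON Schema fragment."""
    if value is None:
        return None  # Null values are preserved and handled as missing fields.
    expected = schema.get("type")
    if expected == "string":
        return value if isinstance(value, str) else str(value)
    if expected == "integer":
        return int(value)
    if expected == "number":
        return float(value)
    if expected == "boolean":
        if isinstance(value, str):
            return value.strip().lower() == "true"
        return bool(value)
    if expected == "object" and isinstance(value, dict):
        properties = schema.get("properties", {})
        return {key: coerce_value(child, properties.get(key, {})) for key, child in value.items()}
    if expected == "array" and isinstance(value, list):
        item_schema = schema.get("items", {})
        return [coerce_value(item, item_schema) for item in value]
    return value  # Unknown or missing type: leave the value untouched.


schema = {
    "type": "object",
    "properties": {
        "amount": {"type": "string"},
        "paid": {"type": "boolean"},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"quantity": {"type": "integer"}}},
        },
    },
}
baseline = {"amount": 1250.5, "paid": "true", "line_items": [{"quantity": "123"}]}
print(coerce_value(baseline, schema))
```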
Benefits
- Flexible Baselines: Use baseline data from any source without format concerns
- Focus on Semantics: Evaluation focuses on content accuracy, not type strictness
- Backward Compatibility: Works with legacy baseline data created before schema standardization
- Reduced Errors: Prevents Pydantic validation failures from type mismatches
Limitations
- Loss of Precision: Some conversions may lose precision (e.g., very large integers to floats)
- String Coercion: Complex objects converted to strings may not convert back perfectly
- Null Handling: Null values are preserved and handled as missing fields
Best Practices
- Create Consistent Baselines: Use processed outputs as baseline templates when possible
- Test with Representative Data: Validate type coercion with real documents
- Monitor Warnings: Check logs for type coercion warnings during development
- Update Schemas: Define schema types that match your baseline data format when practical
Assessment Confidence Integration
The evaluation framework automatically integrates with the assessment feature to provide enhanced quality insights. When documents have been processed with assessment enabled (assessment.enabled: true in the configuration), the evaluation reports include confidence scores alongside traditional accuracy metrics.
Confidence Score Display
The evaluation framework automatically extracts confidence scores from the explainability_info section of assessment results and displays them in both JSON and Markdown evaluation reports:
- Confidence: Confidence score for extraction results being evaluated
Enhanced Evaluation Reports
When confidence data is available, evaluation reports include additional columns:
| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
| :----: | --------- | -------- | ------ | :---------------: | ----- | ------ | ------ |
| ✅ | invoice_number | INV-2024-001 | INV-2024-001 | 0.92 | 1.00 | EXACT | Exact match |
| ❌ | vendor_name | ABC Corp | XYZ Inc | 0.75 | 0.00 | EXACT | Values do not match |
Quality Analysis Benefits
The combination of evaluation accuracy and confidence scores provides deeper insights:
- Extraction Quality Assessment: Low confidence highlights extraction results requiring human verification
- Quality Prioritization: Focus improvement efforts on attributes with both low confidence and low accuracy
- Pattern Identification: Analyze relationships between confidence levels and evaluation outcomes
Backward Compatibility
The confidence integration is fully backward compatible:
- Evaluation reports without assessment data show “N/A” in confidence columns
- All existing evaluation workflows continue to function unchanged
- No additional configuration required to enable confidence display
Configuration
Stack Deployment Parameters
Set the following parameter during stack deployment:

    EvaluationBaselineBucketName:
      Description: Existing bucket with baseline data, or leave empty to create new bucket

Runtime Configuration
Control evaluation behavior through the configuration file (no stack redeployment needed):

    evaluation:
      enabled: true  # Set to false to disable evaluation processing
      llm_method:
        model: "us.anthropic.claude-3-haiku-20240307-v1:0"  # Model for evaluation reports
        temperature: "0.0"
        top_p: "0.0"
        max_tokens: "4096"
        # Additional model parameters...

Benefits of Configuration-Based Control:
- Enable/disable evaluation without stack redeployment
- Runtime control similar to summarization and assessment features
- Zero LLM costs when disabled (step executes but skips processing)
- Consistent feature control pattern across the solution
Attribute-Specific Evaluation Methods
You can also configure evaluation methods for specific document classes and attributes through the solution’s configuration. The framework supports three types of attributes with different evaluation approaches:
Simple Attributes
Basic single-value extractions evaluated as individual fields:

    classes:
      - $schema: "https://json-schema.org/draft/2020-12/schema"
        $id: invoice
        x-aws-idp-document-type: invoice
        type: object
        properties:
          invoice_number:
            type: string
            description: The unique identifier for the invoice
            x-aws-idp-evaluation-method: EXACT  # Use exact string matching
          amount_due:
            type: string
            description: The total amount to be paid
            x-aws-idp-evaluation-method: NUMERIC_EXACT  # Use numeric comparison
          vendor_name:
            type: string
            description: Name of the vendor
            x-aws-idp-evaluation-method: FUZZY  # Use fuzzy matching
            x-aws-idp-confidence-threshold: 0.8  # Minimum similarity threshold

Group Attributes
Nested object structures where each sub-attribute is evaluated individually:

    classes:
      - $schema: "https://json-schema.org/draft/2020-12/schema"
        $id: BankStatement
        x-aws-idp-document-type: "Bank Statement"
        type: object
        properties:
          Account Holder Address:
            type: object
            description: "Complete address information for the account holder"
            properties:
              Street Number:
                type: string
                description: "House or building number"
                x-aws-idp-evaluation-method: FUZZY
                x-aws-idp-confidence-threshold: 0.9
              Street Name:
                type: string
                description: "Name of the street"
                x-aws-idp-evaluation-method: FUZZY
                x-aws-idp-confidence-threshold: 0.8
              City:
                type: string
                description: "City name"
                x-aws-idp-evaluation-method: FUZZY
                x-aws-idp-confidence-threshold: 0.9
              State:
                type: string
                description: "State abbreviation (e.g., CA, NY)"
                x-aws-idp-evaluation-method: EXACT
              ZIP Code:
                type: string
                description: "5 or 9 digit postal code"
                x-aws-idp-evaluation-method: EXACT

List Attributes
Arrays of items where each item’s attributes are evaluated individually across all list entries.
Important: Structured arrays (arrays of objects) use HUNGARIAN matching with a special match-threshold parameter that applies to the entire list, not individual fields.

    classes:
      - $schema: "https://json-schema.org/draft/2020-12/schema"
        $id: BankStatement
        x-aws-idp-document-type: "Bank Statement"
        type: object
        properties:
          Transactions:
            type: array
            description: "List of all transactions in the statement period"
            x-aws-idp-list-item-description: "Individual transaction record"
            x-aws-idp-evaluation-method: HUNGARIAN  # Required for List[Object]
            x-aws-idp-evaluation-match-threshold: 0.8  # Threshold for item matching
            items:
              type: object
              properties:
                Date:
                  type: string
                  description: "Transaction date (MM/DD/YYYY)"
                  x-aws-idp-evaluation-method: FUZZY
                  x-aws-idp-evaluation-threshold: 0.9
                Description:
                  type: string
                  description: "Transaction description or merchant name"
                  x-aws-idp-evaluation-method: SEMANTIC
                  x-aws-idp-evaluation-threshold: 0.7
                Amount:
                  type: string
                  description: "Transaction amount"
                  x-aws-idp-evaluation-method: NUMERIC_EXACT

Understanding Threshold vs Match-Threshold
The evaluation framework uses two different threshold concepts depending on the field type:
For Regular Fields (String, Number, Object):
Use x-aws-idp-evaluation-threshold to control how closely values must match:

    vendor_name:
      type: string
      x-aws-idp-evaluation-method: FUZZY
      x-aws-idp-evaluation-threshold: 0.85  # 85% similarity required

For Structured Arrays (List[Object] with HUNGARIAN):
Use x-aws-idp-evaluation-match-threshold to control item pairing:

    CityTaxes:
      type: array
      x-aws-idp-evaluation-method: HUNGARIAN
      x-aws-idp-evaluation-match-threshold: 0.8  # 80% match for optimal pairing
      items:
        type: object
        properties:
          city:
            type: string
          tax_amount:
            type: number

Why the distinction?
- Regular threshold: Applied to individual field comparisons
- Match threshold: Applied to Hungarian algorithm’s optimal item pairing logic
- Using the wrong threshold type will result in a validation error
UI Behavior:
- When you select an array field with object items in the configuration UI and choose the HUNGARIAN method, the UI automatically shows “Match Threshold” instead of “Evaluation Threshold”
- The system prevents you from using incompatible method-field combinations
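
To illustrate why the match threshold applies to item pairing rather than to individual fields, the sketch below pairs expected and actual list items so that total similarity is maximal, then flags each pair against the threshold. It is not the Hungarian algorithm: it uses a brute-force search over permutations, which finds the same optimal total on short lists but does not scale. The function names, the equality-based item similarity, and the treatment of pairs below the threshold as unmatched are all assumptions made for illustration.

```python
from itertools import permutations
from typing import List, Tuple


def item_similarity(expected: dict, actual: dict) -> float:
    """Fraction of fields with equal values (stand-in for per-field comparators)."""
    keys = set(expected) | set(actual)
    if not keys:
        return 1.0
    return sum(expected.get(key) == actual.get(key) for key in keys) / len(keys)


def best_pairing(
    expected: List[dict], actual: List[dict], match_threshold: float = 0.8
) -> List[Tuple[int, int, float, bool]]:
    """Return (expected_index, actual_index, similarity, is_match) for the best pairing."""
    size = max(len(expected), len(actual))
    best_total, best_perm = -1.0, ()
    # Brute force: try every assignment of expected slots to actual slots.
    for perm in permutations(range(size)):
        total = sum(
            item_similarity(expected[i], actual[j])
            for i, j in enumerate(perm)
            if i < len(expected) and j < len(actual)
        )
        if total > best_total:
            best_total, best_perm = total, perm
    pairs = []
    for i, j in enumerate(best_perm):
        if i < len(expected) and j < len(actual):
            score = item_similarity(expected[i], actual[j])
            pairs.append((i, j, score, score >= match_threshold))
    return pairs


expected = [{"city": "Seattle", "tax_amount": 10.0}, {"city": "Tacoma", "tax_amount": 5.0}]
actual = [{"city": "Tacoma", "tax_amount": 5.0}, {"city": "Seattle", "tax_amount": 12.0}]
for pair in best_pairing(expected, actual):
    print(pair)
```

In this example the items are paired correctly despite being in a different order, and only the Seattle pair falls below the 0.8 match threshold.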
Method Compatibility Rules
Different evaluation methods work with different field types:
| Method | Compatible Types | Requires |
|---|---|---|
| EXACT | String, Number, Integer, Boolean | evaluation-threshold (optional) |
| NUMERIC_EXACT | Number, Integer, String | evaluation-threshold (tolerance) |
| FUZZY | String | evaluation-threshold (required) |
| LEVENSHTEIN | String | evaluation-threshold (required) |
| SEMANTIC | String, Object | evaluation-threshold (required) |
| LLM | String, Object, Array | evaluation-threshold (required) |
| HUNGARIAN | Array[Object] ONLY | evaluation-match-threshold (required) |
Validation:
- Backend validates method-type compatibility and raises clear errors for mismatches
- UI filters available methods based on the selected field type
- Cannot select HUNGARIAN for non-array fields
- Cannot select other methods for structured arrays (must use HUNGARIAN)
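
The compatibility rules can be expressed as a small validation routine. This is a sketch built from the table and validation bullets above, not the backend's actual validator: `validate_field` and `is_valid` are hypothetical names, and the error messages are illustrative.

```python
# Compatible JSON Schema types per method, transcribed from the table above.
COMPATIBLE_TYPES = {
    "EXACT": {"string", "number", "integer", "boolean"},
    "NUMERIC_EXACT": {"number", "integer", "string"},
    "FUZZY": {"string"},
    "LEVENSHTEIN": {"string"},
    "SEMANTIC": {"string", "object"},
    "LLM": {"string", "object", "array"},
}
MATCH_THRESHOLD_KEY = "x-aws-idp-evaluation-match-threshold"


def validate_field(field_schema: dict) -> None:
    """Raise ValueError when the method does not fit the field type."""
    method = field_schema.get("x-aws-idp-evaluation-method")
    field_type = field_schema.get("type")
    is_object_array = (
        field_type == "array" and field_schema.get("items", {}).get("type") == "object"
    )
    if method is None:
        return  # No explicit method configured; nothing to check here.
    if method == "HUNGARIAN":
        if not is_object_array:
            raise ValueError("HUNGARIAN is only valid for arrays of objects")
        if MATCH_THRESHOLD_KEY not in field_schema:
            raise ValueError("HUNGARIAN requires a match threshold")
        return
    if is_object_array:
        raise ValueError("Arrays of objects must use HUNGARIAN")
    if MATCH_THRESHOLD_KEY in field_schema:
        raise ValueError("Match threshold is only valid with HUNGARIAN")
    if method not in COMPATIBLE_TYPES:
        raise ValueError(f"Unknown evaluation method: {method}")
    if field_type not in COMPATIBLE_TYPES[method]:
        raise ValueError(f"{method} is not compatible with type {field_type}")


def is_valid(field_schema: dict) -> bool:
    """Convenience wrapper returning True or False instead of raising."""
    try:
        validate_field(field_schema)
    except ValueError:
        return False
    return True


print(is_valid({"type": "string", "x-aws-idp-evaluation-method": "FUZZY"}))
print(is_valid({"type": "string", "x-aws-idp-evaluation-method": "HUNGARIAN"}))
```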
Attribute Processing and Evaluation
The evaluation framework automatically processes nested structures by flattening them into individual evaluable fields:
Group Attribute Processing
Group attributes are flattened using dot notation:
- Account Holder Address.Street Number (evaluated with FUZZY method)
- Account Holder Address.City (evaluated with FUZZY method)
- Account Holder Address.State (evaluated with EXACT method)
List Attribute Processing
List attributes are processed by creating individual evaluations for each array item:
- Transactions[0].Date (evaluated with FUZZY method)
- Transactions[0].Amount (evaluated with NUMERIC_EXACT method)
- Transactions[1].Date (evaluated with FUZZY method)
- Transactions[1].Amount (evaluated with NUMERIC_EXACT method)
- And so on for each transaction in the list…
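
The flattening described above can be sketched with a short recursive function that builds dot-notation paths for groups and indexed paths for lists. The function name `flatten` is hypothetical, and the sketch covers only the path naming, not how the solution attaches evaluation methods to each path.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into {path: leaf_value}."""
    flat: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        flat[prefix] = value  # Leaf value: record it under the accumulated path.
    return flat


result = {
    "Account Holder Address": {"City": "Seattle", "State": "WA"},
    "Transactions": [
        {"Date": "01/15/2024", "Amount": "-25.00"},
        {"Date": "01/16/2024", "Amount": "-4.50"},
    ],
}
for path, leaf in flatten(result).items():
    print(path, "=", leaf)
```

Note that empty groups and empty lists produce no entries in this sketch, so they would need separate handling if they must appear in a report.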
Evaluation Reports for Nested Structures
The evaluation reports provide detailed breakdowns for all nested attributes:
Group Attribute Results:
| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
| :----: | --------- | -------- | ------ | :--------: | ----- | ------ | ------ |
| ✅ | Account Holder Address.Street Number | 123 | 123 | 0.95 | 1.00 | FUZZY | Exact match |
| ✅ | Account Holder Address.City | Seattle | Seattle | 0.88 | 1.00 | FUZZY | Exact match |
| ❌ | Account Holder Address.State | WA | Washington | 0.82 | 0.00 | EXACT | Values do not match exactly |
List Attribute Results:
| Status | Attribute | Expected | Actual | Confidence | Score | Method | Reason |
| :----: | --------- | -------- | ------ | :--------: | ----- | ------ | ------ |
| ✅ | Transactions[0].Date | 01/15/2024 | 01/15/2024 | 0.94 | 1.00 | FUZZY | Exact match |
| ✅ | Transactions[0].Amount | -25.00 | -25.00 | 0.92 | 1.00 | NUMERIC_EXACT | Exact numeric match |
| ✅ | Transactions[1].Description | Coffee Shop | Starbucks Coffee | 0.85 | 0.88 | SEMANTIC | Semantically similar |
Evaluation Metrics for Complex Documents
For documents with nested structures, the evaluation framework provides comprehensive metrics at multiple levels:
- Overall Document Metrics: Aggregate accuracy across all attributes (simple, group, and list)
- Section-Level Metrics: Performance within each document section
- Attribute-Level Metrics: Individual performance for each flattened attribute
- Group-Level Insights: Summary statistics for related attributes within groups
- List-Level Analysis: Pattern analysis across list items (e.g., transaction accuracy trends)
This multi-level analysis helps identify specific areas for improvement, such as:
- Consistent issues with certain group attributes (e.g., address parsing)
- Performance degradation with larger transaction lists
- Specific list item attributes that frequently fail evaluation
Document Split Classification Metrics
In addition to extraction accuracy evaluation, the framework now includes document split classification metrics to assess how accurately documents are classified and split into sections. This provides a comprehensive evaluation of both what was extracted and how documents were classified and organized.
Overview
Document split classification metrics evaluate three key aspects:
- Page-Level Classification: Accuracy of classifying individual pages
- Document Split Grouping: Accuracy of grouping pages into sections
- Page Order Preservation: Accuracy of maintaining correct page order within sections
These metrics are calculated by comparing the document_class and split_document.page_indices fields in each section’s result JSON.
Three Types of Accuracy
1. Page Level Accuracy
Purpose: Measures how accurately individual pages are classified, regardless of how they’re grouped into sections.
Calculation: For each page index across all sections, compare the expected document_class with the predicted document_class.
Use Case: Identify if pages are being assigned to the correct document types.
Example:

    Expected: Page 0 → Invoice, Page 1 → Invoice, Page 2 → Receipt
    Predicted: Page 0 → Invoice, Page 1 → Receipt, Page 2 → Receipt
    Result: 2/3 pages correct = 66.7% accuracy

2. Split Accuracy (Without Page Order)
Purpose: Measures whether pages are correctly grouped into sections with the right classification, regardless of page order.
Calculation: For each expected section, check if any predicted section has:
- The same set of page indices (as a set, order doesn’t matter)
- The same document_class
Both conditions must be met for a section to be marked as correct.
Use Case: Verify that pages belonging together are kept together, even if their order might vary.
Example:

    Expected Section A: Class=Invoice, Pages={0, 1, 2}
    Predicted Section X: Class=Invoice, Pages={2, 0, 1} ✅ Match (same set)

    Expected Section B: Class=Receipt, Pages={3, 4}
    Predicted Section Y: Class=Receipt, Pages={3, 4} ✅ Match

    Expected Section C: Class=Payslip, Pages={5}
    Predicted Section Z: Class=Invoice, Pages={5} ❌ No match (wrong class)

    Result: 2/3 sections correct = 66.7% accuracy

3. Split Accuracy (With Page Order)
Purpose: The strictest evaluation. Measures correct grouping with exact page order preservation.
Calculation: Same as “Without Order” but the page indices list must match exactly (same pages, same order).
Use Case: Verify that multi-page documents maintain correct page sequence.
Example:

    Expected Section A: Class=Invoice, Pages=[0, 1, 2]
    Predicted Section X: Class=Invoice, Pages=[0, 1, 2] ✅ Match (exact order)

    Expected Section B: Class=Receipt, Pages=[3, 4]
    Predicted Section Y: Class=Receipt, Pages=[4, 3] ❌ No match (wrong order)

    Result: 1/2 sections correct = 50% accuracy

Report Structure
Document split metrics are integrated into the unified evaluation report:
# Evaluation Report
## Summary**Document Split Classification:**- Page Level Accuracy: 🟢 85/100 pages [████████████████░░░░] 85%- Split Accuracy (Without Order): 🟡 15/20 sections [███████████████░░░░░] 75%- Split Accuracy (With Order): 🟠 12/20 sections [████████████░░░░░░░░] 60%
**Document Extraction:**- Match Rate: 🟢 145/150 attributes matched [███████████████████░] 97%- Precision: 0.97 | Recall: 0.95 | F1 Score: 🟢 0.96
## Overall Metrics
### Document Split Classification Metrics| Metric | Value | Rating || page_level_accuracy | 0.8500 | 🟡 Good || split_accuracy_without_order | 0.7500 | 🟡 Good || split_accuracy_with_order | 0.6000 | 🟠 Fair |
### Document Extraction Metrics| Metric | Value | Rating || precision | 0.9700 | 🟢 Excellent || recall | 0.9500 | 🟢 Excellent || f1_score | 0.9600 | 🟢 Excellent |Data Structure Requirements
Section titled “Data Structure Requirements”For doc split metrics to be calculated, each section’s result JSON must include:
    {
      "document_class": { "type": "Invoice" },
      "split_document": { "page_indices": [0, 1] },
      "inference_result": {
        // Extracted attributes
      }
    }

- Page indices are 0-based and may be non-sequential
- Missing or null fields are handled gracefully (treated as “Unknown” class or empty page list)
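The fallback behaviour for missing or null fields can be pictured with a small reader function. This is a sketch under stated assumptions: `read_split_fields` is a hypothetical helper, and the framework's actual parsing code is not shown in this document.

```python
import json
from typing import Any, Dict, List, Tuple


def read_split_fields(section_result: Dict[str, Any]) -> Tuple[str, List[int]]:
    """Return (class, page indices), falling back to "Unknown" and an empty list."""
    # `or {}` covers both a missing key and an explicit JSON null.
    doc_class = (section_result.get("document_class") or {}).get("type") or "Unknown"
    pages = (section_result.get("split_document") or {}).get("page_indices") or []
    return doc_class, list(pages)


complete = json.loads(
    '{"document_class": {"type": "Invoice"},'
    ' "split_document": {"page_indices": [0, 1]},'
    ' "inference_result": {}}'
)
partial = json.loads('{"document_class": null, "inference_result": {}}')

complete_fields = read_split_fields(complete)
partial_fields = read_split_fields(partial)
```

The second input has a null class and no `split_document` key, so it resolves to the "Unknown" class with an empty page list.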
Setup and Usage
Step 1: Creating Baseline Data
Creating accurate baseline data is the foundation of the evaluation framework. There are two main approaches:
Method 1: Use Existing Processing Results (Using Copy to Baseline Feature)
- Process documents through the GenAIIDP solution
- Review the output in the web UI
- Make any necessary corrections
- For documents with satisfactory results, click “Copy to Baseline”
- The system will asynchronously copy all processing results to the baseline bucket
- The document status will update to indicate baseline availability:
  - BASELINE_COPYING: Copy operation in progress
  - BASELINE_AVAILABLE: Document successfully copied to baseline
  - BASELINE_ERROR: Error occurred during the copy operation
Method 2: Create Baseline Data Manually
Important: Baselines must follow the correct directory structure:
- Create a directory named after your document (e.g., invoice.pdf/)
- Inside, create a sections/1/ subdirectory
- Place your result.json file in sections/1/
- Upload the entire directory structure to the baseline bucket
Best Practices for Creating Baseline Files:
The easiest way to create accurate baseline files is to start with processed results:
- Option A: Use Processed Results as Template
  - Process your document through the GenAIIDP solution first
  - Download the results from the OutputBucket
  - Locate the sections/1/result.json file in the output
  - Find the inference_result section within that file
  - Use this as your baseline template, making any necessary corrections
  - The inference_result contains the extracted attributes in the correct format
- Option B: Use the Solution UI
  - Process your document through the GenAIIDP solution
  - In the Web UI, navigate to the processed document
  - Click “View / Edit data” to review the extracted results
  - Correct any errors directly in the UI
  - Export or copy the corrected data to create your baseline
  - This ensures your baseline matches the exact structure expected by the evaluation framework
Manual Steps:
- Create the directory structure following the pattern: <document-name>/sections/1/
- Create a result.json file with extracted attributes in the correct format (using one of the methods above)
- Upload the complete directory structure to the baseline bucket
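The manual steps above can be staged locally before uploading. The following is a minimal sketch, assuming you build the tree in a temporary directory first; `baseline_key` and `write_local_baseline` are hypothetical helper names, not part of the solution.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def baseline_key(document_key: str) -> str:
    """Relative path of the baseline file for a document key such as 'subfolder/document3.pdf'."""
    return f"{document_key}/sections/1/result.json"


def write_local_baseline(root: Path, document_key: str, attributes: Dict[str, Any]) -> Path:
    """Create <document-name>/sections/1/result.json under a local staging directory."""
    target = root / baseline_key(document_key)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The attributes are wrapped in "inference_result", as the baseline format requires.
    target.write_text(json.dumps({"inference_result": attributes}, indent=2), encoding="utf-8")
    return target


staging_dir = Path(tempfile.mkdtemp())
baseline_path = write_local_baseline(
    staging_dir,
    "invoice.pdf",
    {"Invoice Number": "INV-2024-001", "Total Amount": "$1,250.00"},
)
```

The staging directory can then be uploaded to the baseline bucket with whatever tool you normally use for S3 transfers, keeping the directory layout intact.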
Example structure to upload:
    invoice.pdf/
    └── sections/
        └── 1/
            └── result.json

Example result.json content:

    {
      "inference_result": {
        "Invoice Number": "INV-2024-001",
        "Invoice Date": "2024-01-15",
        "Total Amount": "$1,250.00",
        "Vendor Name": "Acme Corp"
      }
    }

Understanding the Baseline Structure
All baselines must follow this directory structure in your S3 baseline bucket:
    baseline-bucket/
    ├── document1.pdf/
    │   └── sections/
    │       └── 1/
    │           └── result.json    # Baseline for document1.pdf
    ├── document2.pdf/
    │   └── sections/
    │       └── 1/
    │           └── result.json    # Baseline for document2.pdf
    └── subfolder/
        └── document3.pdf/
            └── sections/
                └── 1/
                    └── result.json    # Baseline for subfolder/document3.pdf

Key Structure Rules:
- Directory name matches the document filename (e.g., invoice.pdf/)
- Contains a sections/1/ subdirectory
- The result.json file contains the inference results in this format:

    {
      "inference_result": {
        "Invoice Number": "INV-2024-001",
        "Invoice Date": "2024-01-15",
        "Total Amount": "$1,250.00",
        "Vendor Name": "Acme Corp"
      }
    }

Step 2: Viewing Evaluation Reports
Once documents are processed with baselines:
- In the Web UI, select a document from the Documents list
- Click “View Evaluation Report” button
- The report displays:
  - Section classification accuracy
  - Field-by-field comparison with visual indicators (✅/❌)
  - Analysis of differences with detailed reasons
  - Overall accuracy assessment with color-coded metrics (🟢 Excellent, 🟡 Good, 🟠 Fair, 🔴 Poor)
  - Progress bar visualizations for match rates
  - Comprehensive metrics and performance ratings
  - Confidence scores (if assessment is enabled)
Best Practices
Baseline Management
- Start with high-quality baselines: Use processed results from the UI and correct any errors before creating baselines
- Maintain baseline consistency: Ensure all baseline files follow the correct directory structure
- Version your baselines: Keep different baseline sets for different document versions or testing scenarios
Evaluation Strategy
- Enable auto-evaluation during testing/tuning phases: Get immediate feedback on accuracy
- Disable auto-evaluation in production: Reduce costs by evaluating only when needed
- Use evaluation reports to:
  - Compare different processing patterns
  - Test effects of prompt changes
  - Monitor accuracy over time
  - Identify areas for improvement
Configuration Best Practices
- Choose appropriate evaluation methods: Match methods to your data types (EXACT for IDs, FUZZY for names, SEMANTIC for descriptions)
- Set realistic thresholds: Start with strict thresholds (0.9+) and adjust based on your accuracy requirements
- Configure confidence tracking: Enable assessment to get confidence scores in evaluation reports
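As a rough illustration of matching methods to data types, the fragment below writes a JSON Schema as a Python dict. Treat it as a sketch only: `x-aws-idp-evaluation-weight` is the one extension key named in this document, while the `x-aws-idp-evaluation-method` and `x-aws-idp-evaluation-threshold` keys, the field names, and the `methods_by_field` helper are assumptions made by analogy and may not match the real configuration format.

```python
from typing import Any, Dict

# Hypothetical schema fragment. The "-method" and "-threshold" extension keys
# are ASSUMED by analogy with "x-aws-idp-evaluation-weight"; check the
# Configuration section for the actual key names before using them.
invoice_schema: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "InvoiceNumber": {  # identifier: exact comparison, weighted as critical
            "type": "string",
            "x-aws-idp-evaluation-method": "EXACT",
            "x-aws-idp-evaluation-weight": 2.0,
        },
        "VendorName": {  # name: tolerate small spelling differences
            "type": "string",
            "x-aws-idp-evaluation-method": "FUZZY",
            "x-aws-idp-evaluation-threshold": 0.9,
        },
        "Description": {  # free text: compare by meaning
            "type": "string",
            "x-aws-idp-evaluation-method": "SEMANTIC",
            "x-aws-idp-evaluation-threshold": 0.9,
        },
    },
}


def methods_by_field(schema: Dict[str, Any]) -> Dict[str, str]:
    """List the evaluation method chosen for each field, for a quick review."""
    return {
        name: spec.get("x-aws-idp-evaluation-method", "unspecified")
        for name, spec in schema["properties"].items()
    }


field_methods = methods_by_field(invoice_schema)
```

Listing the methods per field this way makes it easy to spot fields that were left without an explicit method.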
Automatic Field Discovery
The evaluation framework automatically discovers and evaluates fields that exist in the data but are not defined in the configuration:
- Detects fields present in actual results, expected results, or both
- Uses LLM evaluation method by default for discovered fields
- Clearly marks discovered fields in the report
- Handles cases where fields are missing from either actual or expected results
This capability is valuable when:
- The schema is not yet fully defined
- Extraction outputs vary between documents or runs
- You want to identify potential new fields to add to your configuration
- You need comprehensive evaluation coverage
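Conceptually, discovery amounts to a set difference between the fields found in the data and the fields named in the configuration. The sketch below shows that idea only; `discover_fields` and its return format are hypothetical and do not reflect the framework's internal API.

```python
from typing import Any, Dict, Set


def discover_fields(
    configured: Set[str], expected: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, str]:
    """Map each unconfigured field to where it was found."""
    discovered: Dict[str, str] = {}
    # Union of both key sets, minus anything the configuration already covers.
    for name in sorted((expected.keys() | actual.keys()) - configured):
        if name in expected and name in actual:
            discovered[name] = "both"
        elif name in expected:
            discovered[name] = "expected only"
        else:
            discovered[name] = "actual only"
    return discovered


configured_fields = {"Invoice Number", "Total Amount"}
expected_result = {"Invoice Number": "INV-1", "Total Amount": "$10", "PO Number": "PO-7"}
actual_result = {"Invoice Number": "INV-1", "PO Number": "PO-7", "Currency": "USD"}

discovered = discover_fields(configured_fields, expected_result, actual_result)
```

Here "PO Number" appears on both sides and "Currency" only in the actual result, so both would be candidates for evaluation with the default method and for later addition to the configuration.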
Semantic vs LLM Evaluation
The framework offers two approaches for semantic evaluation:
- SEMANTIC Method: Uses embedding-based comparison with Bedrock Titan embeddings
  - Faster and more cost-effective than LLM-based evaluation
  - Provides similarity scores without explanations
  - Great for high-volume comparisons where speed is important
  - Configurable threshold for matching sensitivity
- LLM Method: Uses Bedrock Claude or other LLM models
  - Provides detailed reasoning for why values match or don’t match
  - Better at handling implicit/explicit information differences
  - More nuanced understanding of semantic equivalence
  - Ideal for cases where understanding the rationale is important
  - Used as the default method for attributes discovered in the data but not in the configuration
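To make the thresholded, embedding-based comparison concrete, here is a minimal sketch that assumes cosine similarity as the score. The document does not state which similarity measure the SEMANTIC method uses, and the toy vectors stand in for real embeddings, so both are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_match(expected_vec: Sequence[float], actual_vec: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """A value pair matches when its similarity reaches the configured threshold."""
    return cosine_similarity(expected_vec, actual_vec) >= threshold


# Toy 3-dimensional vectors; real embeddings would come from an embedding model.
close_pair = semantic_match([1.0, 0.0, 1.0], [0.9, 0.1, 1.1])
distant_pair = semantic_match([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The result is a score and a yes/no decision with no explanation attached, which is the trade-off against the LLM method described above.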
Metrics and Monitoring
The evaluation framework includes comprehensive monitoring through CloudWatch metrics:
- Evaluation Success/Failure Rates: Track evaluation completion and error rates
- Baseline Data Availability: Monitor percentage of documents with baseline data for comparison
- Report Generation Performance: Track time to generate evaluation reports
- Model Usage Metrics: Monitor token consumption and API calls for evaluation models
- Accuracy Trends: Historical tracking of processing accuracy over time
The framework calculates the following detailed metrics for each document and section:
Extraction Accuracy Metrics:
- Precision: Accuracy of positive predictions (TP / (TP + FP))
- Recall: Coverage of actual positive cases (TP / (TP + FN))
- F1 Score: Harmonic mean of precision and recall (2 × Precision × Recall / (Precision + Recall))
- Accuracy: Overall correctness ((TP + TN) / (TP + TN + FP + FN))
- False Alarm Rate: Rate of false positives among negatives (FP / (FP + TN))
- False Discovery Rate: Rate of false positives among positive predictions (FP / (FP + TP))
- Weighted Overall Score: Field-importance-weighted aggregate score
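The count-based formulas above can be checked with a short function. This is a sketch for working through the arithmetic, not the framework's code; returning 0.0 when a denominator is zero is an assumption made here to keep the example total.

```python
from typing import Dict


def extraction_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the count-based extraction metrics from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Assumption: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "false_alarm_rate": ratio(fp, fp + tn),
        "false_discovery_rate": ratio(fp, fp + tp),
    }


metrics = extraction_metrics(tp=8, fp=2, tn=5, fn=2)
```

Note that false discovery rate is always 1 minus precision, which is a quick sanity check when reading a report.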
Document Split Classification Metrics:
- Page Level Accuracy: Classification accuracy for individual pages
- Split Accuracy (Without Order): Correct page grouping regardless of order
- Split Accuracy (With Order): Correct page grouping with exact order
- Total Pages: Total number of pages evaluated
- Total Splits: Total number of document sections/splits evaluated
- Correctly Classified Pages: Count of pages with correct classification
- Correctly Split Sections: Count of sections with correct page grouping
The framework also tracks the following evaluation statuses:
- RUNNING: Evaluation is in progress
- COMPLETED: Evaluation finished successfully
- FAILED: Evaluation encountered errors
- NO_BASELINE: No baseline data available for comparison
- BASELINE_COPYING: Process of copying document to baseline is in progress
- BASELINE_AVAILABLE: Document is available in the baseline
- BASELINE_ERROR: Error occurred during the baseline copy operation
Aggregate Evaluation Analytics and Reporting
The solution includes a comprehensive analytics system that stores evaluation metrics in a structured database for advanced reporting and trend analysis.
ReportingDatabase Overview
The evaluation framework automatically saves detailed metrics to an AWS Glue database (available from CloudFormation stack outputs as ReportingDatabase) containing three main tables:
1. document_evaluations
Stores document-level metrics including:
Extraction Metrics:
- Document ID, input key, evaluation date
- Overall accuracy, precision, recall, F1 score
- False alarm rate, false discovery rate
- Weighted overall score
- Execution time performance metrics
Document Split Classification Metrics:
- Page level accuracy (double)
- Split accuracy without order (double)
- Split accuracy with order (double)
- Total pages (int)
- Total splits (int)
- Correctly classified pages (int)
- Correctly split without order (int)
- Correctly split with order (int)
2. section_evaluations
Stores section-level metrics including:
- Document ID, section ID, section type
- Section-specific accuracy, precision, recall, F1 score
- Section classification performance
- Evaluation timestamps
3. attribute_evaluations
Stores detailed attribute-level metrics including:
- Document ID, section context, attribute name
- Expected vs actual values, match results
- Individual attribute scores and evaluation methods
- Detailed reasoning for matches/mismatches
Querying Evaluation Results
You have two primary ways to analyze evaluation data:
Option 1: Agent Analytics (Recommended for Most Users)
The Agent Analytics feature in the Web UI provides a natural language interface to query and analyze evaluation results without writing SQL:
- Natural Language Queries: Ask questions like “Show me documents with accuracy below 80%” or “What attributes have the lowest match rates?”
- Automatic SQL Generation: The AI agent automatically writes optimized Athena queries based on your questions
- Interactive Visualizations: Generate charts, graphs, and tables to visualize evaluation trends
- No SQL Knowledge Required: Ideal for business users and analysts
Access Agent Analytics through the Web UI’s “Document Analytics” section. For detailed guidance, see docs/agent-analysis.md.
Option 2: Direct Athena SQL Queries
For advanced users and automated workflows, you can query the evaluation tables directly with Amazon Athena.
All evaluation data is partitioned by date and document for efficient querying:
    -- Example: Find documents with low accuracy in the last 7 days
    SELECT document_id, accuracy, evaluation_date
    FROM "your-database-name".document_evaluations
    WHERE evaluation_date >= current_date - interval '7' day
      AND accuracy < 0.8
    ORDER BY accuracy ASC;

    -- Example: Analyze attribute-level performance trends
    SELECT attribute_name,
           COUNT(*) as total_evaluations,
           AVG(CASE WHEN matched THEN 1.0 ELSE 0.0 END) as match_rate,
           AVG(score) as avg_score
    FROM "your-database-name".attribute_evaluations
    WHERE evaluation_date >= current_date - interval '30' day
    GROUP BY attribute_name
    ORDER BY match_rate ASC;

    -- Example: Section type performance analysis
    SELECT section_type,
           COUNT(*) as total_sections,
           AVG(accuracy) as avg_accuracy,
           AVG(f1_score) as avg_f1_score
    FROM "your-database-name".section_evaluations
    GROUP BY section_type
    ORDER BY avg_accuracy DESC;

    -- Example: Query doc split classification performance
    SELECT document_id,
           page_level_accuracy,
           split_accuracy_without_order,
           split_accuracy_with_order,
           total_pages,
           total_splits,
           evaluation_date
    FROM "your-database-name".document_evaluations
    WHERE page_level_accuracy < 0.9
    ORDER BY page_level_accuracy ASC;

    -- Example: Compare doc split vs extraction accuracy
    SELECT AVG(page_level_accuracy) as avg_page_classification_accuracy,
           AVG(split_accuracy_without_order) as avg_split_grouping_accuracy,
           AVG(precision) as avg_extraction_precision,
           AVG(recall) as avg_extraction_recall,
           AVG(f1_score) as avg_extraction_f1
    FROM "your-database-name".document_evaluations
    WHERE evaluation_date >= current_date - interval '7' day;

    -- Example: Identify documents with page classification issues
    SELECT document_id,
           total_pages,
           correctly_classified_pages,
           page_level_accuracy,
           ROUND((total_pages - correctly_classified_pages), 0) as misclassified_pages
    FROM "your-database-name".document_evaluations
    WHERE page_level_accuracy < 1.0
    ORDER BY misclassified_pages DESC;

    -- Example: Analyze split accuracy trends over time
    SELECT DATE_TRUNC('day', evaluation_date) as eval_day,
           COUNT(*) as documents_evaluated,
           AVG(split_accuracy_without_order) as avg_split_accuracy_unordered,
           AVG(split_accuracy_with_order) as avg_split_accuracy_ordered
    FROM "your-database-name".document_evaluations
    WHERE evaluation_date >= current_date - interval '30' day
    GROUP BY DATE_TRUNC('day', evaluation_date)
    ORDER BY eval_day DESC;

Analytics Notebook
The solution includes a comprehensive Jupyter notebook (notebooks/evaluation_reporting_analytics.ipynb) that provides:
- Automated Data Loading: Connects to Athena and automatically loads partitions for all evaluation tables
- Table Testing: Validates connectivity and shows content summaries for document, section, and attribute evaluation tables
- Multi-level Analysis: Document, section, and attribute-level performance insights with detailed breakdowns
- Visual Analytics: Rich charts and graphs showing accuracy trends, problem areas, and performance distributions
- Problem Identification: Automatically flags low-performing documents, sections, and attributes requiring attention
- Trend Analysis: Historical accuracy tracking showing improvement/regression patterns over time
- Configurable Filters: Dynamic filtering by date ranges, document name patterns, and accuracy thresholds
- Method Comparison: Analysis of different evaluation methods and their effectiveness
- Processing Time Analysis: Correlation between execution time and accuracy performance
Key Analytics Features:
- Comprehensive Dashboard: Interactive summary report with health indicators and top issues
- Problem Detection Reports:
  - Documents with lowest accuracy scores
  - Section types with poor performance
  - Attributes with low match rates and common failure reasons
- Accuracy Trend Analysis: Track same documents over time to identify improvement/regression patterns
- Processing Performance: Analyze correlation between processing time and accuracy
- Method Effectiveness: Compare different evaluation methods’ performance and coverage
- Export Capabilities: Save analysis results to CSV files for further analysis or reporting
Using the Analytics Notebook:
- Configuration: Set your ReportingDatabase name, AWS region, and S3 output location for Athena
- Filter Setup: Configure date range, document name filters, and accuracy thresholds
- Automated Analysis: Run partition loading, table testing, and comprehensive reporting
- Interactive Updates: Use update_filters() function to dynamically change parameters and re-run analyses
- Visual Insights: Review generated charts and visualizations for patterns and trends
- Export Results: Optional CSV export for stakeholder reporting and further analysis
Sample Analytics Use Cases:
- Quality Monitoring: Weekly accuracy assessments across all document types
- Performance Tuning: Identify which attributes or sections need prompt improvements
- Trend Tracking: Monitor if recent changes improved or degraded accuracy
- Method Optimization: Compare evaluation methods to select the most effective approach
- Problem Prioritization: Focus improvement efforts on consistently problematic areas
Data Retention and Partitioning
- Evaluation data is automatically partitioned by year/month/day/document for efficient querying
- Data retention follows the stack’s DataRetentionInDays parameter
- Partitions are automatically loaded when using the analytics notebook
- Historical data enables long-term trend analysis and accuracy monitoring
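Outside the notebook, partitions can be refreshed by submitting `MSCK REPAIR TABLE` statements to Athena for each evaluation table. The sketch below is one way to script that with boto3 under stated assumptions: the helper names are hypothetical, and the database name, region, and S3 output location are values you supply from your own stack.

```python
from typing import List, Sequence

# The three evaluation tables described in the ReportingDatabase overview.
EVALUATION_TABLES = ("document_evaluations", "section_evaluations", "attribute_evaluations")


def partition_repair_statements(tables: Sequence[str] = EVALUATION_TABLES) -> List[str]:
    """Build one MSCK REPAIR TABLE statement per evaluation table."""
    return [f"MSCK REPAIR TABLE {table}" for table in tables]


def refresh_partitions(database: str, output_location: str, region: str) -> List[str]:
    """Submit the repair statements to Athena and return the query execution IDs."""
    import boto3  # imported lazily so the statement builder runs without AWS access

    athena = boto3.client("athena", region_name=region)
    execution_ids = []
    for statement in partition_repair_statements():
        response = athena.start_query_execution(
            QueryString=statement,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )
        execution_ids.append(response["QueryExecutionId"])
    return execution_ids


statements = partition_repair_statements()
```

`refresh_partitions` only submits the statements; Athena runs them asynchronously, so a real script would also poll for completion before querying.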
Best Practices for Analytics
- Regular Monitoring: Use the analytics notebook weekly to identify accuracy trends
- Threshold Tuning: Adjust accuracy thresholds based on your use case requirements
- Pattern Recognition: Look for patterns in low-performing document types or sections
- Comparative Analysis: Compare performance across different prompt configurations
- Automated Alerts: Set up CloudWatch alarms based on accuracy metrics stored in the database
Migration from Legacy Evaluation
The feature/stickler branch introduces a new Stickler-based evaluation service while preserving the legacy implementation for backward compatibility:
- New: service.py (Stickler-based) - default for new deployments
- Legacy: service_legacy.py - preserved for existing workflows
All existing configurations are compatible with the Stickler service through the SticklerConfigMapper, which translates IDP evaluation extensions to Stickler format transparently.
What Changed
- Backend Engine: Stickler for structured comparison instead of custom comparators
- Field Weighting: New capability for business-critical fields (x-aws-idp-evaluation-weight)
- Dynamic Schema Generation: Auto-generate schemas from baseline data when configuration is missing
- Type Coercion: Automatic type conversion for baseline data compatibility
- Weighted Scores: New weighted_overall_score metric in evaluation results
- Enhanced Error Handling: Graceful degradation with zero-metric results for failed section evaluations
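To give a feel for how field weights can shift an aggregate score, the sketch below computes a plain weighted mean of per-field scores. This is an assumption for illustration: the document does not specify how Stickler aggregates `weighted_overall_score`, and the default weight of 1.0 for unweighted fields is also assumed.

```python
from typing import Dict


def weighted_mean_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-field scores; fields without a weight count as 1.0 (assumed)."""
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return weighted_sum / total_weight


field_scores = {"InvoiceNumber": 1.0, "VendorName": 0.5, "Notes": 0.0}
field_weights = {"InvoiceNumber": 2.0, "VendorName": 1.0}  # "Notes" falls back to 1.0

weighted_score = weighted_mean_score(field_scores, field_weights)
unweighted_score = weighted_mean_score(field_scores, {})
```

Doubling the weight of the correctly extracted invoice number lifts the aggregate above the unweighted average, which is the intent of weighting business-critical fields.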
What Stayed the Same
- API: Same evaluate_document() and evaluate_section() methods
- Configuration Format: Same JSON Schema with evaluation extensions
- Report Format: Same Markdown and JSON outputs
- Evaluation Methods: Same EXACT, FUZZY, NUMERIC_EXACT, SEMANTIC, LLM, HUNGARIAN
- Assessment Integration: Same confidence score display in reports
- Analytics Database: Same table structure and querying capabilities
Migration Checklist
✅ No Action Required - Existing configurations work with the Stickler service automatically
✅ Optional Enhancements:
- Add x-aws-idp-evaluation-weight to critical fields for business-aligned scoring
- Review auto-generated schema warnings in logs and create explicit configs
- Test type coercion with your baseline data format
✅ Validation:
- Run evaluation on sample documents to verify results
- Review evaluation reports for accuracy and completeness
- Check CloudWatch logs for any warnings or errors
Stickler Version Information
The solution uses Stickler from GitHub:
- Repository: https://github.com/awslabs/stickler
- Branch: sr/json_schema_construction (temporary)
- Commit: de7d0fda6d551088d9b43bea5adb39e58d04b314
- Migration Path: Will switch to main branch once PR #20 merges
For version details, see lib/idp_common_pkg/idp_common/evaluation/stickler_version.py
Troubleshooting Evaluation Issues
Common issues and resolutions:
- Missing Baseline Data
  - Verify baseline files exist in the baseline bucket
  - Check that baseline filenames match the input document keys
  - Ensure baseline files are valid JSON
- Evaluation Failures
  - Check Lambda function logs for error details
  - Verify that the evaluation model is available in your region
  - Increase Lambda timeout if needed for complex documents
- Low Accuracy Scores
  - Review document quality and OCR results
  - Examine prompt configurations for classification and extraction
  - Check for processing errors in the workflow execution
- Analytics Database Issues
  - Ensure the ReportingDatabase is accessible from your AWS account
  - Check that evaluation results are being written to the reporting bucket
  - Verify Athena permissions for querying Glue tables
  - Use “MSCK REPAIR TABLE” in Athena to refresh partitions if needed
- Stickler Dependency Issues
  - Ensure Stickler is installed: pip install -e '.[evaluation]'
  - Verify installation: python -c "from stickler import StructuredModel; print('OK')"
  - Check version: See lib/idp_common_pkg/idp_common/evaluation/stickler_version.py
  - Review Stickler documentation: https://github.com/awslabs/stickler
- Type Coercion Warnings
  - Check Lambda logs for type coercion messages
  - Verify baseline data types match schema expectations
  - Update schema to match baseline format if needed
  - Test with representative documents before production
- Auto-Generated Schema Warnings
  - Look for WARNING level logs about auto-generated schemas
  - Create explicit configuration for production document classes
  - Review generated schema structure in logs
  - Test explicit config before disabling auto-generation