Reporting Database
Reporting Database
Section titled “Reporting Database”Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
The GenAI IDP Accelerator includes a comprehensive reporting database that captures detailed metrics about document processing. This database is implemented as AWS Glue tables over Amazon S3 data in Parquet format, making it queryable through Amazon Athena for analytics and reporting purposes.
Table of Contents
Section titled “Table of Contents”- Evaluation Tables
- Rule Validation Tables
- Metering Table
- Document Sections Tables
- Using the Reporting Database with Athena
Evaluation Tables
Section titled “Evaluation Tables”The evaluation tables store metrics and results from comparing extracted document data against baseline (ground truth) data. These tables provide insights into the accuracy and performance of the document processing system.
Document Evaluations
Section titled “Document Evaluations”The document_evaluations table contains document-level evaluation metrics:
| Column | Type | Description |
|---|---|---|
| document_id | string | Unique identifier for the document |
| input_key | string | S3 key of the input document |
| evaluation_date | timestamp | When the evaluation was performed |
| accuracy | double | Overall accuracy score (0-1) |
| precision | double | Precision score (0-1) |
| recall | double | Recall score (0-1) |
| f1_score | double | F1 score (0-1) |
| false_alarm_rate | double | False alarm rate (0-1) |
| false_discovery_rate | double | False discovery rate (0-1) |
| execution_time | double | Time taken to evaluate (seconds) |
This table is partitioned by date (YYYY-MM-DD format).
Section Evaluations
Section titled “Section Evaluations”The section_evaluations table contains section-level evaluation metrics:
| Column | Type | Description |
|---|---|---|
| document_id | string | Unique identifier for the document |
| section_id | string | Identifier for the section |
| section_type | string | Type/class of the section |
| accuracy | double | Section accuracy score (0-1) |
| precision | double | Section precision score (0-1) |
| recall | double | Section recall score (0-1) |
| f1_score | double | Section F1 score (0-1) |
| false_alarm_rate | double | Section false alarm rate (0-1) |
| false_discovery_rate | double | Section false discovery rate (0-1) |
| evaluation_date | timestamp | When the evaluation was performed |
This table is partitioned by date (YYYY-MM-DD format).
Attribute Evaluations
Section titled “Attribute Evaluations”The attribute_evaluations table contains attribute-level evaluation metrics:
| Column | Type | Description |
|---|---|---|
| document_id | string | Unique identifier for the document |
| section_id | string | Identifier for the section |
| section_type | string | Type/class of the section |
| attribute_name | string | Name of the attribute |
| expected | string | Expected (ground truth) value |
| actual | string | Actual extracted value |
| matched | boolean | Whether the values matched |
| score | double | Match score (0-1) |
| reason | string | Explanation for the match result |
| evaluation_method | string | Method used for comparison |
| confidence | string | Confidence score from extraction |
| confidence_threshold | string | Confidence threshold used |
| evaluation_date | timestamp | When the evaluation was performed |
This table is partitioned by date (YYYY-MM-DD format).
Rule Validation Tables
Section titled “Rule Validation Tables”The rule validation tables store business rule validation results showing compliance and policy adherence for processed documents.
Rule Validation Summary
Section titled “Rule Validation Summary”The rule_validation_summary table contains document-level rule validation results:
| Column | Type | Description |
|---|---|---|
| document_id | string | Unique identifier for the document |
| input_key | string | S3 key of the input document |
| validation_date | timestamp | When the validation was performed |
| overall_status | string | Overall validation status (COMPLETE, FAILED, etc.) |
| total_rule_types | int | Number of rule types evaluated |
| total_rules | int | Total number of rules evaluated |
| pass_count | int | Number of rules that passed |
| fail_count | int | Number of rules that failed |
| information_not_found_count | int | Number of rules where information was not found |
This table is partitioned by date (YYYY-MM-DD format).
Rule Validation Details
Section titled “Rule Validation Details”The rule_validation_details table contains individual rule validation results:
| Column | Type | Description |
|---|---|---|
| document_id | string | Unique identifier for the document |
| rule_type | string | Category/type of the rule |
| rule | string | Description of the specific rule being validated |
| recommendation | string | Validation result (Pass, Fail, Information Not Found) |
| reasoning | string | Explanation for the recommendation |
| supporting_pages | string | JSON array of page numbers supporting the validation |
| validation_date | timestamp | When the validation was performed |
This table is partitioned by date (YYYY-MM-DD format).
Metering Table
Section titled “Metering Table”The metering table captures detailed usage metrics and cost information for each document processing operation:
| Column | Type | Description |
|---|---|---|
| document_id | string | Unique identifier for the document |
| context | string | Processing context (OCR, Classification, Extraction, etc.) |
| service_api | string | Specific API or model used (e.g., textract/analyze_document, bedrock/claude-3) |
| unit | string | Unit of measurement (pages, inputTokens, outputTokens, etc.) |
| value | double | Quantity of the unit consumed |
| number_of_pages | int | Number of pages in the document |
| unit_cost | double | Cost per unit in USD (e.g., cost per token, cost per page) |
| estimated_cost | double | Calculated total cost in USD (value × unit_cost) |
| timestamp | timestamp | When the operation was performed |
This table is partitioned by date (YYYY-MM-DD format).
Cost Calculation and Pricing
Section titled “Cost Calculation and Pricing”The metering table now includes automated cost calculation capabilities:
- unit_cost: Retrieved from pricing configuration for each service_api/unit combination
- estimated_cost: Automatically calculated as value × unit_cost for each record
- Dynamic Pricing: Costs are loaded from configuration and cached for performance
- Fallback Handling: When pricing data is not available, unit_cost defaults to $0.0
Pricing Configuration Format
Section titled “Pricing Configuration Format”Pricing data is loaded from the system configuration in the following format:
pricing: - name: "bedrock/us.anthropic.claude-3-sonnet-20240229-v1:0" units: - name: "inputTokens" price: "3.0e-6" # $0.000003 per input token - name: "outputTokens" price: "1.5e-5" # $0.000015 per output token - name: "textract/analyze_document" units: - name: "pages" price: "0.0015" # $0.0015 per pageCost Calculation Process
Section titled “Cost Calculation Process”- Service/Unit Matching: System attempts exact match for service_api/unit combination
- Partial Matching: If exact match fails, uses fuzzy matching for common patterns
- Cost Calculation: estimated_cost = value × unit_cost
- Caching: Pricing data is cached to avoid repeated configuration lookups
The metering table is particularly valuable for:
- Cost analysis and allocation - Track spending by document type, service, or time period
- Usage pattern identification - Analyze consumption patterns across different models
- Resource optimization - Identify cost-effective processing approaches
- Performance benchmarking - Compare cost efficiency across different document types and sizes
- Budget monitoring - Track actual costs against budgets and forecasts
Document Sections Tables
Section titled “Document Sections Tables”The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically created when new section types are encountered during document processing, eliminating the need for manual table creation.
Automatic Table Creation
Section titled “Automatic Table Creation”When a document is processed and a new section type (classification) is detected, the system automatically:
- Creates a new Glue table for that section type (e.g.,
document_sections_invoice,document_sections_receipt,document_sections_w2) - Configures the table with appropriate schema based on the extracted data
- Sets up partition projection for efficient date-based queries
- Updates the table schema if new fields are detected in subsequent documents
Important: Section type names are normalized to lowercase for consistency with case-sensitive S3 paths. For example, a section classified as “W2” will create a table named document_sections_w2 with data stored in document_sections/w2/.
Dynamic Section Tables
Section titled “Dynamic Section Tables”Document sections are stored in dynamically created tables based on the section classification. Each section type gets its own table with the following characteristics:
Common Metadata Columns:
| Column | Type | Description |
|---|---|---|
| section_id | string | Unique identifier for the section |
| document_id | string | Unique identifier for the document |
| section_classification | string | Type/class of the section |
| section_confidence | double | Confidence score for the section classification |
| timestamp | timestamp | When the document was processed |
Dynamic Data Columns: The remaining columns are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:
- Nested JSON objects are flattened using dot notation (e.g.,
customer.name,customer.address.street) - Arrays are converted to JSON strings
- Primitive values (strings, numbers, booleans) are preserved as their native types
Partitioning: Each section type table is partitioned by date (YYYY-MM-DD format) for efficient querying.
File Organization:
document_sections/├── invoice/│ └── date=2024-01-15/│ ├── doc-123_section_1.parquet│ └── doc-456_section_3.parquet├── receipt/│ └── date=2024-01-15/│ └── doc-789_section_2.parquet└── bank_statement/ └── date=2024-01-15/ └── doc-abc_section_1.parquetCrawler Configuration
Section titled “Crawler Configuration”The AWS Glue Crawler automatically discovers new section types and creates corresponding tables. The crawler can be configured to run:
- Manually (on-demand)
- Every 15 minutes
- Every hour
- Daily (default)
This ensures that new section types are automatically available for querying without manual intervention.
Using the Reporting Database with Athena
Section titled “Using the Reporting Database with Athena”Amazon Athena provides a serverless query service to analyze data directly in Amazon S3. The reporting database tables are automatically registered in the AWS Glue Data Catalog, making them immediately available for querying in Athena.
To use the reporting database with Athena:
- Open the Amazon Athena console
- Select the database named after your stack (e.g.,
idp_reporting) - Start querying the tables using standard SQL
Sample Queries
Section titled “Sample Queries”Here are some example queries to get you started:
Overall accuracy by document type:
SELECT section_type, AVG(accuracy) as avg_accuracy, COUNT(*) as document_countFROM section_evaluationsGROUP BY section_typeORDER BY avg_accuracy DESC;Token usage by model:
SELECT service_api, SUM(CASE WHEN unit = 'inputTokens' THEN value ELSE 0 END) as total_input_tokens, SUM(CASE WHEN unit = 'outputTokens' THEN value ELSE 0 END) as total_output_tokens, SUM(CASE WHEN unit = 'totalTokens' THEN value ELSE 0 END) as total_tokens, COUNT(DISTINCT document_id) as document_countFROM meteringWHERE context = 'Extraction'GROUP BY service_apiORDER BY total_tokens DESC;Extraction confidence vs. accuracy:
SELECT CASE WHEN CAST(confidence AS double) < 0.7 THEN 'Low (<0.7)' WHEN CAST(confidence AS double) < 0.9 THEN 'Medium (0.7-0.9)' ELSE 'High (>0.9)' END as confidence_band, AVG(CASE WHEN matched THEN 1.0 ELSE 0.0 END) as accuracy, COUNT(*) as attribute_countFROM attribute_evaluationsWHERE confidence IS NOT NULLGROUP BY CASE WHEN CAST(confidence AS double) < 0.7 THEN 'Low (<0.7)' WHEN CAST(confidence AS double) < 0.9 THEN 'Medium (0.7-0.9)' ELSE 'High (>0.9)' ENDORDER BY confidence_band;Token usage per page by document type:
SELECT se.section_type, AVG(m.value / m.number_of_pages) as avg_tokens_per_pageFROM metering mJOIN section_evaluations se ON m.document_id = se.document_idWHERE m.unit = 'totalTokens' AND m.number_of_pages > 0GROUP BY se.section_typeORDER BY avg_tokens_per_page DESC;Document sections analysis by type:
-- Query invoice sections for customer analysisSELECT document_id, section_id, "customer.name" as customer_name, "customer.address.city" as customer_city, "total_amount" as invoice_total, dateFROM invoiceWHERE date BETWEEN '2024-01-01' AND '2024-01-31'ORDER BY date DESC;Section processing volume by date:
-- Count sections processed by type and dateSELECT date, section_classification, COUNT(*) as section_count, COUNT(DISTINCT document_id) as document_countFROM ( SELECT date, section_classification, document_id FROM invoice UNION ALL SELECT date, section_classification, document_id FROM receipt UNION ALL SELECT date, section_classification, document_id FROM bank_statement)GROUP BY date, section_classificationORDER BY date DESC, section_count DESC;Date range queries with new partition structure:
-- Efficient date range query using single date partitionSELECT COUNT(*) as total_documents, AVG(accuracy) as avg_accuracyFROM document_evaluationsWHERE date BETWEEN '2024-01-01' AND '2024-01-31';
-- Monthly aggregationSELECT SUBSTR(date, 1, 7) as month, COUNT(*) as document_count, AVG(accuracy) as avg_accuracyFROM document_evaluationsWHERE date >= '2024-01-01'GROUP BY SUBSTR(date, 1, 7)ORDER BY month;Cost analysis queries:
-- Total estimated costs by service APISELECT service_api, SUM(estimated_cost) as total_cost, AVG(estimated_cost) as avg_cost_per_operation, COUNT(*) as operation_count, COUNT(DISTINCT document_id) as document_countFROM meteringWHERE date BETWEEN '2024-01-01' AND '2024-01-31'GROUP BY service_apiORDER BY total_cost DESC;
-- Cost per page analysis by document typeSELECT se.section_type, SUM(m.estimated_cost) / SUM(m.number_of_pages) as cost_per_page, SUM(m.estimated_cost) as total_cost, SUM(m.number_of_pages) as total_pages, COUNT(DISTINCT m.document_id) as document_countFROM metering mJOIN section_evaluations se ON m.document_id = se.document_idWHERE m.number_of_pages > 0 AND m.date BETWEEN '2024-01-01' AND '2024-01-31'GROUP BY se.section_typeORDER BY cost_per_page DESC;
-- Daily cost trendsSELECT date, SUM(estimated_cost) as daily_cost, COUNT(DISTINCT document_id) as documents_processed, SUM(estimated_cost) / COUNT(DISTINCT document_id) as avg_cost_per_documentFROM meteringWHERE date BETWEEN '2024-01-01' AND '2024-01-31'GROUP BY dateORDER BY date;
-- Most expensive documentsSELECT document_id, SUM(estimated_cost) as total_document_cost, SUM(value) as total_units_consumed, COUNT(*) as operations_count, MAX(number_of_pages) as page_countFROM meteringWHERE date BETWEEN '2024-01-01' AND '2024-01-31'GROUP BY document_idORDER BY total_document_cost DESCLIMIT 10;
-- Cost efficiency by model (cost per token)SELECT service_api, SUM(estimated_cost) / SUM(value) as cost_per_token, SUM(estimated_cost) as total_cost, SUM(value) as total_tokens, COUNT(DISTINCT document_id) as document_countFROM meteringWHERE unit IN ('inputTokens', 'outputTokens', 'totalTokens') AND date BETWEEN '2024-01-01' AND '2024-01-31'GROUP BY service_apiORDER BY cost_per_token ASC;
-- Cost breakdown by processing contextSELECT context, SUM(estimated_cost) as total_cost, COUNT(DISTINCT document_id) as document_count, SUM(estimated_cost) / COUNT(DISTINCT document_id) as avg_cost_per_documentFROM meteringWHERE date BETWEEN '2024-01-01' AND '2024-01-31'GROUP BY contextORDER BY total_cost DESC;Creating Dashboards
Section titled “Creating Dashboards”For more advanced visualization and dashboarding:
- Use Amazon QuickSight to connect to your Athena queries
- Create interactive dashboards to monitor:
- Extraction accuracy over time
- Cost trends by document type
- Performance metrics by model
- Resource utilization patterns
You can also export query results to CSV or other formats for use with external business intelligence tools.