Skip to content

Reporting Database

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0

The GenAI IDP Accelerator includes a comprehensive reporting database that captures detailed metrics about document processing. This database is implemented as AWS Glue tables over Amazon S3 data in Parquet format, making it queryable through Amazon Athena for analytics and reporting purposes.

The evaluation tables store metrics and results from comparing extracted document data against baseline (ground truth) data. These tables provide insights into the accuracy and performance of the document processing system.

The document_evaluations table contains document-level evaluation metrics:

ColumnTypeDescription
document_idstringUnique identifier for the document
input_keystringS3 key of the input document
evaluation_datetimestampWhen the evaluation was performed
accuracydoubleOverall accuracy score (0-1)
precisiondoublePrecision score (0-1)
recalldoubleRecall score (0-1)
f1_scoredoubleF1 score (0-1)
false_alarm_ratedoubleFalse alarm rate (0-1)
false_discovery_ratedoubleFalse discovery rate (0-1)
execution_timedoubleTime taken to evaluate (seconds)

This table is partitioned by date (YYYY-MM-DD format).

The section_evaluations table contains section-level evaluation metrics:

ColumnTypeDescription
document_idstringUnique identifier for the document
section_idstringIdentifier for the section
section_typestringType/class of the section
accuracydoubleSection accuracy score (0-1)
precisiondoubleSection precision score (0-1)
recalldoubleSection recall score (0-1)
f1_scoredoubleSection F1 score (0-1)
false_alarm_ratedoubleSection false alarm rate (0-1)
false_discovery_ratedoubleSection false discovery rate (0-1)
evaluation_datetimestampWhen the evaluation was performed

This table is partitioned by date (YYYY-MM-DD format).

The attribute_evaluations table contains attribute-level evaluation metrics:

ColumnTypeDescription
document_idstringUnique identifier for the document
section_idstringIdentifier for the section
section_typestringType/class of the section
attribute_namestringName of the attribute
expectedstringExpected (ground truth) value
actualstringActual extracted value
matchedbooleanWhether the values matched
scoredoubleMatch score (0-1)
reasonstringExplanation for the match result
evaluation_methodstringMethod used for comparison
confidencestringConfidence score from extraction
confidence_thresholdstringConfidence threshold used
evaluation_datetimestampWhen the evaluation was performed

This table is partitioned by date (YYYY-MM-DD format).

The rule validation tables store business rule validation results showing compliance and policy adherence for processed documents.

The rule_validation_summary table contains document-level rule validation results:

ColumnTypeDescription
document_idstringUnique identifier for the document
input_keystringS3 key of the input document
validation_datetimestampWhen the validation was performed
overall_statusstringOverall validation status (COMPLETE, FAILED, etc.)
total_rule_typesintNumber of rule types evaluated
total_rulesintTotal number of rules evaluated
pass_countintNumber of rules that passed
fail_countintNumber of rules that failed
information_not_found_countintNumber of rules where information was not found

This table is partitioned by date (YYYY-MM-DD format).

The rule_validation_details table contains individual rule validation results:

ColumnTypeDescription
document_idstringUnique identifier for the document
rule_typestringCategory/type of the rule
rulestringDescription of the specific rule being validated
recommendationstringValidation result (Pass, Fail, Information Not Found)
reasoningstringExplanation for the recommendation
supporting_pagesstringJSON array of page numbers supporting the validation
validation_datetimestampWhen the validation was performed

This table is partitioned by date (YYYY-MM-DD format).

The metering table captures detailed usage metrics and cost information for each document processing operation:

ColumnTypeDescription
document_idstringUnique identifier for the document
contextstringProcessing context (OCR, Classification, Extraction, etc.)
service_apistringSpecific API or model used (e.g., textract/analyze_document, bedrock/claude-3)
unitstringUnit of measurement (pages, inputTokens, outputTokens, etc.)
valuedoubleQuantity of the unit consumed
number_of_pagesintNumber of pages in the document
unit_costdoubleCost per unit in USD (e.g., cost per token, cost per page)
estimated_costdoubleCalculated total cost in USD (value × unit_cost)
timestamptimestampWhen the operation was performed

This table is partitioned by date (YYYY-MM-DD format).

The metering table now includes automated cost calculation capabilities:

  • unit_cost: Retrieved from pricing configuration for each service_api/unit combination
  • estimated_cost: Automatically calculated as value × unit_cost for each record
  • Dynamic Pricing: Costs are loaded from configuration and cached for performance
  • Fallback Handling: When pricing data is not available, unit_cost defaults to $0.0

Pricing data is loaded from the system configuration in the following format:

pricing:
- name: "bedrock/us.anthropic.claude-3-sonnet-20240229-v1:0"
units:
- name: "inputTokens"
price: "3.0e-6" # $0.000003 per input token
- name: "outputTokens"
price: "1.5e-5" # $0.000015 per output token
- name: "textract/analyze_document"
units:
- name: "pages"
price: "0.0015" # $0.0015 per page
  1. Service/Unit Matching: System attempts exact match for service_api/unit combination
  2. Partial Matching: If exact match fails, uses fuzzy matching for common patterns
  3. Cost Calculation: estimated_cost = value × unit_cost
  4. Caching: Pricing data is cached to avoid repeated configuration lookups

The metering table is particularly valuable for:

  • Cost analysis and allocation - Track spending by document type, service, or time period
  • Usage pattern identification - Analyze consumption patterns across different models
  • Resource optimization - Identify cost-effective processing approaches
  • Performance benchmarking - Compare cost efficiency across different document types and sizes
  • Budget monitoring - Track actual costs against budgets and forecasts

The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically created when new section types are encountered during document processing, eliminating the need for manual table creation.

When a document is processed and a new section type (classification) is detected, the system automatically:

  1. Creates a new Glue table for that section type (e.g., document_sections_invoice, document_sections_receipt, document_sections_w2)
  2. Configures the table with appropriate schema based on the extracted data
  3. Sets up partition projection for efficient date-based queries
  4. Updates the table schema if new fields are detected in subsequent documents

Important: Section type names are normalized to lowercase for consistency with case-sensitive S3 paths. For example, a section classified as “W2” will create a table named document_sections_w2 with data stored in document_sections/w2/.

Document sections are stored in dynamically created tables based on the section classification. Each section type gets its own table with the following characteristics:

Common Metadata Columns:

ColumnTypeDescription
section_idstringUnique identifier for the section
document_idstringUnique identifier for the document
section_classificationstringType/class of the section
section_confidencedoubleConfidence score for the section classification
timestamptimestampWhen the document was processed

Dynamic Data Columns: The remaining columns are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:

  • Nested JSON objects are flattened using dot notation (e.g., customer.name, customer.address.street)
  • Arrays are converted to JSON strings
  • Primitive values (strings, numbers, booleans) are preserved as their native types

Partitioning: Each section type table is partitioned by date (YYYY-MM-DD format) for efficient querying.

File Organization:

document_sections/
├── invoice/
│ └── date=2024-01-15/
│ ├── doc-123_section_1.parquet
│ └── doc-456_section_3.parquet
├── receipt/
│ └── date=2024-01-15/
│ └── doc-789_section_2.parquet
└── bank_statement/
└── date=2024-01-15/
└── doc-abc_section_1.parquet

The AWS Glue Crawler automatically discovers new section types and creates corresponding tables. The crawler can be configured to run:

  • Manually (on-demand)
  • Every 15 minutes
  • Every hour
  • Daily (default)

This ensures that new section types are automatically available for querying without manual intervention.

Amazon Athena provides a serverless query service to analyze data directly in Amazon S3. The reporting database tables are automatically registered in the AWS Glue Data Catalog, making them immediately available for querying in Athena.

To use the reporting database with Athena:

  1. Open the Amazon Athena console
  2. Select the database named after your stack (e.g., idp_reporting)
  3. Start querying the tables using standard SQL

Here are some example queries to get you started:

Overall accuracy by document type:

SELECT
section_type,
AVG(accuracy) as avg_accuracy,
COUNT(*) as document_count
FROM
section_evaluations
GROUP BY
section_type
ORDER BY
avg_accuracy DESC;

Token usage by model:

SELECT
service_api,
SUM(CASE WHEN unit = 'inputTokens' THEN value ELSE 0 END) as total_input_tokens,
SUM(CASE WHEN unit = 'outputTokens' THEN value ELSE 0 END) as total_output_tokens,
SUM(CASE WHEN unit = 'totalTokens' THEN value ELSE 0 END) as total_tokens,
COUNT(DISTINCT document_id) as document_count
FROM
metering
WHERE
context = 'Extraction'
GROUP BY
service_api
ORDER BY
total_tokens DESC;

Extraction confidence vs. accuracy:

SELECT
CASE
WHEN CAST(confidence AS double) < 0.7 THEN 'Low (<0.7)'
WHEN CAST(confidence AS double) < 0.9 THEN 'Medium (0.7-0.9)'
ELSE 'High (>0.9)'
END as confidence_band,
AVG(CASE WHEN matched THEN 1.0 ELSE 0.0 END) as accuracy,
COUNT(*) as attribute_count
FROM
attribute_evaluations
WHERE
confidence IS NOT NULL
GROUP BY
CASE
WHEN CAST(confidence AS double) < 0.7 THEN 'Low (<0.7)'
WHEN CAST(confidence AS double) < 0.9 THEN 'Medium (0.7-0.9)'
ELSE 'High (>0.9)'
END
ORDER BY
confidence_band;

Token usage per page by document type:

SELECT
se.section_type,
AVG(m.value / m.number_of_pages) as avg_tokens_per_page
FROM
metering m
JOIN
section_evaluations se ON m.document_id = se.document_id
WHERE
m.unit = 'totalTokens'
AND m.number_of_pages > 0
GROUP BY
se.section_type
ORDER BY
avg_tokens_per_page DESC;

Document sections analysis by type:

-- Query invoice sections for customer analysis
SELECT
document_id,
section_id,
"customer.name" as customer_name,
"customer.address.city" as customer_city,
"total_amount" as invoice_total,
date
FROM
invoice
WHERE
date BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY
date DESC;

Section processing volume by date:

-- Count sections processed by type and date
SELECT
date,
section_classification,
COUNT(*) as section_count,
COUNT(DISTINCT document_id) as document_count
FROM (
SELECT date, section_classification, document_id FROM invoice
UNION ALL
SELECT date, section_classification, document_id FROM receipt
UNION ALL
SELECT date, section_classification, document_id FROM bank_statement
)
GROUP BY
date, section_classification
ORDER BY
date DESC, section_count DESC;

Date range queries with new partition structure:

-- Efficient date range query using single date partition
SELECT
COUNT(*) as total_documents,
AVG(accuracy) as avg_accuracy
FROM
document_evaluations
WHERE
date BETWEEN '2024-01-01' AND '2024-01-31';
-- Monthly aggregation
SELECT
SUBSTR(date, 1, 7) as month,
COUNT(*) as document_count,
AVG(accuracy) as avg_accuracy
FROM
document_evaluations
WHERE
date >= '2024-01-01'
GROUP BY
SUBSTR(date, 1, 7)
ORDER BY
month;

Cost analysis queries:

-- Total estimated costs by service API
SELECT
service_api,
SUM(estimated_cost) as total_cost,
AVG(estimated_cost) as avg_cost_per_operation,
COUNT(*) as operation_count,
COUNT(DISTINCT document_id) as document_count
FROM
metering
WHERE
date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
service_api
ORDER BY
total_cost DESC;
-- Cost per page analysis by document type
SELECT
se.section_type,
SUM(m.estimated_cost) / SUM(m.number_of_pages) as cost_per_page,
SUM(m.estimated_cost) as total_cost,
SUM(m.number_of_pages) as total_pages,
COUNT(DISTINCT m.document_id) as document_count
FROM
metering m
JOIN
section_evaluations se ON m.document_id = se.document_id
WHERE
m.number_of_pages > 0
AND m.date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
se.section_type
ORDER BY
cost_per_page DESC;
-- Daily cost trends
SELECT
date,
SUM(estimated_cost) as daily_cost,
COUNT(DISTINCT document_id) as documents_processed,
SUM(estimated_cost) / COUNT(DISTINCT document_id) as avg_cost_per_document
FROM
metering
WHERE
date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
date
ORDER BY
date;
-- Most expensive documents
SELECT
document_id,
SUM(estimated_cost) as total_document_cost,
SUM(value) as total_units_consumed,
COUNT(*) as operations_count,
MAX(number_of_pages) as page_count
FROM
metering
WHERE
date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
document_id
ORDER BY
total_document_cost DESC
LIMIT 10;
-- Cost efficiency by model (cost per token)
SELECT
service_api,
SUM(estimated_cost) / SUM(value) as cost_per_token,
SUM(estimated_cost) as total_cost,
SUM(value) as total_tokens,
COUNT(DISTINCT document_id) as document_count
FROM
metering
WHERE
unit IN ('inputTokens', 'outputTokens', 'totalTokens')
AND date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
service_api
ORDER BY
cost_per_token ASC;
-- Cost breakdown by processing context
SELECT
context,
SUM(estimated_cost) as total_cost,
COUNT(DISTINCT document_id) as document_count,
SUM(estimated_cost) / COUNT(DISTINCT document_id) as avg_cost_per_document
FROM
metering
WHERE
date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY
context
ORDER BY
total_cost DESC;

For more advanced visualization and dashboarding:

  1. Use Amazon QuickSight to connect to your Athena queries
  2. Create interactive dashboards to monitor:
    • Extraction accuracy over time
    • Cost trends by document type
    • Performance metrics by model
    • Resource utilization patterns

You can also export query results to CSV or other formats for use with external business intelligence tools.