Error Analyzer (Troubleshooting Tool) - PREVIEW
Overview
The Error Analyzer is an AI-powered troubleshooting tool that helps diagnose and resolve document processing failures in the GenAI IDP Accelerator. It uses Amazon Bedrock’s Claude Sonnet 4 model with the Strands agent framework to analyze CloudWatch logs, DynamoDB tracking data, and Step Functions execution history, identify root causes, and provide actionable recommendations.
This tool is in preview and not yet mature; we expect to refine and improve its capabilities in successive releases. Try it, and give us feedback via GitHub Issues.
Key Capabilities
- Automatic Failure Diagnosis: AI agent automatically investigates document processing failures
- Intelligent Query Routing: Distinguishes between document-specific and system-wide analysis
- Multi-Source Analysis: Correlates data from CloudWatch Logs, DynamoDB, and Step Functions
- Contextual Recommendations: Provides specific guidance for configuration, operational, or code issues
- Real-Time Updates: Live job status with progress tracking and resumption capability
- Evidence-Based Analysis: Shows detailed log evidence supporting diagnostic conclusions
When to Use the Error Analyzer
- Document Processing Failures: Investigate why a specific document failed to process
- Recurring Error Patterns: Identify systemic issues affecting multiple documents
- Performance Investigation: Analyze timeout errors and processing bottlenecks
- System Health Checks: Review recent errors across the entire system
- Troubleshooting Support: Generate detailed error reports for support escalation
Architecture
System Design
flowchart TD
    UI[Web UI - TroubleshootModal] -->|GraphQL Mutation| Submit[submitAgentQuery]
    Submit --> Agent[Error Analyzer Agent]
    Agent -->|Route Query| Router{analyze_errors Tool}

    Router -->|Document-Specific| DocAnalysis[analyze_document_failure]
    Router -->|System-Wide| SysAnalysis[analyze_recent_system_errors]

    DocAnalysis --> GetContext[get_document_context]
    DocAnalysis --> SearchDocLogs[search_document_logs]

    SysAnalysis --> FindTable[find_tracking_table]
    SysAnalysis --> ScanDB[scan_dynamodb_table]
    SysAnalysis --> SearchStackLogs[search_stack_logs]

    GetContext --> DDB[(DynamoDB<br/>TrackingTable)]
    SearchDocLogs --> CW[(CloudWatch Logs)]
    SearchStackLogs --> CW
    ScanDB --> DDB

    DocAnalysis --> Result[Analysis Result]
    SysAnalysis --> Result
    Result -->|GraphQL Subscription| UI
Tool Ecosystem
The Error Analyzer uses 8 specialized tools organized in a modular architecture:
flowchart LR
    subgraph "Main Router"
        Router[analyze_errors]
    end

    subgraph "Document Analysis"
        DocFail[analyze_document_failure]
        GetCtx[get_document_context]
        SearchDoc[search_document_logs]
    end

    subgraph "System Analysis"
        SysErr[analyze_recent_system_errors]
        FindTbl[find_tracking_table]
        ScanTbl[scan_dynamodb_table]
        SearchStk[search_stack_logs]
    end

    Router --> DocFail
    Router --> SysErr
    DocFail --> GetCtx
    DocFail --> SearchDoc
    SysErr --> FindTbl
    SysErr --> ScanTbl
    SysErr --> SearchStk
Tool Descriptions:
- analyze_errors (Main Router)
  - Classifies query intent (document-specific vs system-wide)
  - Routes to appropriate analysis tool
  - Manages time range parsing
- analyze_document_failure (Document-Specific)
  - Investigates individual document failures
  - Retrieves execution context and Lambda request IDs
  - Searches document-specific logs
- analyze_recent_system_errors (System-Wide)
  - Analyzes error patterns across the system
  - Categorizes errors by type
  - Provides statistical summaries
- get_document_context (Lambda Integration)
  - Retrieves document tracking data
  - Extracts Step Functions execution ARN
  - Provides Lambda request IDs for tracing
- search_document_logs (CloudWatch)
  - Filters logs by document ObjectKey
  - Searches across multiple log groups
  - Returns events with timestamps and context
- search_stack_logs (CloudWatch)
  - System-wide log pattern matching
  - Multi-pattern prioritized search
  - Adaptive sampling for context management
- find_tracking_table (DynamoDB Discovery)
  - Locates TrackingTable by stack name
  - Validates table existence
- scan_dynamodb_table (DynamoDB Query)
  - Scans for failed documents
  - Filters by status and time range
  - Returns document metadata
Query Classification Logic
flowchart TD
    Query[User Query] --> Check{Document-Specific<br/>Pattern?}

    Check -->|Match| DocPattern["Patterns:<br/>• document: filename.pdf<br/>• file: report.docx<br/>• ObjectKey: path/file"]
    Check -->|No Match| GeneralPattern["General Queries:<br/>• Recent errors<br/>• System failures<br/>• Processing issues"]

    DocPattern --> DocAnalysis[Document-Specific<br/>Analysis]
    GeneralPattern --> SysAnalysis[System-Wide<br/>Analysis]

    DocAnalysis --> DocResult["Results:<br/>• Execution context<br/>• Document-specific logs<br/>• Lambda request IDs"]
    SysAnalysis --> SysResult["Results:<br/>• Error categories<br/>• Failed documents<br/>• Pattern statistics"]
Using the Error Analyzer
Via Web UI
Accessing the Troubleshoot Modal
- Navigate to Dashboard: Open the GenAI IDP Web UI
- Find Failed Document: Locate a document with FAILED status
- Click Troubleshoot Button: Opens the TroubleshootModal
- Automatic Analysis: Agent immediately begins analyzing the failure
Understanding the Interface
The TroubleshootModal displays:
- Document Information: Shows the ObjectKey being analyzed
- Status Indicator:
  - PENDING: Job submitted, waiting to start
  - PROCESSING: Agent actively analyzing with real-time messages
  - COMPLETED: Analysis finished, results available
  - FAILED: Analysis encountered an error
- Agent Messages: Live progress updates during processing
- Results Display: Formatted analysis with collapsible sections
- Job Resumption: If you close and reopen the modal, the existing job resumes
Reading Analysis Results
Results are structured in three sections:
1. Root Cause
The underlying technical reason for the failure. Focuses on the primary cause rather than symptoms.
Example: "Bedrock throttling exception due to exceeding token rate limits for the configured model."
2. Recommendations
Specific, actionable steps to resolve the issue. Limited to the top three recommendations with clear guidance.
Example:
• Increase provisioned throughput for the Bedrock model
• Adjust retry configuration in classification settings
• Consider using batch processing to reduce concurrent requests
3. Evidence (Collapsible)
<details>
<summary><strong>Evidence</strong></summary>
**Log Group:** /aws-stack-name/lambda/ClassificationFunction
**Log Stream:** 2025/01/03/[$LATEST]abc123def456
[ERROR] 2025-01-03T14:23:45.123Z ThrottlingException: Rate exceeded
</details>
Query Patterns
Document-Specific Queries
Use these patterns to analyze a specific document:
document: lending_package.pdf
file: bank_statement.docx
ObjectKey: uploads/2024/contract.pdf
The query must include one of the keywords document, file, or ObjectKey, followed immediately by a colon and then the filename.
System-Wide Queries
Use natural language for general analysis:
Find recent processing errors
What errors occurred in the last week?
Show me system failures
Summarize recent problems
Time Range Specifications
The agent maps time phrases in the query to a lookback window:
| Query Phrase | Time Range |
|---|---|
| "recent" or "recently" | 1 hour |
| "last hour" | 1 hour |
| "last day" or "yesterday" | 24 hours |
| "last week" | 168 hours (7 days) |
| No time specified | 24 hours (default) |
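The phrase-to-window mapping in the table can be sketched as a small lookup. This is an illustrative sketch only: `parse_time_range_hours` and `_TIME_PHRASES` are hypothetical names, and the agent's actual parsing may differ (for example, it may handle phrases such as "last 6 hours", which this sketch does not).

```python
import re

# Phrase-to-hours mapping taken from the table above (hypothetical structure).
# Specific phrases are checked before the generic "recent".
_TIME_PHRASES = [
    (r"\blast hour\b", 1),
    (r"\blast day\b|\byesterday\b", 24),
    (r"\blast week\b", 168),
    (r"\brecent(ly)?\b", 1),
]


def parse_time_range_hours(query: str, default_hours: int = 24) -> int:
    """Return the lookback window in hours implied by a query."""
    for pattern, hours in _TIME_PHRASES:
        if re.search(pattern, query, re.IGNORECASE):
            return hours
    # No recognized phrase: fall back to the configured default.
    return default_hours
```

A query without any recognized phrase, such as "Show me system failures", falls through to `default_hours`.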
Configuration
Agent Configuration in template.yaml
The Error Analyzer is configured in the CloudFormation template under the agents section of the configuration schema:
agents:
  error_analyzer:
    type: object
    sectionLabel: Error Analysis Agent
    properties:
      model_id:
        type: string
        enum:
          [
            "anthropic.claude-3-sonnet-20240229-v1:0",
            "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
            "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
            "us.anthropic.claude-sonnet-4-20250514-v1:0"
          ]
        default: "us.anthropic.claude-sonnet-4-20250514-v1:0"
      system_prompt:
        type: string
        format: textarea
      parameters:
        type: object
        properties:
          max_log_events:
            type: integer
            default: 5
          time_range_hours_default:
            type: integer
            default: 24
Configuration Parameters
model_id
Purpose: Selects the Bedrock model for error analysis
Recommended: us.anthropic.claude-sonnet-4-20250514-v1:0
- Superior reasoning for complex error diagnosis
- Better structured output formatting
- More accurate root cause identification
Alternative Options:
- us.anthropic.claude-3-7-sonnet-20250219-v1:0: Good balance of cost and capability
- us.anthropic.claude-3-5-sonnet-20241022-v2:0: Cost-effective for simple errors
- anthropic.claude-3-sonnet-20240229-v1:0: Legacy option
system_prompt
Purpose: Defines agent behavior and response formatting
Key Requirements:
- Enforce three-section structure (Root Cause, Recommendations, Evidence)
- Specify evidence formatting with collapsible HTML details
- Define recommendation guidelines for different issue types
- Set time range parsing rules
Default Prompt Highlights:
You are an intelligent error analysis agent for the GenAI IDP system.
ALWAYS format your response with exactly these three sections:
## Root Cause
## Recommendations
<details><summary><strong>Evidence</strong></summary>...</details>
RECOMMENDATION GUIDELINES:
For code/system bugs: Do not suggest code modifications
For configuration issues: Direct users to UI configuration panel
For operational issues: Provide immediate troubleshooting steps
parameters.max_log_events
Purpose: Limits log events returned to manage context window
Default: 5 events
Range: 1-100
Considerations:
- Higher values provide more context but consume more tokens
- For system-wide analysis, uses adaptive sampling across patterns
- Individual log messages are truncated if exceeding 200 characters
Tuning Guidance:
- Simple errors: 3-5 events sufficient
- Complex investigations: 10-20 events
- Pattern analysis: 20-50 events
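As a rough illustration of how the event cap and the 200-character truncation interact, consider the sketch below. The helper names are hypothetical, and whether the 200-character limit includes the truncation marker is an assumption (here the marker is appended after the cut).

```python
from typing import List

# Marker appended to shortened messages (matches the wording used in Evidence output).
TRUNCATION_SUFFIX = "… [truncated]"


def truncate_message(message: str, limit: int = 200) -> str:
    """Cut a log message to `limit` characters and mark it as truncated."""
    if len(message) <= limit:
        return message
    return message[:limit] + TRUNCATION_SUFFIX


def limit_events(messages: List[str], max_log_events: int = 5) -> List[str]:
    """Keep at most `max_log_events` messages, truncating each one."""
    return [truncate_message(m) for m in messages[:max_log_events]]
```

Raising `max_log_events` grows the prompt roughly linearly, since each retained message contributes at most about 200 characters plus the marker.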
parameters.time_range_hours_default
Purpose: Default lookback period when not specified in query
Default: 24 hours (1 day)
Range: 1-168 hours (1 week)
Considerations:
- Longer ranges increase CloudWatch Logs query time
- Wider time windows may return less relevant results
- Balance between coverage and performance
Tuning Guidance:
- Active development: 1-6 hours
- Production monitoring: 24 hours
- Post-mortem analysis: 72-168 hours
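To make the lookback period concrete, the sketch below converts a number of hours into the epoch-millisecond start and end times that CloudWatch Logs queries expect. `lookback_window_ms` is a hypothetical helper, and clamping to the 1-168 range is an assumption based on the range stated above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def lookback_window_ms(hours: int, now: Optional[datetime] = None) -> Tuple[int, int]:
    """Return (start_ms, end_ms) for a lookback of `hours`, clamped to 1-168."""
    hours = max(1, min(hours, 168))
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    # CloudWatch Logs timestamps are milliseconds since the Unix epoch.
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)
```

Passing `now` explicitly keeps the function deterministic, which is convenient when testing.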
Configuration Example (config.yaml)
agents:
  error_analyzer:
    model_id: us.anthropic.claude-sonnet-4-20250514-v1:0
    system_prompt: |
      You are an intelligent error analysis agent for the GenAI IDP system.

      Use the analyze_errors tool to investigate issues. ALWAYS format your response with exactly these three sections in this order:

      ## Root Cause
      Identify the specific underlying technical reason why the error occurred.

      ## Recommendations
      Provide specific, actionable steps to resolve the issue. Limit to top three recommendations only.

      <details>
      <summary><strong>Evidence</strong></summary>
      Format log entries with their source information...
      </details>
    parameters:
      max_log_events: 5
      time_range_hours_default: 24
Understanding Results
Root Cause Analysis
The Root Cause section identifies the underlying technical reason for the failure, not just symptoms.
Good Root Cause Examples:
✓ "Bedrock model returned ValidationException due to malformed JSON in the extraction prompt, caused by unescaped special characters in attribute descriptions"
✓ "Lambda function timeout after 900 seconds during assessment processing, triggered by processing a document with 247 pages exceeding memory limits"
✓ "Access denied error when reading OCR results from S3, caused by missing kms:Decrypt permission on the customer-managed encryption key"
Poor Root Cause Examples (too vague):
✗ "The document failed to process"
✗ "There was an error in the system"
✗ "Lambda function had a problem"
Recommendations
Recommendations are specific, actionable steps tailored to the issue type.
Configuration-Related Recommendations
For configuration issues, the agent directs users to the UI:
Recommendations:
• Navigate to Configuration panel in the Web UI
• Update 'extraction.model' to use a higher-capacity model
• Adjust 'assessment.granular.max_workers' from 4 to 2 to reduce memory pressure
Operational Recommendations
For operational issues, the agent provides immediate troubleshooting steps:
Recommendations:
• Retry the failed document - error appears transient
• Check AWS Service Health Dashboard for Bedrock service issues
• Monitor CloudWatch metrics for throttling patterns in next 30 minutes
Code/System Bug Recommendations
For code issues, the agent focuses on reporting, not fixing:
Recommendations:
• Report to development team with error details and timestamp
• Include Lambda request ID: abc-123-def-456 for debugging
• Avoid reprocessing this document type until patch is deployed
Evidence Section
The Evidence section provides verifiable log data supporting the analysis.
Structure:
<details>
<summary><strong>Evidence</strong></summary>
**Log Group:** /aws-stack-name/lambda/ExtractionFunction
**Log Stream:** 2025/01/03/[$LATEST]abc123def456
**Events:**
[ERROR] 2025-01-03T15:42:13.456Z RequestId: xyz-789 ValidationException: JSON parsing error at line 42
</details>
Reading Evidence:
- Log Group: Identifies which Lambda function encountered the error
- Log Stream: Provides exact execution instance for deep-dive investigation
- Events: Shows actual error messages with timestamps
- Truncation: Long messages are cut off and marked with "… [truncated]" for readability
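The Evidence layout described above could be assembled along these lines. This is a minimal sketch with a hypothetical `format_evidence` helper; in practice the model produces this block by following the system prompt, not by calling a function like this.

```python
from typing import List


def format_evidence(log_group: str, log_stream: str, events: List[str]) -> str:
    """Render log evidence as a collapsible HTML details block."""
    lines = [
        "<details>",
        "<summary><strong>Evidence</strong></summary>",
        "",
        f"**Log Group:** {log_group}",
        f"**Log Stream:** {log_stream}",
        "**Events:**",
        *events,  # one log line per event, already truncated by the caller
        "",
        "</details>",
    ]
    return "\n".join(lines)
```

Keeping the opening and closing `details` tags in one place makes it harder to emit an unbalanced block.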
Advanced Features
Intelligent Query Classification
The agent uses regex pattern matching to determine analysis type:
# Document-specific patterns (require colon immediately after keyword)
document: filename.pdf    # Matches
file: report.docx         # Matches
ObjectKey: path/file      # Matches

# General analysis (no specific pattern)
recent errors             # System-wide
find failures             # System-wide
what happened             # System-wide
Pattern Detection Logic:
If query matches "document:\s*([^\s]+)" → Document-Specific Analysis
If query matches "file:\s*([^\s]+)" → Document-Specific Analysis
If query matches "ObjectKey:\s*([^\s]+)" → Document-Specific Analysis
Otherwise → System-Wide Analysis
Multi-Pattern Error Detection
System-wide analysis uses prioritized pattern matching:
flowchart TD
    Start[System-Wide Query] --> P1[Pattern 1: ERROR<br/>Priority: High<br/>Max Events: 5]
    P1 --> P2[Pattern 2: Exception<br/>Priority: Medium<br/>Max Events: 3]
    P2 --> P3[Pattern 3: ValidationException<br/>Priority: Medium<br/>Max Events: 2]
    P3 --> P4[Pattern 4: Failed<br/>Priority: Low<br/>Max Events: 2]
    P4 --> P5[Pattern 5: Timeout<br/>Priority: Low<br/>Max Events: 1]
    P5 --> Dedupe[Deduplication &<br/>Filtering]
    Dedupe --> Result[Final Event Set]
Prioritization Strategy:
- ERROR: Highest priority, captures 5 events
- Exception: Important errors, captures 3 events
- ValidationException: Specific validation issues, 2 events
- Failed: General failures, 2 events
- Timeout: Performance issues, 1 event
Context Management:
- Respects max_log_events parameter as total limit
- Uses adaptive sampling across patterns
- Deduplicates similar error messages
- Truncates long messages at 200 characters
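The prioritized sampling described above can be approximated as follows. This is a simplified sketch, not the tool's actual algorithm: `sample_events` is a hypothetical name, matching is a case-sensitive substring check, and deduplication only removes exact duplicates, whereas the real tool is described as deduplicating similar messages.

```python
from typing import List, Tuple

# (pattern, per-pattern cap) in priority order, as listed above.
PATTERN_BUDGETS: List[Tuple[str, int]] = [
    ("ERROR", 5),
    ("Exception", 3),
    ("ValidationException", 2),
    ("Failed", 2),
    ("Timeout", 1),
]


def sample_events(messages: List[str], max_log_events: int = 5) -> List[str]:
    """Pick events pattern by pattern, skipping duplicates, up to a total cap."""
    selected: List[str] = []
    seen = set()
    for pattern, budget in PATTERN_BUDGETS:
        taken = 0
        for message in messages:
            if len(selected) >= max_log_events:
                return selected  # total limit reached
            if taken >= budget:
                break  # this pattern's budget is spent
            if pattern in message and message not in seen:
                seen.add(message)
                selected.append(message)
                taken += 1
    return selected
```

Because higher-priority patterns are processed first, a low `max_log_events` value tends to fill up with ERROR lines before lower-priority patterns get a chance.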
Error Categorization
System-wide analysis categorizes errors for pattern identification:
Category Definitions:
- validation_errors
  - Keywords: "validation", "invalid"
  - Indicates data quality or format issues
  - Often fixable through configuration
- processing_errors
  - Keywords: "exception", "error"
  - Core processing failures
  - May require code fixes
- timeout_errors
  - Keywords: "timeout"
  - Performance/resource issues
  - Adjustable through memory/timeout settings
- access_errors
  - Keywords: "access", "denied"
  - Permission problems
  - Requires IAM policy updates
- system_errors
  - Catch-all for other errors
  - Infrastructure or service issues
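The keyword rules above can be sketched as a simple classifier. The function names are hypothetical, and the check order is an assumption: the specific categories are tested before the broad "exception"/"error" rule so that, for instance, a ValidationException is not counted as a generic processing error.

```python
from collections import Counter
from typing import Iterable

# Keyword rules from the definitions above; the ordering is an assumption.
CATEGORY_KEYWORDS = [
    ("validation_errors", ("validation", "invalid")),
    ("timeout_errors", ("timeout",)),
    ("access_errors", ("access", "denied")),
    ("processing_errors", ("exception", "error")),
]


def categorize(message: str) -> str:
    """Return the first category whose keyword appears in the message."""
    text = message.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return "system_errors"  # catch-all


def summarize(messages: Iterable[str]) -> Counter:
    """Count messages per category."""
    return Counter(categorize(m) for m in messages)
```

Note that a plain substring check for "timeout" does not match a message such as "Task timed out", so a real implementation would likely need broader matching than this sketch uses.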
Category Summary Example:
{
  "error_categories": {
    "validation_errors": {
      "count": 12,
      "sample": "ValidationException: Invalid attribute schema..."
    },
    "timeout_errors": {
      "count": 5,
      "sample": "Task timed out after 900.00 seconds"
    }
  }
}
Job Resumption
The Web UI maintains job state for seamless resumption:
stateDiagram-v2
    [*] --> Creating: Open Modal
    Creating --> Pending: Job Created
    Pending --> Processing: Agent Starts
    Processing --> Completed: Success
    Processing --> Failed: Error

    Processing --> Stored: User Closes Modal
    Stored --> Processing: User Reopens Modal

    Completed --> [*]: User Closes
    Failed --> [*]: User Closes
How It Works:
- Job Creation: Modal creates job with unique jobId
- Parent Tracking: Component stores job state in parent
- Modal Close: Job continues running in background
- Modal Reopen: Automatically resumes displaying existing job
- Status Updates: Real-time updates via GraphQL subscription
User Experience:
- Users can close modal without losing analysis
- Reopening shows current progress or final results
- No need to re-submit for in-progress or completed jobs
- New jobs only created when previous job is COMPLETED/FAILED
Best Practices
When to Use the Error Analyzer
✓ Ideal Use Cases
- Document Processing Failures
  Scenario: Specific document failed with FAILED status
  Query: "document: customer_form_2024.pdf"
  Benefit: Pinpoints exact Lambda function and error cause
- Recurring Error Patterns
  Scenario: Multiple documents failing with similar errors
  Query: "What errors occurred in the last 6 hours?"
  Benefit: Identifies systemic issues affecting multiple documents
- Performance Investigation
  Scenario: Documents timing out during processing
  Query: "Show me timeout errors in the last day"
  Benefit: Reveals resource constraints and bottlenecks
- Post-Deployment Validation
  Scenario: New configuration deployed, checking for issues
  Query: "Recent processing errors"
  Benefit: Quick health check after changes
- Support Ticket Creation
  Scenario: Need detailed error report for escalation
  Query: "document: problem_file.pdf"
  Benefit: Generates formatted report with evidence
✗ Not Suitable For
- Pre-deployment Testing: Use evaluation tools and test sets instead
- Performance Optimization: Use CloudWatch metrics and dashboards
- Capacity Planning: Use monitoring and reporting features
- Cost Analysis: Use the cost calculator and pricing reports
Query Formulation Best Practices
Be Specific with Document IDs
Good ✓
document: lending_package_2024_Q1.pdf
file: bank_statement_january.docx
ObjectKey: uploads/healthcare/prior_auth_12345.pdf

Poor ✗
document lending package    # Missing colon
find document               # Too vague
check that failed file      # No specific ID
Use Appropriate Time Ranges
Good ✓
Show errors in the last hour    # Recent issues
What happened yesterday?        # Specific timeframe
Recent system failures          # "recent" maps to a 1-hour window

Poor ✗
Show all errors ever    # Too broad, slow query
Find problems           # No time context
Check everything        # Vague and expensive
Let the Agent Classify Intent
Good ✓
document: contract.pdf      # Clear document-specific
Recent validation errors    # Clear system-wide
What went wrong today?      # Natural language OK

Poor ✗
Analyze document contract.pdf             # Ambiguous format
Find errors for file: x.pdf and system    # Mixed intents
Interpreting Results Effectively
1. Focus on Root Cause, Not Symptoms
Example Analysis:
Root Cause: Lambda function exhausted 4096 MB memory limit while processing a 150-page document with high-resolution images during OCR conversion

Symptoms (don't focus on these):
- Lambda timeout after 15 minutes
- No results written to S3
- Document stuck in PROCESSING status
Action: Address the root cause (memory limit) rather than symptoms (timeout).
2. Prioritize Top Recommendations
The agent limits recommendations to the three most impactful actions:
Recommendation Priority:
1. Immediate Fix: Increase OCR Lambda memory to 8192 MB
2. Short-term: Implement image preprocessing to reduce resolution
3. Long-term: Add document size validation before processing
Don’t try to implement all suggestions at once - start with #1.
3. Use Evidence for Verification
Cross-reference recommendations with evidence:
Recommendation: "Increase Lambda memory allocation"

Evidence Validation:
✓ Log shows: "@maxMemoryUsed: 4089 MB" (near 4096 MB limit)
✓ Event type: "Task timed out after 900.00 seconds"
✓ Pattern: Occurs on documents with >100 pages
If the evidence doesn’t support a recommendation, request clarification.
4. Consider Error Categories
System-wide analysis categorizes errors:
Categories Found:
- validation_errors: 15 (most common)
- timeout_errors: 3
- access_errors: 1

Action: Focus on validation_errors first as they affect most documents
Configuration Best Practices
Model Selection
Choose model based on error complexity:
# Simple validation errors, frequent analysis
model_id: us.anthropic.claude-3-5-sonnet-20241022-v2:0

# Complex multi-component failures, critical analysis
model_id: us.anthropic.claude-sonnet-4-20250514-v1:0 # Recommended

# Legacy support only
model_id: anthropic.claude-3-sonnet-20240229-v1:0
Adjusting max_log_events
Tune based on analysis type:
parameters:
  # Development/testing - need detailed context
  max_log_events: 20

  # Production monitoring - balance detail and cost
  max_log_events: 5 # Default

  # Quick health checks - minimize cost
  max_log_events: 3
Time Range Optimization
Set default based on deployment frequency:
parameters:
  # Frequent deployments (multiple per day)
  time_range_hours_default: 6

  # Daily deployments
  time_range_hours_default: 24 # Default

  # Weekly deployments
  time_range_hours_default: 72
Troubleshooting Common Issues
Agent Not Available
Symptom: Error message “Error-Analyzer-Agent-v1 agent is not available”
Causes:
- Agent configuration not deployed
- Stack configuration outdated
- Agent ID mismatch
Resolution:
1. Check template.yaml includes agents.error_analyzer section
2. Verify configuration deployed: aws cloudformation describe-stacks --stack-name <stack-name>
3. Check available agents in Web UI Configuration panel
4. Redeploy stack if configuration missing
Job Timeout or Failure
Symptom: Job status shows FAILED or times out
Causes:
- Lambda function timeout (15 min limit)
- Insufficient memory
- Invalid permissions
- Bedrock throttling
Resolution:
1. Check CloudWatch Logs for agent Lambda function: /aws/<stack-name>/lambda/AgentFunction
2. Look for specific error messages:
   - "Task timed out" → Increase memory or reduce query scope
   - "AccessDeniedException" → Check IAM permissions
   - "ThrottlingException" → Wait and retry
3. For document-specific queries, ensure document exists:
   - Verify ObjectKey is correct
   - Check document in DynamoDB TrackingTable
Incomplete Analysis Results
Symptom: Analysis missing Root Cause or Recommendations sections
Causes:
- Model output formatting issue
- System prompt not enforced
- Token limit exceeded
Resolution:
1. Verify system_prompt includes formatting requirements: "ALWAYS format your response with exactly these three sections"
2. Check model_id is using recommended Claude Sonnet 4: us.anthropic.claude-sonnet-4-20250514-v1:0
3. If token limit reached, reduce max_log_events or time range
Permission-Related Issues
Symptom: “Access denied” or “Permission denied” errors
Causes:
- Missing CloudWatch Logs permissions
- DynamoDB access denied
- KMS key permissions
Resolution:
IAM Permissions Required:
- CloudWatch Logs:
  * logs:FilterLogEvents
  * logs:DescribeLogGroups
  * logs:DescribeLogStreams
- DynamoDB:
  * dynamodb:GetItem
  * dynamodb:Query
  * dynamodb:Scan
- KMS (if using customer-managed keys):
  * kms:Decrypt
  * kms:DescribeKey

Check Lambda execution role has these permissions.
Evidence Section Not Showing
Symptom: Evidence section is empty or missing
Causes:
- No matching log events in time range
- CloudWatch log retention expired
- Incorrect log group names
Resolution:
1. Increase time range: "Show errors in the last week"
2. Check log retention in CloudWatch console
3. Verify log group naming convention: /{StackName}/lambda/{FunctionName}
4. Use system-wide query to check if any logs available
Technical Details
Integration Points
AppSync GraphQL API
Mutations:
mutation SubmitAgentQuery {
  submitAgentQuery(
    query: "document: lending_package.pdf"
    agentIds: ["Error-Analyzer-Agent-v1"]
  ) {
    jobId
    status
  }
}
Queries:
query GetAgentJobStatus($jobId: ID!) {
  getAgentJobStatus(jobId: $jobId) {
    jobId
    status
    result
    agent_messages
    error
  }
}
Subscriptions:
subscription OnAgentJobComplete($jobId: ID!) {
  onAgentJobComplete(jobId: $jobId) {
    jobId
    status
  }
}
CloudWatch Logs Integration
Log Group Discovery:
# Pattern for stack log groups
log_group_pattern = f"/{stack_name}/lambda/"

# Searches across:
#   /stack-name/lambda/OCRFunction
#   /stack-name/lambda/ClassificationFunction
#   /stack-name/lambda/ExtractionFunction
#   /stack-name/lambda/AssessmentFunction
#   /stack-name/lambda/SummarizationFunction
Log Filtering:
# Document-specific filter
filter_pattern = f'"ObjectKey" = "{document_id}" "ERROR"'

# System-wide patterns (prioritized)
patterns = ["ERROR", "Exception", "ValidationException", "Failed", "Timeout"]
DynamoDB Tracking Integration
Table Schema (relevant fields):
{
    "ObjectKey": "uploads/document.pdf",        # Partition key
    "ObjectStatus": "FAILED",                   # Document status
    "ExecutionArn": "arn:aws:states:...",       # Step Functions ARN
    "CompletionTime": "2025-01-03T15:30:00Z",
    "ErrorMessage": "Processing failed...",     # Optional error
    "LastModified": "2025-01-03T15:30:00Z"
}
Query Patterns:
# Find document by ObjectKey
response = table.get_item(Key={"ObjectKey": document_id})

# Scan for recent failures
response = table.scan(
    FilterExpression="ObjectStatus = :status AND CompletionTime > :time",
    ExpressionAttributeValues={
        ":status": "FAILED",
        ":time": threshold_timestamp
    }
)
Step Functions Integration
Execution Context:
# Extract execution ID from DynamoDB
execution_arn = "arn:aws:states:us-east-1:123456789012:execution:StateMachine:abc-123"
execution_id = execution_arn.split(":")[-1]  # "abc-123"

# Used for log filtering
filter_pattern = f'"execution_id" = "{execution_id}"'
Tool Implementation Reference
analyze_errors (Main Router)
Location: lib/idp_common_pkg/idp_common/agents/error_analyzer/tools/error_analysis_tool.py
Function Signature:
@tool
def analyze_errors(query: str, time_range_hours: int = 1) -> Dict[str, Any]:
    """
    Intelligent error analysis with precise query classification.

    Args:
        query: User's error analysis query
        time_range_hours: Hours to look back (default: 1, uses config default)

    Returns:
        Dict containing analysis results or error information
    """
Classification Logic:
import re
from typing import Tuple


def _classify_query_intent(query: str) -> Tuple[str, str]:
    """
    Classify query as document-specific vs general system analysis.

    Returns:
        Tuple of (intent_type, document_id)
        - intent_type: "document_specific" or "general_analysis"
        - document_id: Extracted document ID or empty string
    """
    # Each pattern requires the keyword, a colon, and a non-whitespace ID.
    specific_doc_patterns = [
        r"document:\s*([^\s]+)",
        r"file:\s*([^\s]+)",
        r"ObjectKey:\s*([^\s]+)",
    ]

    for pattern in specific_doc_patterns:
        match = re.search(pattern, query, re.IGNORECASE)
        if match:
            return ("document_specific", match.group(1).strip())

    # No document keyword found: fall back to system-wide analysis.
    return ("general_analysis", "")
analyze_document_failure
Location: lib/idp_common_pkg/idp_common/agents/error_analyzer/tools/document_analysis_tool.py
Purpose: Document-specific failure analysis
Key Operations:
- Retrieves document context from DynamoDB
- Searches CloudWatch logs filtered by ObjectKey
- Extracts Lambda request IDs for tracing
- Correlates execution context with errors
analyze_recent_system_errors
Location: lib/idp_common_pkg/idp_common/agents/error_analyzer/tools/general_analysis_tool.py
Purpose: System-wide error pattern analysis
Key Operations:
- Scans DynamoDB for recent failures
- Multi-pattern CloudWatch log search
- Error categorization and statistics
- Adaptive sampling for context management
Performance Considerations
CloudWatch Logs Queries:
- Each query scans specified time range across log groups
- Longer time ranges increase query latency
- Max 10,000 events per FilterLogEvents call
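Because a single FilterLogEvents call returns a bounded page of results, a search that needs only a handful of events can stop paginating early. The sketch below shows one way to do that with the boto3 `filter_log_events` paginator; `fetch_error_events` is a hypothetical helper and not the tool's actual implementation.

```python
from typing import Any, Dict, List


def fetch_error_events(
    logs_client: Any,
    log_group: str,
    filter_pattern: str,
    start_ms: int,
    end_ms: int,
    max_events: int = 5,
) -> List[Dict[str, Any]]:
    """Collect up to `max_events` matching events, following pagination."""
    paginator = logs_client.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=log_group,
        filterPattern=filter_pattern,
        startTime=start_ms,  # epoch milliseconds
        endTime=end_ms,
    )
    events: List[Dict[str, Any]] = []
    for page in pages:
        for event in page.get("events", []):
            events.append(event)
            if len(events) >= max_events:
                return events  # stop early to avoid scanning further pages
    return events
```

In real use, `logs_client` would be created with `boto3.client("logs")`; the function takes the client as a parameter so it can be exercised with a stub.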
Cost Optimization:
# Efficient queries
max_log_events = 5        # Minimal context window usage
time_range_hours = 1      # Recent errors only

# Expensive queries (use sparingly)
max_log_events = 50       # Large context window
time_range_hours = 168    # Full week scan
Token Usage:
- System prompt: ~800 tokens
- Log events: ~100-200 tokens each
- Analysis response: ~500-1000 tokens
- Total per query: ~2000-4000 tokens average
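The per-component figures above can be combined into a rough budget. The sketch below is back-of-the-envelope arithmetic only: `estimate_tokens` is a hypothetical helper, and it ignores the user query and tool-call overhead, which is one reason its totals can differ from the average quoted above.

```python
from typing import Tuple

SYSTEM_PROMPT_TOKENS = 800       # approximate, from the list above
TOKENS_PER_EVENT = (100, 200)    # low and high estimate per log event
RESPONSE_TOKENS = (500, 1000)    # low and high estimate for the analysis


def estimate_tokens(max_log_events: int) -> Tuple[int, int]:
    """Return a (low, high) token estimate for one analysis query."""
    low = SYSTEM_PROMPT_TOKENS + max_log_events * TOKENS_PER_EVENT[0] + RESPONSE_TOKENS[0]
    high = SYSTEM_PROMPT_TOKENS + max_log_events * TOKENS_PER_EVENT[1] + RESPONSE_TOKENS[1]
    return low, high
```

With the default of 5 events this gives 1,800 to 2,800 tokens; raising `max_log_events` to 20 gives 3,300 to 5,800, so the event count is the main lever on cost.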
Related Documentation
- Troubleshooting Guide: General troubleshooting for common issues (manual steps)
- Use the Error Analyzer for automated diagnosis
- Refer to Troubleshooting Guide for manual resolution steps, performance tuning, and infrastructure issues
- Monitoring: CloudWatch dashboards and metrics
- Web UI: User interface features and navigation
- Architecture: Overall system architecture
- Configuration: Configuration management
Error Analyzer vs Manual Troubleshooting
Use Error Analyzer for:
- Document processing failures (root cause analysis)
- Recent error patterns across the system
- Automated log correlation and diagnosis
- Quick troubleshooting with AI-powered recommendations
Use Manual Troubleshooting Guide for:
- Infrastructure and deployment issues
- Performance optimization and tuning
- Security and authentication problems
- Build and configuration management
- DLQ processing and queue management
How does the Error Analyzer differ from CloudWatch Insights?
Error Analyzer:
- AI-powered root cause identification
- Automated correlation across services
- Natural language query interface
- Actionable recommendations
- Integrated with IDP workflow
CloudWatch Insights:
- Manual query writing required
- Log groups must be selected manually for each query
- Technical query language
- Raw log data output
- Generic AWS service
Can I customize the system prompt?
Yes, the system prompt is fully customizable in the configuration:
- Navigate to Configuration panel in Web UI
- Expand “Agent Configuration” section
- Edit “Error Analysis Agent” → “system_prompt”
- Save configuration
Caution: Modifying the system prompt may affect output formatting and quality.
How many concurrent analysis jobs can run?
The Error Analyzer supports:
- Multiple users: Each can have active jobs
- Job per document: One active job per user per document
- System-wide queries: Unlimited concurrent queries
- Resource limits: Subject to Lambda concurrency and Bedrock quotas
What happens if analysis times out?
Timeout Handling:
- Lambda has 15-minute timeout
- Job status set to FAILED
- Partial results (if any) are saved
- User can retry with narrower scope:
  - Reduce time_range_hours
  - Reduce max_log_events
  - Use document-specific query
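The retry guidance above can be sketched as a loop that halves the scope after each timeout. Both helpers and the `run_analysis` callable are hypothetical stand-ins for however the analysis is submitted in a given deployment; the Error Analyzer does not retry automatically according to this document.

```python
def narrowing_scopes(time_range_hours, max_log_events, min_hours=1, min_events=5):
    """Yield progressively narrower (hours, events) pairs, halving each step."""
    hours, events = time_range_hours, max_log_events
    while True:
        yield hours, events
        if hours <= min_hours and events <= min_events:
            return  # already at the narrowest scope
        hours = max(min_hours, hours // 2)
        events = max(min_events, events // 2)


def analyze_with_retry(run_analysis, time_range_hours=24, max_log_events=50):
    """Call run_analysis with narrower scopes until it stops timing out."""
    for hours, events in narrowing_scopes(time_range_hours, max_log_events):
        try:
            return run_analysis(time_range_hours=hours, max_log_events=events)
        except TimeoutError:
            continue  # try again with a narrower scope
    raise TimeoutError("analysis timed out even at the narrowest scope")
```

If the narrowest system-wide scope still times out, switching to a document-specific query is the remaining option listed above.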
Can I export analysis results?
Export Options:
- Copy from UI: Select and copy formatted text
- API Access: Use getAgentJobStatus query
- CloudWatch Logs: Agent logs contain full results
- Future Enhancement: Export to PDF/JSON (roadmap)
How long are analysis results retained?
Retention Policy:
- In-memory: Active jobs only
- DynamoDB: Not persisted (stateless)
- CloudWatch Logs: Per log group retention (default: 7-90 days)
- Recommendation: Screenshot or copy important analyses
Does the analyzer work with custom Lambda functions?
Yes, if custom Lambda functions:
- Write to CloudWatch Logs with stack-based log group names
- Include ObjectKey in log messages
- Follow standard error logging patterns
The analyzer will automatically discover and search these logs.
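A custom function that meets these requirements might log errors along the following lines. This is a minimal sketch: the logger name, the `process_document` wrapper, and the JSON field layout are assumptions, since the exact message format the analyzer expects is not specified here. The only documented requirement it demonstrates is that the ObjectKey appears in the log message.

```python
import json
import logging

# In AWS Lambda, records that reach the root logger's handler are written
# to the function's CloudWatch log group.
logger = logging.getLogger("custom_processor")  # hypothetical logger name
logger.setLevel(logging.INFO)


def process_document(object_key, handler):
    """Run handler(object_key) and log any failure with the ObjectKey included."""
    try:
        result = handler(object_key)
    except Exception as exc:
        # Put the ObjectKey in the message itself so a log search
        # for the document key finds this event.
        logger.error(json.dumps({
            "level": "ERROR",
            "ObjectKey": object_key,
            "error": type(exc).__name__,
            "message": str(exc),
        }))
        raise
    logger.info(json.dumps({"level": "INFO", "ObjectKey": object_key,
                            "message": "processing succeeded"}))
    return result
```

Re-raising after logging keeps the failure visible to the calling workflow while still leaving a searchable log entry.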
Limitations
Current Limitations
- Single Agent: Only Error-Analyzer-Agent-v1 supported
- English Only: Optimized for English log messages
- AWS Services: CloudWatch and DynamoDB only (no external logs)
- Pattern Matching: Regex-based classification may miss edge cases
- Context Window: Limited by Bedrock model token limits
Known Issues
- Long Document IDs: ObjectKeys >200 characters may be truncated
- Special Characters: Some Unicode in logs may cause parsing issues
- High Volume: Systems with >1000 errors/hour may hit throttling
- Multi-Region: Analyzer only searches current region
Future Enhancements
- Multi-language Support: Non-English log analysis
- Custom Patterns: User-defined error patterns
- Trend Analysis: Historical error pattern tracking
- Predictive Alerts: Proactive failure prediction
- Export Features: PDF/JSON report generation
- Integration: Slack/Teams notifications
Summary
The Error Analyzer is a powerful AI-driven troubleshooting tool that:
✓ Automates failure diagnosis with AI-powered analysis
✓ Accelerates root cause identification from hours to minutes
✓ Correlates data across CloudWatch, DynamoDB, and Step Functions
✓ Provides actionable, context-specific recommendations
✓ Integrates seamlessly with the Web UI workflow
✓ Supports both document-specific and system-wide analysis
For optimal results:
- Use Claude Sonnet 4 model for complex errors
- Be specific with document IDs in queries
- Focus on root causes, not symptoms
- Verify recommendations with evidence
- Adjust configuration based on deployment patterns
The Error Analyzer significantly reduces troubleshooting time and improves operational efficiency for GenAI IDP deployments.