Troubleshooting Guide
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
This guide provides solutions for common issues and optimization techniques for the GenAIIDP solution.
AI-Powered Error Analysis
For automated troubleshooting, use the Error Analyzer tool:
- What it is: AI-powered agent that automatically diagnoses document processing failures
- When to use: Document-specific failures, system-wide error patterns, performance issues
- How to access: Web UI → Failed document → Troubleshoot button
- Documentation: See Error Analyzer for complete guide
Quick Start:

    # Document-specific analysis
    Query: "document: filename.pdf"

    # System-wide analysis
    Query: "Show recent processing errors"

The Error Analyzer automatically:
- Searches CloudWatch Logs across all Lambda functions
- Correlates errors with DynamoDB tracking data
- Identifies root causes with AI reasoning
- Provides actionable recommendations
For issues not covered by the Error Analyzer, use the manual troubleshooting steps below.
Common Issues and Resolutions
Document Processing Failures
| Issue | Resolution |
|---|---|
| Workflow execution fails | Check CloudWatch logs for specific error messages. Look in the Step Functions execution history to identify which step failed. |
| PDF document not processing | Verify the PDF is not password protected or encrypted. Ensure it’s not corrupted by opening it in another application. |
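| OCR fails on document | Check if the document is scanned at sufficient quality. Verify the document doesn’t exceed Amazon Textract size limits (10 MB for synchronous operations and 500 MB for asynchronous operations at the time of writing; check the current Textract quotas). |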
| Classification returns “other” | Review document class definitions. Consider adding more detailed class descriptions or adding few-shot examples. |
| Extraction missing fields | Review attribute descriptions and prompt engineering. Check if fields are present but in an unusual format or location. |
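
To illustrate reading the Step Functions execution history, the following minimal sketch scans events in the shape returned by boto3's `get_execution_history` and reports the last state entered before the first failure event. The helper name `find_failed_state`, the state names, and the sample events are hypothetical and only serve the illustration.

```python
def find_failed_state(events):
    """Return (state_name, failure_event_type) for the first failure, or None."""
    current_state = None
    for event in events:  # events are expected in chronological order
        event_type = event.get("type", "")
        if event_type.endswith("StateEntered"):
            current_state = event["stateEnteredEventDetails"]["name"]
        elif event_type.endswith("Failed") or event_type.endswith("TimedOut"):
            return current_state, event_type
    return None


# Made-up events; a real list would come from
# boto3.client("stepfunctions").get_execution_history(executionArn=...)["events"]
sample_events = [
    {"type": "ExecutionStarted"},
    {"type": "TaskStateEntered", "stateEnteredEventDetails": {"name": "OCRStep"}},
    {"type": "TaskStateExited"},
    {"type": "TaskStateEntered", "stateEnteredEventDetails": {"name": "ClassificationStep"}},
    {"type": "TaskFailed", "taskFailedEventDetails": {"error": "States.TaskFailed"}},
]

failed = find_failed_state(sample_events)
print(failed)
```

Once the failing state is known, the CloudWatch logs of the Lambda function behind that state are usually the next place to look.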
Web UI Access Issues
| Issue | Resolution |
|---|---|
| Cannot login to Web UI | Verify Cognito user status and permissions in AWS Console. Check email for temporary credentials if first-time login. |
| Web UI loads but shows errors | Check browser console for specific error messages. Verify API endpoints are accessible. |
| Cannot see document history | Verify AWS AppSync API permissions. Check CloudWatch Logs for API errors. |
| Configuration changes not saving | Check browser console for validation errors. Verify that the configuration Lambda function has correct permissions. |
Model and Service Issues
| Issue | Resolution |
|---|---|
| Bedrock model throttling | Check CloudWatch metrics for throttling events. Consider reducing the MaxConcurrentWorkflows parameter or requesting service quota increases. |
| SageMaker endpoint errors | Verify endpoint status in SageMaker console. Check endpoint logs for specific error messages. |
| Slow document processing | Monitor CloudWatch metrics to identify bottlenecks. Consider optimizing model selection or increasing concurrency limits. |
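
A common general technique for riding out transient throttling is exponential backoff with jitter. The sketch below shows the idea in isolation; it is not taken from the solution's code, and `call_with_backoff`, `FakeThrottle`, and `flaky_call` are hypothetical names used for a dry run.

```python
import random
import time


def call_with_backoff(fn, is_throttle, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and full jitter on throttling."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not throttling, and the final failed attempt.
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


class FakeThrottle(Exception):
    """Stand-in for a throttling error so the sketch runs without AWS access."""


calls = {"count": 0}


def flaky_call():
    # Throttled on the first two attempts, succeeds on the third.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeThrottle()
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, lambda exc: isinstance(exc, FakeThrottle),
                           sleep=delays.append)
print(result, len(delays))
```

With boto3, `is_throttle` could check whether a `botocore.exceptions.ClientError` carries the error code `ThrottlingException`. botocore also offers built-in retry modes through its `Config(retries=...)` setting, which may be enough on their own.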
Infrastructure Issues
| Issue | Resolution |
|---|---|
| Lambda function timeouts | Increase function timeout or memory allocation. Consider breaking processing into smaller chunks. |
| DynamoDB capacity exceeded | Check CloudWatch metrics for throttling. Consider increasing provisioned capacity or switching to on-demand capacity. |
| DynamoDB config upload fails: “Item size has exceeded the maximum allowed size” | This error occurred in versions prior to the compression fix when configurations had ~45+ document classes, exceeding DynamoDB’s 400KB item limit. Solution: Upgrade to the latest version, which gzip-compresses configuration data (supporting 3,000+ classes). Existing configs auto-migrate on next write. See GitHub Issue #200. |
| S3 permission errors | Verify bucket policies and IAM role permissions. Check for cross-account access issues. |
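
To see why gzip compression helps with the 400KB item limit, the sketch below compares the raw and compressed size of a made-up configuration. The config structure is hypothetical and the solution's actual storage format may differ; DynamoDB also counts attribute names toward item size, so treat the numbers as rough estimates.

```python
import gzip
import json

DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024


def config_sizes(config):
    """Return (raw_bytes, gzip_bytes) for a JSON-serializable configuration."""
    raw = json.dumps(config).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Hypothetical configuration: 60 classes, each with 20 verbose attribute descriptions.
description = "Extract the value of this field from the document. " * 10
config = {
    "classes": [
        {
            "name": f"class-{i}",
            "attributes": [
                {"name": f"field-{j}", "description": description} for j in range(20)
            ],
        }
        for i in range(60)
    ]
}

raw_size, gzip_size = config_sizes(config)
print(raw_size > DYNAMODB_ITEM_LIMIT_BYTES, gzip_size < DYNAMODB_ITEM_LIMIT_BYTES)
```

Configuration text tends to be repetitive, which is why it compresses well; real configurations will compress less dramatically than this artificial example.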
Agent Processing Issues
| Issue | Resolution |
|---|---|
| Agent query shows “processing failed” | Check CloudWatch logs for the Agent Processing Lambda function ({StackName}-AgentProcessorFunction-*). Look for specific error messages, timeout issues, or permission errors. |
| External MCP agent not appearing | Verify the External MCP Agents secret is properly configured with valid JSON array format. Check CloudWatch logs for agent registration errors. |
| Agent responses are incomplete | Check CloudWatch logs for token limits, model throttling, or timeout issues in the Agent Processing function. |
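
Before digging through logs for a missing External MCP agent, it can help to confirm that the secret value parses as a JSON array. The sketch below checks only that; it assumes each entry is a JSON object, does not validate the fields the solution expects, and uses a hypothetical helper name and sample key.

```python
import json


def validate_agent_secret(secret_string):
    """Return a list of problems; an empty list means a JSON array of objects."""
    try:
        value = json.loads(secret_string)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    if not isinstance(value, list):
        return [f"top-level value is {type(value).__name__}, expected a JSON array"]
    return [
        f"entry {index} is not a JSON object"
        for index, entry in enumerate(value)
        if not isinstance(entry, dict)
    ]


# "example_key" is a placeholder, not a field name from the solution.
print(validate_agent_secret('[{"example_key": "value"}]'))
print(validate_agent_secret('{"example_key": "value"}'))
```

A trailing comma or a single object instead of an array are typical mistakes that this kind of check surfaces quickly.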
Performance Considerations
Resource Sizing
Optimize performance through proper resource sizing:
- Lambda Memory: Scale based on document complexity
  - OCR Function: 1024-2048 MB recommended
  - Classification/Extraction: 512-1024 MB for text-only, 1024-2048 MB for image-based processing
- Timeouts: Configure appropriate timeouts
  - Step Functions: 5-15 minutes for standard documents
  - Lambda functions: 1-3 minutes for individual processing steps
  - SQS visibility timeout: 5-6x Lambda function timeout
- Concurrency Settings
  - Set MaxConcurrentWorkflows parameter based on expected volume
  - Consider Lambda reserved concurrency for critical functions
  - Monitor and adjust based on actual usage patterns
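
The visibility timeout guidance is simple arithmetic, sketched here for a given Lambda timeout. The function name is hypothetical; the cap reflects the SQS maximum visibility timeout of 12 hours.

```python
SQS_MAX_VISIBILITY_TIMEOUT_SECONDS = 12 * 60 * 60  # SQS allows at most 12 hours


def visibility_timeout_range(lambda_timeout_seconds):
    """Return (low, high) visibility timeouts at 5x and 6x the Lambda timeout."""
    low = min(5 * lambda_timeout_seconds, SQS_MAX_VISIBILITY_TIMEOUT_SECONDS)
    high = min(6 * lambda_timeout_seconds, SQS_MAX_VISIBILITY_TIMEOUT_SECONDS)
    return low, high


# A Lambda function with a 3-minute timeout
print(visibility_timeout_range(180))
```

For a 180-second function timeout this gives a visibility timeout between 900 and 1080 seconds.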
Performance Optimization Tips
- Document Size and Quality
  - Optimize input document size (150-300 DPI recommended for scans)
  - Reduce file size when possible without losing quality
  - Consider preprocessing large documents to split them
- Model Selection
  - Balance accuracy vs. speed based on use case requirements
  - Test different models with representative documents
  - Consider smaller models for simple documents, larger models for complex extraction
- Batch Processing
  - For high volumes, stagger document uploads
  - Use the load simulation scripts to test capacity
  - Monitor queue depth and processing latency
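
Queue depth can be read from the SQS queue attributes. The sketch below wraps the boto3 `get_queue_attributes` call; the `StubSqsClient` class and the queue URL are placeholders so the example runs without AWS access.

```python
def queue_depth(sqs_client, queue_url):
    """Return approximate waiting and in-flight message counts for a queue."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    attributes = response["Attributes"]  # SQS returns the counts as strings
    return {
        "waiting": int(attributes["ApproximateNumberOfMessages"]),
        "in_flight": int(attributes["ApproximateNumberOfMessagesNotVisible"]),
    }


class StubSqsClient:
    """Stand-in for boto3.client("sqs"); always reports 7 for each attribute."""

    def get_queue_attributes(self, QueueUrl, AttributeNames):
        return {"Attributes": {name: "7" for name in AttributeNames}}


depth = queue_depth(StubSqsClient(), "https://example.invalid/queue")
print(depth)
```

In a real environment, pass `boto3.client("sqs")` and the actual queue URL. A waiting count that keeps growing while the in-flight count stays flat suggests the consumers are the bottleneck.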
Queue Management
Dead Letter Queue (DLQ) Processing
If messages end up in a Dead Letter Queue:
- Review the messages in the DLQ using the AWS Console
- Check CloudWatch Logs for corresponding errors
- Fix the underlying issue (permission, configuration, etc.)
- Use the AWS SDK or Console to move messages back to the main queue:

    import boto3

    sqs = boto3.client('sqs')

    # Get messages from DLQ
    response = sqs.receive_message(
        QueueUrl='dlq-url',
        MaxNumberOfMessages=10,
        VisibilityTimeout=30
    )

    for message in response.get('Messages', []):
        # Move to main queue
        sqs.send_message(
            QueueUrl='main-queue-url',
            MessageBody=message['Body']
        )

        # Delete from DLQ
        sqs.delete_message(
            QueueUrl='dlq-url',
            ReceiptHandle=message['ReceiptHandle']
        )

Stopping Runaway Workflows
If too many workflows are running and need to be stopped:
- Use the provided script to stop workflows:

    ./scripts/stop_workflows.sh <stack-name> <pattern-name>

- Purge the SQS queue if needed:
  - Navigate to SQS in the AWS Console
  - Select the queue
  - Choose “Purge” from the Actions menu
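
If the script is not at hand, stopping running executions can also be sketched with the boto3 Step Functions client. This is an illustrative alternative and not the contents of `stop_workflows.sh`; the function name, the stub client, and the ARNs are hypothetical.

```python
def stop_running_executions(sfn_client, state_machine_arn, cause="Stopped by operator"):
    """Stop every RUNNING execution of a state machine; return the stopped ARNs."""
    stopped = []
    request = {"stateMachineArn": state_machine_arn, "statusFilter": "RUNNING"}
    while True:
        page = sfn_client.list_executions(**request)
        for execution in page["executions"]:
            sfn_client.stop_execution(executionArn=execution["executionArn"], cause=cause)
            stopped.append(execution["executionArn"])
        if "nextToken" not in page:
            return stopped
        request["nextToken"] = page["nextToken"]  # fetch the next page


class StubStepFunctionsClient:
    """Stand-in for boto3.client("stepfunctions") serving two pages of results."""

    def __init__(self):
        self.pages = [
            {"executions": [{"executionArn": "arn:example:1"}], "nextToken": "t1"},
            {"executions": [{"executionArn": "arn:example:2"}]},
        ]
        self.stopped = []

    def list_executions(self, **kwargs):
        return self.pages.pop(0)

    def stop_execution(self, executionArn, cause):
        self.stopped.append(executionArn)


stub = StubStepFunctionsClient()
stopped_arns = stop_running_executions(stub, "arn:example:state-machine")
print(stopped_arns)
```

Stopping executions does not remove messages that are still waiting in the queue, which is why purging the queue may also be needed.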
Security Issues
WAF Blocking Access
If the WAF is blocking legitimate access:
- Check the WAFAllowedIPv4Ranges parameter value
- Update with correct CIDR blocks for allowed IP ranges
- Remember Lambda functions have automatic access regardless of WAF settings
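
A quick local check of the CIDR blocks can rule out typos before updating the stack. The sketch below assumes the parameter value is a comma-separated list of IPv4 CIDR blocks, which is an assumption about its format; the helper name and the addresses (taken from documentation ranges) are illustrative.

```python
import ipaddress


def check_allowed_ranges(param_value, client_ip):
    """Return (invalid_entries, is_allowed) for a comma-separated CIDR list."""
    address = ipaddress.IPv4Address(client_ip)
    invalid, allowed = [], False
    for item in (part.strip() for part in param_value.split(",")):
        try:
            # Strict parsing: rejects malformed blocks and blocks with host bits set.
            network = ipaddress.IPv4Network(item)
        except ValueError:
            invalid.append(item)
            continue
        if address in network:
            allowed = True
    return invalid, allowed


print(check_allowed_ranges("203.0.113.0/24, 198.51.100.7/32", "203.0.113.45"))
print(check_allowed_ranges("203.0.113.1/24", "203.0.113.45"))
```

The second call reports `203.0.113.1/24` as invalid because the host bits are set; the network address for that block is `203.0.113.0/24`.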
Authentication Issues
For Cognito authentication problems:
- Verify user exists in Cognito User Pool
- Check user attributes (email verified, status)
- Reset user password if needed
- Review identity pool configuration
- Check browser console for specific authentication errors
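
The user checks above can be scripted by summarizing the response of the Cognito `admin_get_user` call. The sketch below works on a response-shaped dictionary; the helper name and the sample values are made up for illustration.

```python
def summarize_user(response):
    """Summarize status fields from a Cognito admin_get_user style response."""
    attributes = {item["Name"]: item["Value"] for item in response.get("UserAttributes", [])}
    return {
        "status": response.get("UserStatus"),
        "enabled": response.get("Enabled"),
        "email_verified": attributes.get("email_verified") == "true",
    }


# Made-up response; a real one would come from
# boto3.client("cognito-idp").admin_get_user(UserPoolId=..., Username=...)
sample_response = {
    "Username": "jane",
    "UserStatus": "FORCE_CHANGE_PASSWORD",
    "Enabled": True,
    "UserAttributes": [
        {"Name": "email", "Value": "jane@example.com"},
        {"Name": "email_verified", "Value": "true"},
    ],
}

summary = summarize_user(sample_response)
print(summary)
```

A status of `FORCE_CHANGE_PASSWORD` typically means the user has not yet completed a first login with the temporary password.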
Model-Specific Troubleshooting
Bedrock
- Throttling: Request quota increases or reduce concurrency
- Content Filtering: Review guardrail configuration if content is being filtered unexpectedly
- Prompt Issues: Test prompts directly in Bedrock console or notebook
- Region Availability: Verify model availability in your region
SageMaker
- Endpoint Cold Start: Consider using provisioned concurrency
- GPU Utilization: Monitor utilization and adjust instance type if needed
- Memory Errors: Check inference logs for out-of-memory errors
- Model Loading Errors: Verify model artifacts are correct
Advanced Troubleshooting
End-to-End Tracing
Use X-Ray tracing for advanced diagnostics:
- Enable X-Ray tracing in the CloudFormation template
- View service map in X-Ray console
- Analyze trace details for latency and error hotspots
Log Correlation
Trace document processing across systems:
- Extract correlation ID from log entries
- Search across log groups using CloudWatch Insights:

    fields @timestamp, @message
    | filter @message like "correlation-id-here"
    | sort @timestamp asc

Performance Testing
Test system capacity and identify bottlenecks:
- Use load testing scripts in ./scripts/ directory
- Start with low document rates and increase gradually
- Monitor CloudWatch metrics for saturation points
- Identify bottlenecks and optimize configuration
Build and Deployment Issues
Publishing Script Failures
| Issue | Resolution |
|---|---|
| Generic “Failed to build” error | Use --verbose flag to see detailed error messages: idp-cli publish --source-dir . --region <region> --verbose |
| Python version mismatch | Ensure Python 3.13 is installed and available in PATH. Check with python3 --version |
| SAM build fails | Verify SAM CLI is installed and up to date. Check Docker is running if using containerized builds |
| Missing dependencies | Install required packages: pip install boto3 typer rich botocore |
| Permission errors | Verify AWS credentials are configured and have necessary S3/CloudFormation permissions |
Common Build Error Messages
Python Runtime Error:

    Error: PythonPipBuilder:Validation - Binary validation failed for python, searched for python in following locations: [...] which did not satisfy constraints for runtime: python3.12

Resolution: Install Python 3.13 and ensure it’s in your PATH, or use the --use-container flag for containerized builds.

Docker Not Running:

    Error: Running AWS SAM projects locally requires Docker

Resolution: Start Docker daemon before running the publish script.

AWS Credentials Not Found:

    Error: Unable to locate credentials

Resolution: Configure AWS credentials using aws configure or set environment variables.
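
Several of these errors can be caught before running a build with a small preflight check. The sketch below is a hypothetical helper, not part of `idp-cli`; it only confirms that the tools are on the PATH and does not verify that the Docker daemon is running or that credentials are valid.

```python
import shutil
import sys


def preflight(required_python=(3, 13), tools=("sam", "docker", "aws")):
    """Report the running Python version and which CLI tools are on the PATH."""
    return {
        "python_version": tuple(sys.version_info[:2]),
        "python_matches": tuple(sys.version_info[:2]) == tuple(required_python),
        # shutil.which returns the executable path, or None when it is not found
        "tools": {name: shutil.which(name) for name in tools},
    }


report = preflight()
print(report["python_version"], report["python_matches"])
for name, path in report["tools"].items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Any tool reported as not found should be installed or added to the PATH before retrying the publish command.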
Verbose Mode Usage
For detailed debugging information, always use the --verbose flag when troubleshooting build issues:

    # Standard usage
    idp-cli publish --source-dir . --region us-east-1

    # Verbose mode for troubleshooting
    idp-cli publish --source-dir . --region us-east-1 --verbose

Verbose mode provides:
- Exact SAM build commands being executed
- Complete stdout/stderr from failed operations
- Python environment and dependency information
- Detailed error traces and stack traces
Container-Based Lambda Deployment Issues
| Issue | Resolution |
|---|---|
| Lambda package exceeds 250MB limit | Pattern-2 uses container images automatically. For Pattern-1/3, consider reducing dependency size or switching to container images in a future update. |
| Docker daemon not running | Start Docker Desktop or Docker service before running container deployment |
| ECR login failed | Ensure AWS credentials have ECR permissions. The script will automatically handle ECR login |
| Container build fails | Check Dockerfile syntax and ensure all referenced files exist |
| Image push timeout | Check network connectivity and ECR repository permissions |
Container Deployment Behavior:
- Pattern-2 builds and pushes container images automatically when Pattern-2 changes are detected.
- Ensure Docker Desktop/service is running and your AWS credentials have ECR permissions.
- Use --verbose to see detailed build and push logs.