
GovCloud Operations

Monitoring, troubleshooting, and operational best practices for the GenAI IDP solution in GovCloud.

The solution deploys CloudWatch dashboards automatically. Access them via the CloudWatch console in your GovCloud region.

Key metrics to monitor:

  • Step Functions: Execution success/failure rates, duration
  • Lambda Functions: Invocation count, error rate, duration, throttles
  • SQS Queues: Queue depth, age of oldest message
  • DynamoDB: Read/write capacity, throttled requests
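As a rough sketch, the metrics above correspond to standard CloudWatch namespaces and metric names. The helper below only builds the `MetricDataQueries` structure accepted by CloudWatch `GetMetricData`. The helper name, the statistics chosen, and the 300-second period are illustrative assumptions, not something the solution prescribes, and the resource value you pass in is a placeholder for your own state machine ARN, function name, queue name, or table name.

```python
# Sketch: map the key metrics listed above to CloudWatch GetMetricData queries.
# Namespaces and metric names are standard AWS ones; the statistics and the
# default period are assumptions chosen for illustration.

KEY_METRICS = {
    "stepfunctions": [
        ("AWS/States", "ExecutionsSucceeded", "Sum"),
        ("AWS/States", "ExecutionsFailed", "Sum"),
        ("AWS/States", "ExecutionTime", "Average"),
    ],
    "lambda": [
        ("AWS/Lambda", "Invocations", "Sum"),
        ("AWS/Lambda", "Errors", "Sum"),
        ("AWS/Lambda", "Duration", "Average"),
        ("AWS/Lambda", "Throttles", "Sum"),
    ],
    "sqs": [
        ("AWS/SQS", "ApproximateNumberOfMessagesVisible", "Maximum"),
        ("AWS/SQS", "ApproximateAgeOfOldestMessage", "Maximum"),
    ],
    "dynamodb": [
        ("AWS/DynamoDB", "ConsumedReadCapacityUnits", "Sum"),
        ("AWS/DynamoDB", "ConsumedWriteCapacityUnits", "Sum"),
        # Table-level throttle counters (TableName is the only dimension needed).
        ("AWS/DynamoDB", "ReadThrottleEvents", "Sum"),
        ("AWS/DynamoDB", "WriteThrottleEvents", "Sum"),
    ],
}

# The CloudWatch dimension that identifies a single resource for each service.
DIMENSION_NAME = {
    "stepfunctions": "StateMachineArn",
    "lambda": "FunctionName",
    "sqs": "QueueName",
    "dynamodb": "TableName",
}


def build_metric_queries(service, resource, period=300):
    """Return a MetricDataQueries list for one resource of the given service."""
    dimension = {"Name": DIMENSION_NAME[service], "Value": resource}
    return [
        {
            "Id": f"{service}_{index}",  # Ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric,
                    "Dimensions": [dimension],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        }
        for index, (namespace, metric, stat) in enumerate(KEY_METRICS[service])
    ]
```

The returned list can be passed as `MetricDataQueries` (together with `StartTime` and `EndTime`) to `get_metric_data` on a boto3 CloudWatch client created for your GovCloud region.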

Out of the box, the stack creates alarms for two Step Functions conditions:

  • Step Functions execution failures
  • Step Functions slow executions

Alarms publish to an SNS topic. Subscribe your team’s email or pager to that topic to receive notifications.
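One way to script the subscription is sketched below with boto3. The function names, the topic ARN, and the email address are hypothetical; look up the actual alarm topic ARN in your stack's resources. The partition check reflects the fact that GovCloud ARNs use the `aws-us-gov` partition.

```python
# Sketch: subscribe an email address to the alarm SNS topic.
# build_email_subscription() and subscribe_email() are illustrative helpers,
# not part of the solution.

def build_email_subscription(topic_arn, email):
    """Validate a GovCloud SNS topic ARN and return kwargs for sns.subscribe()."""
    parts = topic_arn.split(":")
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {topic_arn}")
    if parts[1] != "aws-us-gov":
        raise ValueError("expected an ARN in the aws-us-gov partition")
    return {
        "TopicArn": topic_arn,
        "Protocol": "email",
        "Endpoint": email,
        "ReturnSubscriptionArn": True,
    }


def subscribe_email(topic_arn, email):
    """Create the subscription; the recipient still has to confirm by email."""
    import boto3  # imported here so the helper above works without boto3

    region = topic_arn.split(":")[3]  # e.g. us-gov-west-1
    sns = boto3.client("sns", region_name=region)
    response = sns.subscribe(**build_email_subscription(topic_arn, email))
    return response["SubscriptionArn"]
```

Email subscriptions stay pending until the recipient confirms them, so create them before you depend on the alarms.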

All Workflow Lambda functions write to dedicated CloudWatch Log Groups with the naming convention /{stack-name}-stack-PATTERN2STACK-{cfn-nested-stack-id}/lambda/{function}. The functions for Pattern 2 are:

  • OCRFunction
  • ClassificationFunction
  • ExtractionFunction
  • AssessmentFunction
  • ProcessResultsFunction
  • SummarizationFunction
  • EvaluationFunction
  • RuleValidationFunction
  • RuleValidationOrchestrationFunction
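The naming convention can be expanded programmatically, for example to feed a list of log groups into a query. The sketch below is plain string formatting; the helper name and the example stack name and nested stack ID are placeholders.

```python
# Sketch: expand the Pattern 2 log group naming convention described above.

PATTERN2_FUNCTIONS = [
    "OCRFunction",
    "ClassificationFunction",
    "ExtractionFunction",
    "AssessmentFunction",
    "ProcessResultsFunction",
    "SummarizationFunction",
    "EvaluationFunction",
    "RuleValidationFunction",
    "RuleValidationOrchestrationFunction",
]


def pattern2_log_groups(stack_name, nested_stack_id):
    """Return {function name: log group name} for every Pattern 2 function."""
    prefix = f"/{stack_name}-stack-PATTERN2STACK-{nested_stack_id}/lambda"
    return {function: f"{prefix}/{function}" for function in PATTERN2_FUNCTIONS}


if __name__ == "__main__":
    # "idp" and "ABC123" are placeholder values for illustration only.
    for name in pattern2_log_groups("idp", "ABC123").values():
        print(name)
```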

Use CloudWatch Logs Insights to query across functions:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
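The same query can be run outside the console. The following is a minimal sketch using the CloudWatch Logs `start_query` and `get_query_results` operations; the helper names, the one-hour lookback, and the polling interval are assumptions. The client is passed in so that you control the region and credentials, for example `boto3.client("logs", region_name="us-gov-west-1")`.

```python
# Sketch: run the error query above across several log groups.
import time

ERROR_QUERY = "\n".join([
    "fields @timestamp, @message",
    "| filter @message like /ERROR/",
    "| sort @timestamp desc",
    "| limit 50",
])


def build_start_query_args(log_groups, lookback_minutes=60, now=None):
    """Return kwargs for logs.start_query() covering the last N minutes."""
    end = int(time.time() if now is None else now)  # epoch seconds
    return {
        "logGroupNames": list(log_groups),
        "startTime": end - lookback_minutes * 60,
        "endTime": end,
        "queryString": ERROR_QUERY,
    }


def run_error_query(logs_client, log_groups, lookback_minutes=60, poll_seconds=1.0):
    """Start the query and poll until it leaves the Scheduled/Running states."""
    args = build_start_query_args(log_groups, lookback_minutes)
    query_id = logs_client.start_query(**args)["queryId"]
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response  # includes "status" and "results"
        time.sleep(poll_seconds)
```

Check the returned `status` before reading `results`, since a query can also end as `Failed`, `Cancelled`, or `Timeout`.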

The headless deployment of the IDP solution provisions additional Lambda functions to support the headless workflow:

  1. API Lambda Handler

    • Log Group: {stack-name}-ApiHandlerLogGroup-{cfn-id}
  2. Batch Pre-Processor

    • Log Group: /aws/lambda/{stack-name}-BatchPreProcessorFunction-{cfn-id}
  3. Job Tracker

    • Log Group: {stack-name}-JobTrackerLogGroup-{cfn-id}

    Note: {cfn-id} is a unique alphanumeric string generated by CloudFormation at stack creation time. It is unique for each resource and is stable across stack updates; however, it changes if the stack is deleted and recreated.
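Because {cfn-id} is generated, the exact log group names have to be discovered per deployment. The sketch below matches a list of log group names (for example, names collected from the CloudWatch Logs `describe_log_groups` operation) against the three conventions above. The helper names are hypothetical, and the pattern assumes {cfn-id} is purely alphanumeric, as described in the note.

```python
# Sketch: pick out the headless-workflow log groups from a list of names.
import re


def headless_log_group_patterns(stack_name):
    """Return a compiled regex per headless component for the given stack."""
    stack = re.escape(stack_name)
    cfn_id = r"[A-Za-z0-9]+"  # assumption: {cfn-id} is alphanumeric only
    return {
        "API Lambda Handler": re.compile(
            rf"^{stack}-ApiHandlerLogGroup-{cfn_id}$"),
        "Batch Pre-Processor": re.compile(
            rf"^/aws/lambda/{stack}-BatchPreProcessorFunction-{cfn_id}$"),
        "Job Tracker": re.compile(
            rf"^{stack}-JobTrackerLogGroup-{cfn_id}$"),
    }


def match_headless_log_groups(stack_name, log_group_names):
    """Map each component to the first matching log group name, or None."""
    patterns = headless_log_group_patterns(stack_name)
    found = {component: None for component in patterns}
    for name in log_group_names:
        for component, pattern in patterns.items():
            if found[component] is None and pattern.match(name):
                found[component] = name
    return found
```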

If a document fails to process:

  1. Check Step Functions execution history: Open the Step Functions console, find the failed execution, and inspect the failed state’s input/output
  2. Check Lambda logs: The failed state maps to a specific Lambda — check its CloudWatch log group for the error
  3. Common causes:
    • Textract unable to process document (unsupported format, corrupt file)
    • Bedrock model throttling (check for ThrottlingException)
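The first step above can also be approximated in code. The sketch below scans the `events` list returned by the Step Functions `GetExecutionHistory` operation (in its default chronological order) and reports the most recently entered state before the first failure event. The function name and the state names in any example input are hypothetical, and for Map or Parallel states the most recently entered state is not necessarily the one that failed, so treat the result as a starting point.

```python
# Sketch: locate the likely failed state in a Step Functions execution history.

FAILURE_TYPES = {
    "ExecutionFailed",
    "ExecutionTimedOut",
    "TaskFailed",
    "TaskTimedOut",
    "LambdaFunctionFailed",
    "LambdaFunctionTimedOut",
}


def find_failed_state(events):
    """Return state/error/cause for the first failure event, or None."""
    last_state = None
    for event in events:  # events are expected in chronological order
        event_type = event["type"]
        if event_type.endswith("StateEntered"):
            last_state = event["stateEnteredEventDetails"]["name"]
        elif event_type in FAILURE_TYPES:
            # Detail keys follow the event type, e.g. TaskFailed ->
            # taskFailedEventDetails.
            key = event_type[0].lower() + event_type[1:] + "EventDetails"
            details = event.get(key, {})
            error = details.get("error")
            cause = details.get("cause")
            return {
                "state": last_state,
                "error": error,
                "cause": cause,
                # Flags the Bedrock throttling case mentioned above.
                "throttled": "ThrottlingException" in f"{error} {cause}",
            }
    return None
```

The reported state name tells you which Lambda function's log group to open for step 2.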

If Lambda functions time out after deploying in a VPC:

  • Verify all required VPC endpoints exist (see VPC Deployment Guide)
  • Check that the security group on the VPC interface endpoints allows inbound HTTPS traffic (port 443) from the CIDR range or security groups of the Lambda functions
  • Check the private subnets that the Lambda functions are deployed into and confirm that their route tables contain VPC gateway endpoint routes for S3 and DynamoDB
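The last two checks can be sketched as pure functions over EC2 API output. The helper names are hypothetical. `allows_https_from` takes the `IpPermissions` list of a security group (as returned by `DescribeSecurityGroups`) and handles IPv4 CIDR rules only, not security group references. `missing_gateway_routes` takes the `VpcEndpoints` list from `DescribeVpcEndpoints` and the ID of the route table associated with a Lambda subnet.

```python
# Sketch: offline checks for the VPC endpoint troubleshooting steps above.
import ipaddress


def allows_https_from(ip_permissions, source_cidr):
    """True if an inbound rule permits TCP 443 from the whole source CIDR."""
    source = ipaddress.ip_network(source_cidr)
    for permission in ip_permissions:
        protocol = permission.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and ports
            port_ok = True
        elif protocol == "tcp":
            port_ok = permission.get("FromPort", 0) <= 443 <= permission.get("ToPort", 65535)
        else:
            port_ok = False
        if not port_ok:
            continue
        for ip_range in permission.get("IpRanges", []):
            if source.subnet_of(ipaddress.ip_network(ip_range["CidrIp"])):
                return True
    return False


def missing_gateway_routes(vpc_endpoints, route_table_id, region):
    """Return the S3/DynamoDB gateway services not routed from the route table."""
    required = {f"com.amazonaws.{region}.s3", f"com.amazonaws.{region}.dynamodb"}
    covered = set()
    for endpoint in vpc_endpoints:
        is_gateway = endpoint.get("VpcEndpointType") == "Gateway"
        is_available = str(endpoint.get("State", "")).lower() == "available"
        if is_gateway and is_available and route_table_id in endpoint.get("RouteTableIds", []):
            covered.add(endpoint["ServiceName"])
    return sorted(required - covered)
```

An empty list from `missing_gateway_routes` means both gateway endpoints are associated with that route table.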

If documents are queuing up and not processing:

  • Check SQS queue depth in CloudWatch
  • Verify Lambda concurrency limits aren’t being hit
  • Check the DynamoDB concurrency table for stuck entries
  • Look for throttling errors in the QueueProcessor Lambda logs

Operational best practices:

  • Set up SNS subscriptions for the alarm topic before processing production workloads
  • Enable S3 access logging on the input and output buckets for audit trails
  • Review CloudWatch dashboards on a regular cadence to catch trends before they become incidents
  • Test failover by processing sample documents after any infrastructure changes
  • Monitor costs through AWS Cost and Usage Reports. Note: billing data is not available directly in GovCloud accounts; sign in to the associated commercial AWS account to view it. See AWS GovCloud (US) Billing and Payment for more information.