Circuit Breaker

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0

Protects the IDP pipeline from cascading failures when Amazon Bedrock is degraded or unavailable. When Bedrock starts returning errors at a configurable rate, the circuit breaker opens and new workflows stop starting. Messages stay in SQS instead of fanning out into Lambda retries that would eventually time out or burn through the Step Functions retry budget. Once Bedrock recovers, the breaker transitions through a half-open probe state back to closed and normal processing resumes.

Without the circuit breaker, a Bedrock outage produces this chain:

  1. Workflows start normally.
  2. Every Bedrock call hits the in-client retry loop (up to 7 attempts, exponential backoff up to 5 minutes).
  3. Step Functions retries another 8 times on failure.
  4. Individual executions hang for up to 15 minutes before failing.
  5. Meanwhile new documents keep getting pulled from SQS and starting more doomed workflows, which waste Lambda concurrency and inflate cost.

The circuit breaker short-circuits that cycle by refusing to start new workflows while Bedrock is unhealthy, so messages stay in the queue and process cleanly after recovery.
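
A rough worst-case count shows why the chain above is expensive. The sketch below assumes the two retry layers multiply (each Step Functions attempt re-runs the full in-client retry loop) and that a workflow step makes a single logical Bedrock call. Both are simplifying assumptions for illustration, not measured behavior.

```python
# Worst-case Bedrock calls for one workflow step during an outage.
# Assumption: every Step Functions attempt re-runs the whole client retry loop.
CLIENT_ATTEMPTS = 7          # in-client retry loop, from the list above
STEP_FUNCTIONS_RETRIES = 8   # retries after the first failed attempt


def worst_case_calls(client_attempts: int, sfn_retries: int) -> int:
    """Return the number of Bedrock calls made if every attempt fails."""
    sfn_attempts = 1 + sfn_retries  # initial attempt plus retries
    return client_attempts * sfn_attempts


print(worst_case_calls(CLIENT_ATTEMPTS, STEP_FUNCTIONS_RETRIES))  # 63
```

Under these assumptions a single step can issue dozens of failing calls, and that cost repeats for every document pulled from the queue while the outage lasts.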

| State | Behavior |
| --- | --- |
| CLOSED | Normal operation. All requests processed. |
| OPEN | Bedrock unavailable. Queue Processor returns messages to SQS for retry. |
| HALF_OPEN | Testing recovery. Limited traffic allowed through. First successful workflow closes the breaker; first new alarm reopens it. |

┌───────────┐
│  CLOSED   │◄─────────── Successful workflow in HALF_OPEN
└─────┬─────┘             OR alarm returns to OK
      │ CloudWatch alarm fires
      ▼
┌───────────┐
│   OPEN    │◄─── Alarm fires during HALF_OPEN
└─────┬─────┘
      │ Recovery timeout OR alarm OK
      ▼
┌───────────┐
│ HALF_OPEN │
└───────────┘

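
The transitions in the diagram can be sketched as a lookup table. This is a minimal illustration, not the CircuitBreakerManager code: the event names are hypothetical, and treating "alarm returns to OK" in HALF_OPEN as a transition to CLOSED is one reading of the diagram.

```python
from enum import Enum


class State(str, Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class Event(str, Enum):  # hypothetical event names, not taken from the source
    ALARM_FIRED = "ALARM_FIRED"
    ALARM_OK = "ALARM_OK"
    RECOVERY_TIMEOUT = "RECOVERY_TIMEOUT"
    WORKFLOW_SUCCEEDED = "WORKFLOW_SUCCEEDED"


TRANSITIONS = {
    (State.CLOSED, Event.ALARM_FIRED): State.OPEN,
    (State.OPEN, Event.RECOVERY_TIMEOUT): State.HALF_OPEN,
    (State.OPEN, Event.ALARM_OK): State.HALF_OPEN,
    (State.HALF_OPEN, Event.WORKFLOW_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.ALARM_OK): State.CLOSED,  # assumed reading of the diagram
    (State.HALF_OPEN, Event.ALARM_FIRED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the new state, or the current one if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


# Walk one full outage-and-recovery cycle.
state = State.CLOSED
for event in (Event.ALARM_FIRED, Event.RECOVERY_TIMEOUT, Event.WORKFLOW_SUCCEEDED):
    state = next_state(state, event)
print(state.value)  # CLOSED
```

Events that have no entry for the current state are ignored, so a repeated alarm while OPEN leaves the breaker OPEN.
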
┌─────────────┐    SNS     ┌────────────────────┐    DynamoDB    ┌──────────────┐
│ CloudWatch  │───────────►│  Circuit Breaker   │◄──────────────►│ Concurrency  │
│   Alarm     │            │      Manager       │                │    Table     │
└─────────────┘            └─────────┬──────────┘                └──────────────┘
                                     │                                  ▲
                                     │ SNS                              │
                                     ▼                                  │
                              ┌──────────────┐                          │
                              │ AlertsTopic  │                          │
                              │ (notify ops) │                          │
                              └──────────────┘                          │
┌─────────────┐                                                         │
│    SQS      │────────────►┌─────────────────┐  check state before     │
│   Queue     │             │ Queue Processor │─────processing──────────┘
└─────────────┘             └────────┬────────┘
                                     │ if CLOSED or HALF_OPEN
                                     ▼
                            ┌─────────────────┐
                            │ Step Functions  │
                            │    Workflow     │
                            └─────────────────┘

  • BedrockServiceOutageAlarm: CloudWatch MetricMath alarm on Bedrock error metrics (see Alarm threshold).
  • CircuitBreakerManager: Lambda triggered by the alarm’s SNS topic and by a 5-minute EventBridge schedule. Manages state transitions and publishes notifications.
  • ConcurrencyTable: Existing DynamoDB table — circuit breaker state is stored on the circuit_breaker partition key.
  • QueueProcessor: Reads state before starting workflows. If OPEN, it returns without starting the Step Functions execution, leaving the message in SQS.
  • WorkflowTracker: On a successful workflow completion, if state is HALF_OPEN, transitions to CLOSED.
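
The QueueProcessor gate described above can be sketched as a pure function over the stored state record. This is an illustration only: the attribute name `state` and the choice to treat a missing record as CLOSED are assumptions, not confirmed details of the implementation.

```python
from typing import Mapping, Optional

# States in which the diagram lets a workflow start.
STARTABLE_STATES = ("CLOSED", "HALF_OPEN")


def should_start_workflow(record: Optional[Mapping[str, str]]) -> bool:
    """Decide whether to start the Step Functions execution for a message.

    `record` is the circuit breaker item read from the concurrency table,
    already converted to a plain dict, or None if no item exists.
    """
    if record is None:
        # Assumption: no record yet means the breaker has never opened.
        return True
    # Assumption: the state is stored under an attribute named "state".
    return record.get("state", "CLOSED") in STARTABLE_STATES


for record in (None, {"state": "CLOSED"}, {"state": "HALF_OPEN"}, {"state": "OPEN"}):
    print(record, should_start_workflow(record))
```

When the function returns False, the processor would return without starting the execution so the message stays in SQS, as the QueueProcessor bullet describes.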

The BedrockServiceOutageAlarm uses MetricMath to sum the Bedrock error categories you opt into and compares the total to CircuitBreakerFailureThreshold.

Expression:

<SU> * FILL(m1, 0) + <Thr> * FILL(m2, 0) + <QL> * FILL(m3, 0)

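Each coefficient (<SU>, <Thr>, <QL>) is 1 if the corresponding trigger is enabled and 0 otherwise. The metrics m1, m2, and m3 are BedrockServiceUnavailable, BedrockThrottling, and BedrockQuotaLimit, published under the stack namespace. FILL(m, 0) substitutes 0 for periods with no data points, so a metric that reported nothing does not leave the sum undefined.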
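
To make the arithmetic concrete, the expression can be mirrored in a few lines of Python. This is a sketch of the math only, not how CloudWatch evaluates it; the function and argument names are hypothetical, `None` stands in for a period with no data, and the defaults follow the parameter defaults documented for this feature (only the ServiceUnavailable trigger on, threshold 3, breach at or above the threshold).

```python
from typing import Optional


def fill(value: Optional[float], default: float = 0.0) -> float:
    """Mimic FILL(m, 0): replace a missing data point with a default."""
    return default if value is None else value


def combined_errors(
    m1: Optional[float],  # BedrockServiceUnavailable count for the period
    m2: Optional[float],  # BedrockThrottling count for the period
    m3: Optional[float],  # BedrockQuotaLimit count for the period
    *,
    su: bool = True,
    thr: bool = False,
    ql: bool = False,
) -> float:
    """Sum only the error categories whose trigger is enabled."""
    return int(su) * fill(m1) + int(thr) * fill(m2) + int(ql) * fill(m3)


def breaches(total: float, threshold: int = 3) -> bool:
    """A period breaches when the combined count reaches the threshold."""
    return total >= threshold


# Default triggers: 2 outages and 10 throttles do not breach.
print(breaches(combined_errors(2, 10, None)))            # False
# With throttling enabled the same period breaches.
print(breaches(combined_errors(2, 10, None, thr=True)))  # True
```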

| Parameter | Default | Description |
| --- | --- | --- |
| CircuitBreakerEnabled | false | Master switch. Set to true to provision the alarm, SNS topic, manager Lambda, and traffic gate. |
| CircuitBreakerTriggerServiceUnavailable | true | Count 503 ServiceUnavailableException errors toward threshold. |
| CircuitBreakerTriggerThrottling | false | Count ThrottlingException, TooManyRequestsException, RequestLimitExceeded. |
| CircuitBreakerTriggerQuotaLimit | false | Count ServiceQuotaExceededException. |
| CircuitBreakerFailureThreshold | 3 | Combined error count per 5-minute period to breach. |
| CircuitBreakerEvaluationPeriods | 1 | Consecutive periods that must breach. |
| CircuitBreakerRecoveryTimeoutSeconds | 300 | Seconds before automatic OPEN → HALF_OPEN transition. |
| CircuitBreakerErrorHandlerArn | (empty) | Optional Lambda ARN invoked on state changes for custom handling. |

Default behavior when enabled: 3 or more ServiceUnavailableException errors in a single 5-minute window open the breaker. Throttling and quota-limit errors are not counted by default because those usually indicate client-side load issues, not a Bedrock outage. Enable additional triggers to protect against sustained throttling or quota exhaustion.

The Bedrock client in idp_common emits category-specific CloudWatch metrics under the stack namespace whenever it catches a retryable error:

| Metric | Bedrock exception code(s) | Typical cause |
| --- | --- | --- |
| BedrockServiceUnavailable | ServiceUnavailableException (503) | Bedrock service degradation or regional outage |
| BedrockThrottling | ThrottlingException, TooManyRequestsException, RequestLimitExceeded | Client-side throughput limits reached |
| BedrockQuotaLimit | ServiceQuotaExceededException | Account quota exhausted for the model |

These metrics are emitted unconditionally, independent of CircuitBreakerEnabled, so you can observe Bedrock error rates even when the circuit breaker is disabled.
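
The mapping from exception code to metric can be sketched as a simple lookup. This is an illustration of the table above rather than the idp_common code; the function name is hypothetical, and the step that actually publishes the metric is left out.

```python
from typing import Optional

# Exception code -> CloudWatch metric name, copied from the table above.
METRIC_BY_ERROR_CODE = {
    "ServiceUnavailableException": "BedrockServiceUnavailable",
    "ThrottlingException": "BedrockThrottling",
    "TooManyRequestsException": "BedrockThrottling",
    "RequestLimitExceeded": "BedrockThrottling",
    "ServiceQuotaExceededException": "BedrockQuotaLimit",
}


def metric_for_error_code(code: str) -> Optional[str]:
    """Return the metric to increment for an error code, or None if untracked.

    With boto3, the code of a caught ClientError is available as
    err.response["Error"]["Code"].
    """
    return METRIC_BY_ERROR_CODE.get(code)


print(metric_for_error_code("TooManyRequestsException"))  # BedrockThrottling
print(metric_for_error_code("ValidationException"))       # None
```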

Set CircuitBreakerEnabled=true at deploy time:

    aws cloudformation deploy \
        --stack-name my-idp-stack \
        --template-file template.yaml \
        --parameter-overrides \
            CircuitBreakerEnabled=true \
            CircuitBreakerFailureThreshold=3 \
            CircuitBreakerTriggerThrottling=true

You can also set the parameter through the IDP CLI or the console. When CircuitBreakerEnabled=false (the default), none of the circuit breaker resources are provisioned (no alarm, no SNS topic, no manager Lambda), and the Queue Processor skips the state check entirely.

For GovCloud or other environments experiencing intermittent Bedrock outages:

CircuitBreakerFailureThreshold: 3
CircuitBreakerEvaluationPeriods: 1 # 5-minute window
CircuitBreakerRecoveryTimeoutSeconds: 300 # 5 minutes before probing

For stable environments where you want the breaker as a safety net only:

CircuitBreakerFailureThreshold: 5
CircuitBreakerEvaluationPeriods: 2 # 10-minute window
CircuitBreakerRecoveryTimeoutSeconds: 600 # 10 minutes
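
The window comments in the two profiles follow from the 5-minute alarm period multiplied by the number of evaluation periods. A small sketch of that arithmetic, with a hypothetical `BreakerProfile` class used only for illustration:

```python
from dataclasses import dataclass

ALARM_PERIOD_SECONDS = 300  # each evaluation period is 5 minutes


@dataclass(frozen=True)
class BreakerProfile:  # hypothetical helper, not part of the stack
    failure_threshold: int
    evaluation_periods: int
    recovery_timeout_seconds: int

    @property
    def detection_window_minutes(self) -> int:
        """Minutes of consecutive breaching periods needed to open the breaker."""
        return self.evaluation_periods * ALARM_PERIOD_SECONDS // 60

    @property
    def recovery_timeout_minutes(self) -> float:
        """Minutes spent OPEN before the automatic move to HALF_OPEN."""
        return self.recovery_timeout_seconds / 60


INTERMITTENT = BreakerProfile(3, 1, 300)  # first profile above
SAFETY_NET = BreakerProfile(5, 2, 600)    # second profile above

print(INTERMITTENT.detection_window_minutes)  # 5
print(SAFETY_NET.detection_window_minutes)    # 10
```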

When CircuitBreakerEnabled=true, the document list header shows a live status badge that reflects the current breaker state via an AppSync subscription:

| Badge | State | Meaning |
| --- | --- | --- |
| Green “Circuit: closed” | CLOSED | Normal operation |
| Blue “Circuit: recovering” | HALF_OPEN | Probing recovery |
| Red “Circuit: Bedrock outage” | OPEN (automatic) | Opened by BedrockServiceOutageAlarm; hover for lastError |
| Red “Circuit: manually paused” | OPEN (manual) | Opened via the admin Pause processing control; hover for the reason |

When CircuitBreakerEnabled=false the badge is hidden entirely.

Click the badge to open a details panel showing state, openedAt, lastCheckedAt, failureCount, recoveryAttempts, and lastError. Users in the Admin Cognito group additionally see three controls:

  • Pause processing — forces OPEN (available when state is CLOSED or HALF_OPEN). Use before planned Bedrock changes or to quiesce the pipeline.
  • Resume processing — forces CLOSED and resets failure/recovery counters. Use to clear a stuck OPEN state.
  • Probe recovery — forces HALF_OPEN (available when state is OPEN). Use to test recovery before the automatic timeout.

Each control requires a reason that is persisted to DynamoDB (lastError field for pause; also logged) and broadcast over the existing SNS alerts topic. All transitions — including automatic ones from CloudWatch alarms, the scheduled health check, and the HALF_OPEN → CLOSED transition triggered by a successful workflow completion — fan out to every connected browser in real time.

Non-admins can view the panel but do not see the control buttons.
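
The availability rules for the three controls can be summarized in a short function. This is a sketch, not the UI code: the control identifiers are hypothetical, and because the text above does not say when Resume processing is offered, the sketch assumes it is available in any state other than CLOSED.

```python
from typing import List


def available_controls(state: str, is_admin: bool) -> List[str]:
    """Return the control buttons shown in the details panel."""
    if not is_admin:
        return []  # non-admins can view the panel but see no controls
    controls = []
    if state in ("CLOSED", "HALF_OPEN"):
        controls.append("pause")   # forces OPEN
    if state != "CLOSED":
        # Assumption: availability of Resume is not stated in the source.
        controls.append("resume")  # forces CLOSED and resets counters
    if state == "OPEN":
        controls.append("probe")   # forces HALF_OPEN
    return controls


print(available_controls("OPEN", is_admin=True))   # ['resume', 'probe']
print(available_controls("OPEN", is_admin=False))  # []
```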

Reset the circuit breaker (force CLOSED):

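    # --cli-binary-format is needed on AWS CLI v2 so the JSON payload is not read as base64
    aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \
        --cli-binary-format raw-in-base64-out \
        --payload '{"action": "reset"}' response.json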

Check current state:

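    # --cli-binary-format is needed on AWS CLI v2 so the JSON payload is not read as base64
    aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \
        --cli-binary-format raw-in-base64-out \
        --payload '{"action": "get_state"}' response.json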

Or read state directly from DynamoDB:

    aws dynamodb get-item \
        --table-name <ConcurrencyTable> \
        --key '{"counter_id": {"S": "circuit_breaker"}}'
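
The get-item output uses DynamoDB's typed JSON, where each attribute is wrapped in a type tag such as S or N. The sketch below flattens that output into a plain dict for scripting. Only the common type tags are handled, and the attribute names other than counter_id in the sample are assumptions about how the state is stored.

```python
import json
from typing import Any, Dict


def unwrap(attr: Dict[str, Any]) -> Any:
    """Convert one DynamoDB-typed attribute value to a plain Python value."""
    (type_tag, value), = attr.items()  # each attribute has exactly one type tag
    if type_tag in ("S", "BOOL"):
        return value
    if type_tag == "N":
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {key: unwrap(inner) for key, inner in value.items()}
    if type_tag == "L":
        return [unwrap(inner) for inner in value]
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")


def parse_get_item_output(output: str) -> Dict[str, Any]:
    """Flatten the JSON printed by `aws dynamodb get-item`; {} if no item."""
    document = json.loads(output) if output.strip() else {}
    return {name: unwrap(attr) for name, attr in document.get("Item", {}).items()}


# Sample output; "state" and "failureCount" are assumed attribute names.
sample = (
    '{"Item": {"counter_id": {"S": "circuit_breaker"}, '
    '"state": {"S": "OPEN"}, "failureCount": {"N": "3"}}}'
)
print(parse_get_item_output(sample))
```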

CloudWatch metrics emitted by the circuit breaker (under the stack namespace):

  • CircuitBreakerOpened — incremented each time the breaker transitions to OPEN
  • CircuitBreakerHalfOpen — incremented on transition to HALF_OPEN
  • CircuitBreakerClosed — incremented on transition to CLOSED

The AlertsTopic receives SNS notifications on every state transition so operators can subscribe email, SMS, or PagerDuty endpoints.