Circuit Breaker
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
Protects the IDP pipeline from cascading failures when Amazon Bedrock is degraded or unavailable. When Bedrock starts returning errors at a configurable rate, the circuit breaker opens and new workflows stop starting. Messages stay in SQS instead of fanning out into Lambda retries that would eventually time out or burn through the Step Functions retry budget. Once Bedrock recovers, the breaker transitions through a half-open probe state back to closed and normal processing resumes.
Why it exists
Without the circuit breaker, a Bedrock outage produces this chain:
- Workflows start normally.
- Every Bedrock call hits the in-client retry loop (up to 7 attempts, exponential backoff up to 5 minutes).
- Step Functions retries another 8 times on failure.
- Individual executions hang for up to 15 minutes before failing.
- Meanwhile new documents keep getting pulled from SQS and starting more doomed workflows, which waste Lambda concurrency and inflate cost.
The circuit breaker short-circuits that cycle by refusing to start new workflows while Bedrock is unhealthy, so messages stay in the queue and process cleanly after recovery.
States
| State | Behavior |
|---|---|
| CLOSED | Normal operation. All requests processed. |
| OPEN | Bedrock unavailable. Queue Processor returns messages to SQS for retry. |
| HALF_OPEN | Testing recovery. Limited traffic allowed through. First successful workflow closes the breaker; first new alarm reopens it. |
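The states above and the events that move the breaker between them can be modeled as a small transition table. This is a minimal sketch for illustration, not the CircuitBreakerManager implementation: the `Event` names and the `next_state` helper are hypothetical labels for the triggers this page describes, and the rule that an OK alarm during HALF_OPEN closes the breaker is an assumption.

```python
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class Event(Enum):
    # Hypothetical event names for the triggers described on this page.
    ALARM = "alarm fires"
    ALARM_OK = "alarm returns to OK"
    RECOVERY_TIMEOUT = "recovery timeout elapsed"
    WORKFLOW_SUCCESS = "workflow completed successfully"


# (current state, event) -> next state.
# Pairs that are not listed leave the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.ALARM): State.OPEN,
    (State.OPEN, Event.RECOVERY_TIMEOUT): State.HALF_OPEN,
    (State.OPEN, Event.ALARM_OK): State.HALF_OPEN,
    (State.HALF_OPEN, Event.WORKFLOW_SUCCESS): State.CLOSED,
    # Assumption: an OK alarm while probing also closes the breaker.
    (State.HALF_OPEN, Event.ALARM_OK): State.CLOSED,
    (State.HALF_OPEN, Event.ALARM): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)


# Walk through an outage followed by a successful recovery probe.
state = State.CLOSED
for event in (Event.ALARM, Event.RECOVERY_TIMEOUT, Event.WORKFLOW_SUCCESS):
    state = next_state(state, event)
print(state.value)  # CLOSED
```

Keeping the transitions in one table makes it easy to see that OPEN is only ever left through HALF_OPEN in this model.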
Transitions

    ┌──────────┐
    │  CLOSED  │◄─────────── Successful workflow in HALF_OPEN
    └────┬─────┘             OR alarm returns to OK
         │
         │ CloudWatch alarm fires
         ▼
    ┌──────────┐
    │   OPEN   │ ◄── Alarm fires during HALF_OPEN
    └────┬─────┘
         │
         │ Recovery timeout OR alarm OK
         ▼
    ┌───────────┐
    │ HALF_OPEN │
    └───────────┘

Architecture

    ┌─────────────┐    SNS     ┌────────────────────┐    DynamoDB    ┌──────────────┐
    │ CloudWatch  │───────────►│  Circuit Breaker   │◄──────────────►│ Concurrency  │
    │    Alarm    │            │      Manager       │                │    Table     │
    └─────────────┘            └────────────────────┘                └──────────────┘
                                         │                                  ▲
                                         │ SNS                              │
                                         ▼                                  │
                                  ┌──────────────┐                          │
                                  │ AlertsTopic  │                          │
                                  │ (notify ops) │                          │
                                  └──────────────┘                          │
                                                                            │
    ┌─────────────┐                                                         │
    │     SQS     │────────────►┌─────────────────┐   check state before    │
    │    Queue    │             │ Queue Processor │─────processing──────────┘
    └─────────────┘             └─────────────────┘
                                         │
                                         │ if CLOSED or HALF_OPEN
                                         ▼
                                ┌─────────────────┐
                                │ Step Functions  │
                                │    Workflow     │
                                └─────────────────┘

- BedrockServiceOutageAlarm: CloudWatch MetricMath alarm on Bedrock error metrics (see Alarm threshold).
- CircuitBreakerManager: Lambda triggered by the alarm’s SNS topic and by a 5-minute EventBridge schedule. Manages state transitions and publishes notifications.
- ConcurrencyTable: Existing DynamoDB table. Circuit breaker state is stored in the item whose partition key (counter_id) is circuit_breaker.
- QueueProcessor: Reads state before starting workflows. If OPEN, it returns without starting the Step Functions execution, leaving the message in SQS.
- WorkflowTracker: On a successful workflow completion, if state is HALF_OPEN, transitions to CLOSED.
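The QueueProcessor gate described in the list above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's handler: `process_batch`, `read_state`, and `start_workflow` are hypothetical names standing in for the handler, the DynamoDB state read, and the Step Functions start call, and messages are plain dictionaries rather than a real SQS event. Reading the state once per batch is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def process_batch(
    messages: List[Message],
    read_state: Callable[[], str],
    start_workflow: Callable[[Message], None],
) -> Tuple[List[str], List[str]]:
    """Start a workflow per message unless the breaker is OPEN.

    Returns (started_ids, deferred_ids). Deferred messages are the ones
    that would stay in SQS for a later attempt.
    """
    started: List[str] = []
    deferred: List[str] = []
    state = read_state()  # assumption: one state read per batch
    for message in messages:
        if state == "OPEN":
            # Bedrock is considered unhealthy: do not start a workflow.
            deferred.append(message["id"])
            continue
        # CLOSED and HALF_OPEN both let traffic through.
        start_workflow(message)
        started.append(message["id"])
    return started, deferred


started, deferred = process_batch(
    [{"id": "doc-1"}, {"id": "doc-2"}],
    read_state=lambda: "OPEN",
    start_workflow=lambda message: None,
)
print(started, deferred)  # [] ['doc-1', 'doc-2']
```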
Alarm threshold
The BedrockServiceOutageAlarm uses MetricMath to sum the Bedrock error categories you opt into and compares the total to CircuitBreakerFailureThreshold.
Expression:

    <SU> * FILL(m1, 0) + <Thr> * FILL(m2, 0) + <QL> * FILL(m3, 0)

Where each coefficient is 1 if the corresponding trigger is enabled and 0 otherwise. Metrics m1/m2/m3 are BedrockServiceUnavailable, BedrockThrottling, and BedrockQuotaLimit under the stack namespace.
| Parameter | Default | Description |
|---|---|---|
| CircuitBreakerEnabled | false | Master switch. Set to true to provision the alarm, SNS topic, manager Lambda, and traffic gate. |
| CircuitBreakerTriggerServiceUnavailable | true | Count 503 ServiceUnavailableException errors toward threshold. |
| CircuitBreakerTriggerThrottling | false | Count ThrottlingException, TooManyRequestsException, RequestLimitExceeded. |
| CircuitBreakerTriggerQuotaLimit | false | Count ServiceQuotaExceededException. |
| CircuitBreakerFailureThreshold | 3 | Combined error count per 5-minute period to breach. |
| CircuitBreakerEvaluationPeriods | 1 | Consecutive periods that must breach. |
| CircuitBreakerRecoveryTimeoutSeconds | 300 | Seconds before automatic OPEN → HALF_OPEN transition. |
| CircuitBreakerErrorHandlerArn | (empty) | Optional Lambda ARN invoked on state changes for custom handling. |
Default behavior when enabled: 3 or more ServiceUnavailableException errors in a single 5-minute window open the breaker. Throttling and quota-limit errors are not counted by default because those usually indicate client-side load issues, not a Bedrock outage. Enable additional triggers to protect against sustained throttling or quota exhaustion.
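To make the threshold arithmetic concrete, the sketch below evaluates the alarm expression in plain Python. It is a simplified model, not CloudWatch itself: `Triggers`, `combined_errors`, and `alarm_breaches` are hypothetical names, a missing datapoint is represented as `None` to mirror `FILL(m, 0)`, and the alarm is assumed to breach only when every one of the most recent evaluation periods is at or above the threshold.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# (ServiceUnavailable, Throttling, QuotaLimit) counts for one 5-minute period.
Counts = Tuple[Optional[int], Optional[int], Optional[int]]


@dataclass(frozen=True)
class Triggers:
    # Defaults follow the parameter table above.
    service_unavailable: bool = True
    throttling: bool = False
    quota_limit: bool = False


def combined_errors(counts: Counts, triggers: Triggers) -> int:
    """Weighted sum for one period; None is treated as 0, like FILL(m, 0)."""
    su, thr, ql = (0 if c is None else c for c in counts)
    return (
        int(triggers.service_unavailable) * su
        + int(triggers.throttling) * thr
        + int(triggers.quota_limit) * ql
    )


def alarm_breaches(
    periods: Sequence[Counts],
    triggers: Triggers = Triggers(),
    threshold: int = 3,
    evaluation_periods: int = 1,
) -> bool:
    """True when the last `evaluation_periods` periods all reach the threshold."""
    if len(periods) < evaluation_periods:
        return False
    recent = periods[-evaluation_periods:]
    return all(combined_errors(c, triggers) >= threshold for c in recent)


print(alarm_breaches([(3, 10, 0)]))                              # True
print(alarm_breaches([(2, 10, 0)]))                              # False
print(alarm_breaches([(2, 10, 0)], Triggers(throttling=True)))   # True
```

With the defaults, ten throttling errors alone do not open the breaker, while three ServiceUnavailableException errors do. Enabling the throttling trigger makes the same period count as 2 + 10 = 12 errors.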
Error categories
The Bedrock client in idp_common emits category-specific CloudWatch metrics under the stack namespace whenever it catches a retryable error:
| Metric | Bedrock exception code(s) | Typical cause |
|---|---|---|
| BedrockServiceUnavailable | ServiceUnavailableException (503) | Bedrock service degradation or regional outage |
| BedrockThrottling | ThrottlingException, TooManyRequestsException, RequestLimitExceeded | Client-side throughput limits reached |
| BedrockQuotaLimit | ServiceQuotaExceededException | Account quota exhausted for the model |
These metrics are emitted unconditionally, independent of CircuitBreakerEnabled, so you can observe Bedrock error rates even when the circuit breaker is disabled.
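The mapping in the table above amounts to a lookup from exception code to metric name. The sketch below shows one way to express it; it is not the idp_common code, and `METRIC_BY_ERROR_CODE` and `metric_for_error` are hypothetical names. The input is assumed to have the shape of the `response` dictionary carried by a botocore `ClientError`.

```python
from typing import Optional

# Exception code -> CloudWatch metric name, taken from the table above.
METRIC_BY_ERROR_CODE = {
    "ServiceUnavailableException": "BedrockServiceUnavailable",
    "ThrottlingException": "BedrockThrottling",
    "TooManyRequestsException": "BedrockThrottling",
    "RequestLimitExceeded": "BedrockThrottling",
    "ServiceQuotaExceededException": "BedrockQuotaLimit",
}


def metric_for_error(error_response: dict) -> Optional[str]:
    """Return the metric to emit for an error response, or None if uncategorized."""
    code = error_response.get("Error", {}).get("Code")
    return METRIC_BY_ERROR_CODE.get(code)


print(metric_for_error({"Error": {"Code": "ThrottlingException"}}))  # BedrockThrottling
print(metric_for_error({"Error": {"Code": "ValidationException"}}))  # None
```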
Enabling the circuit breaker
Set CircuitBreakerEnabled=true at deploy time:

    aws cloudformation deploy \
      --stack-name my-idp-stack \
      --template-file template.yaml \
      --parameter-overrides \
        CircuitBreakerEnabled=true \
        CircuitBreakerFailureThreshold=3 \
        CircuitBreakerTriggerThrottling=true

The parameter can also be set through the IDP CLI or the console. When CircuitBreakerEnabled=false (the default), none of the circuit breaker resources (alarm, SNS topic, manager Lambda) are provisioned, and the Queue Processor skips the state check entirely.
Tuning guidance
For GovCloud or other environments experiencing intermittent Bedrock outages:

    CircuitBreakerFailureThreshold: 3
    CircuitBreakerEvaluationPeriods: 1           # 5-minute window
    CircuitBreakerRecoveryTimeoutSeconds: 300    # 5 minutes before probing

For stable environments where you want the breaker as a safety net only:

    CircuitBreakerFailureThreshold: 5
    CircuitBreakerEvaluationPeriods: 2           # 10-minute window
    CircuitBreakerRecoveryTimeoutSeconds: 600    # 10 minutes

Web UI
When CircuitBreakerEnabled=true, the document list header shows a live status badge that reflects the current breaker state via an AppSync subscription:
| Badge | State | Meaning |
|---|---|---|
| Green “Circuit: closed” | CLOSED | Normal operation |
| Blue “Circuit: recovering” | HALF_OPEN | Probing recovery |
| Red “Circuit: Bedrock outage” | OPEN (automatic) | Opened by BedrockServiceOutageAlarm; hover for lastError |
| Red “Circuit: manually paused” | OPEN (manual) | Opened via the admin Pause processing control; hover for the reason |
When CircuitBreakerEnabled=false the badge is hidden entirely.
Click the badge to open a details panel showing state, openedAt, lastCheckedAt, failureCount, recoveryAttempts, and lastError. Users in the Admin Cognito group additionally see three controls:
- Pause processing — forces OPEN (available when state is CLOSED or HALF_OPEN). Use before planned Bedrock changes or to quiesce the pipeline.
- Resume processing — forces CLOSED and resets failure/recovery counters. Use to clear a stuck OPEN state.
- Probe recovery — forces HALF_OPEN (available when state is OPEN). Use to test recovery before the automatic timeout.
Each control requires a reason that is persisted to DynamoDB (lastError field for pause; also logged) and broadcast over the existing SNS alerts topic. All transitions — including automatic ones from CloudWatch alarms, the scheduled health check, and the HALF_OPEN → CLOSED transition triggered by a successful workflow completion — fan out to every connected browser in real time.
Non-admins can view the panel but do not see the control buttons.
Manual operations
Reset the circuit breaker (force CLOSED):

    aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \
      --payload '{"action": "reset"}' response.json

Check current state:

    aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \
      --payload '{"action": "get_state"}' response.json

With AWS CLI v2, add --cli-binary-format raw-in-base64-out to the lambda invoke commands above so the JSON payload is sent as-is rather than being treated as base64.

Or read state directly from DynamoDB:

    aws dynamodb get-item \
      --table-name <ConcurrencyTable> \
      --key '{"counter_id": {"S": "circuit_breaker"}}'

Observability
CloudWatch metrics emitted by the circuit breaker (under the stack namespace):
- CircuitBreakerOpened: incremented each time the breaker transitions to OPEN
- CircuitBreakerHalfOpen: incremented on transition to HALF_OPEN
- CircuitBreakerClosed: incremented on transition to CLOSED
The AlertsTopic receives SNS notifications on every state transition so operators can subscribe email, SMS, or PagerDuty endpoints.
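Subscribing an endpoint to the AlertsTopic uses the standard SNS Subscribe operation. The sketch below only builds the request arguments so it can run without AWS credentials. The topic ARN and email address are placeholders, `subscription_request` is a hypothetical helper, and the boto3 call is shown in a comment. Restricting the protocol to email, sms, and https (for example, for a PagerDuty integration URL) is an assumption made for this example.

```python
ALLOWED_PROTOCOLS = {"email", "sms", "https"}  # subset chosen for this sketch


def subscription_request(topic_arn: str, protocol: str, endpoint: str) -> dict:
    """Build keyword arguments for the SNS Subscribe operation."""
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(f"unsupported protocol for this sketch: {protocol}")
    return {
        "TopicArn": topic_arn,
        "Protocol": protocol,
        "Endpoint": endpoint,
        "ReturnSubscriptionArn": True,
    }


# Placeholder values; replace with the AlertsTopic ARN from your stack outputs.
request = subscription_request(
    "arn:aws:sns:us-east-1:123456789012:ExampleAlertsTopic",
    "email",
    "ops@example.com",
)
print(request["Protocol"], request["Endpoint"])  # email ops@example.com

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("sns").subscribe(**request)
```

Email subscriptions stay pending until the recipient confirms them, so no notifications arrive before that confirmation.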