INC-2025-03-12-00142
Production Lambda cascade failure — API Gateway 5xx storm (us-east-1) · Account 123456789012
⬤ CRITICAL
● Active
⏱ Duration: 1h 48m
Incident overview
Incident start
Mar 12, 2025, 09:14:07 UTC
Last updated
Mar 12, 2025, 11:02:41 UTC
Response plan
prod-api-critical-response
Incident owner
AWS account
123456789012 (prod-platform)
Region
us-east-1 (N. Virginia)
Affected services
Lambda, API Gateway, DynamoDB, SQS
Customer impact
Customer-facing API degraded (~78% error rate)
Active runbook
lambda-timeout-cascade-v3 (Step 3 / 4)
Incident summary
Lambda function prod-api-handler-v2 began timing out at 09:14 UTC due to a dependency
bottleneck in the user-profile-table-prod DynamoDB table. Unindexed table scans
introduced by the 08:55 UTC deployment exhausted read capacity within minutes,
driving the ProvisionedThroughputExceededException rate to 94%. The resulting
timeouts triggered exponential retry storms across all 6 invocation shards,
consuming the full reserved concurrency pool (400/400) by 09:17 UTC.
API Gateway stage prod-api-v2-gateway/prod began returning 502/504 errors to all
downstream consumers. CloudWatch alarms fired on Lambda/Throttles,
Lambda/Errors, and ApiGateway/5XXError. Runbook steps 1–2 applied; step 3
pending execution.
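The retry storm summarized above is the classic failure mode that capped exponential backoff with full jitter is meant to dampen: without jitter, every timed-out invocation retries on the same schedule, and the synchronized waves re-saturate the concurrency pool. A minimal sketch of the technique (parameter values are illustrative assumptions, not taken from prod-api-handler-v2's actual retry configuration):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 0.1, cap: float = 20.0) -> list:
    """Capped exponential backoff with full jitter.

    Illustrative sketch only; base delay, cap, and retry count are
    assumptions, not the function's real settings.
    """
    delays = []
    for attempt in range(max_retries):
        # Full jitter: draw uniformly from [0, min(cap, base * 2^attempt)]
        # so retries from many concurrent clients decorrelate instead of
        # arriving in synchronized waves.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0.0, ceiling))
    return delays
```

Each client would sleep for `delays[attempt]` before its next retry; the cap bounds worst-case added latency while the jitter spreads the load.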
Triggered alarms
4 alarms in ALARM state
| Alarm name | State | Namespace | Metric | Threshold | Current value |
|---|---|---|---|---|---|
| prod-api-handler-v2-errors-critical | ● ALARM | AWS/Lambda | Errors | > 50 / 5 min | 1,847 |
| prod-api-handler-v2-throttles | ● ALARM | AWS/Lambda | Throttles | > 100 / 5 min | 2,219 |
| prod-api-gw-5xx-storm | ● ALARM | AWS/ApiGateway | 5XXError rate | > 1% / 1 min | 78.4 % |
| dynamodb-user-profile-read-latency | ● ALARM | AWS/DynamoDB | SuccessfulRequestLatency | > 200 ms p99 | 4,312 ms |
Incident timeline
2025-03-12 09:14:07 UTC
Incident created — alarm threshold breached
CloudWatch alarm prod-api-handler-v2-errors-critical transitioned to ALARM.
Incident auto-created via response plan prod-api-critical-response.
Impact: customer-facing API gateway experiencing cascading 5xx errors.
2025-03-12 09:17:44 UTC
Escalation triggered — on-call paged via PagerDuty
Escalation plan prod-platform-oncall-tier1 activated.
On-call engineer alex.morgan@company.com paged.
Acknowledged at 09:22 UTC (4 min 18 s response time).
2025-03-12 09:29:03 UTC
Root cause identified — DynamoDB hot partition
Hot partition on DynamoDB table user-profile-table-prod detected via X-Ray trace
analysis. ProvisionedThroughputExceededException rate: 94%.
Root cause: unindexed Scan operation introduced in commit d4f9a2b,
deployed at 08:55 UTC, draining read capacity (baseline: 1,000 RCU).
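Since the root cause is an unindexed Scan, the durable fix is for the handler to read only the partition it needs. A hypothetical sketch of the corrected request; the partition-key name `user_id` is an assumption, as the real key schema of user-profile-table-prod is not shown in this incident record:

```python
def profile_query_params(table_name: str, user_id: str) -> dict:
    """Build a DynamoDB Query request targeting one partition, in place
    of the full-table Scan introduced in commit d4f9a2b.

    The partition-key name "user_id" is hypothetical.
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": "user_id = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        # Eventually consistent reads cost half the RCUs of strongly
        # consistent ones, easing pressure while the table recovers.
        "ConsistentRead": False,
    }

# Executed with: boto3.client("dynamodb").query(**profile_query_params(...))
```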
2025-03-12 09:41:18 UTC
Runbook Step 1 complete — DynamoDB capacity scaled
Read capacity units scaled from 1,000 → 5,000 via SSM Automation
(AWS-ScaleDynamoDBTableCapacity). DynamoDB p99 latency improving:
4,312 ms → 2,140 ms. Lambda retry storm continuing; concurrency pool still saturated.
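Step 1's scale-up could also be applied directly through the DynamoDB UpdateTable API rather than the SSM document. A sketch of the request arguments; note that UpdateTable requires write capacity as well, and the incident record does not show the table's current WCU, so that value is an assumption the operator must verify first:

```python
def scale_capacity_params(read_units: int, write_units: int) -> dict:
    """UpdateTable arguments replicating Step 1 (1,000 -> 5,000 RCU).

    write_units must match the table's current provisioned writes
    (not recorded in this incident); changing it unintentionally
    would widen the blast radius.
    """
    return {
        "TableName": "user-profile-table-prod",
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    }

# Applied with: boto3.client("dynamodb").update_table(**scale_capacity_params(5000, current_wcu))
```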
2025-03-12 10:05:52 UTC
Runbook Step 2 complete — Lambda reserved concurrency reduced
Reserved concurrency for prod-api-handler-v2 temporarily reduced 400 → 50
to throttle retry load and allow DynamoDB to drain. Error rate responding:
78.4% → 34.1%. SQS dead-letter queue backlog: 14,892 messages.
Step 3 automated remediation output now available below.
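The effect of the concurrency clamp can be sanity-checked with Little's law: sustainable throughput is bounded by concurrency divided by per-request service time. Using DynamoDB's p99 latency as a pessimistic stand-in for mean service time, the clamp caps the function at roughly 23 requests per second, which is what starves the retry storm:

```python
def max_throughput_rps(concurrency: int, latency_ms: float) -> float:
    """Little's law upper bound: N concurrent executions, each holding
    a slot for latency_ms, complete at most N / (latency_ms / 1000)
    requests per second.

    Back-of-envelope only; using p99 latency understates the true
    bound, since most requests finish faster.
    """
    return concurrency / (latency_ms / 1000.0)

# With concurrency clamped to 50 and DynamoDB p99 at 2,140 ms, the bound
# is ~23 req/s, versus ~187 req/s at the original 400 concurrent executions.
```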
2025-03-12 11:01:27 UTC
Auto-remediation runbook posted Step 3 recommendations
AWS Systems Manager Automation lambda-timeout-cascade-v3 completed analysis
of the current metric state and posted Step 3 recommended actions.
Actions pending engineer execution.
Involved resources
| Type | Resource name / ARN | Region | Current status |
|---|---|---|---|
| Lambda | arn:aws:lambda:us-east-1:123456789012:function:prod-api-handler-v2 | us-east-1 | Throttled (concurrency 50/400) |
| API Gateway | arn:aws:apigateway:us-east-1::/restapis/ab3c9dxyzq/stages/prod | us-east-1 | Degraded (34.1% 5xx) |
| DynamoDB | arn:aws:dynamodb:us-east-1:123456789012:table/user-profile-table-prod | us-east-1 | Hot partition recovering (p99: 2,140 ms) |
| SQS | arn:aws:sqs:us-east-1:123456789012:prod-api-retry-queue | us-east-1 | Backlogged (14,892 messages in DLQ) |
| CloudWatch | arn:aws:cloudwatch:us-east-1:123456789012:alarm:prod-api-handler-v2-errors-critical | us-east-1 | IN ALARM |
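Once the concurrency clamp is lifted and the table is healthy, the 14,892-message DLQ backlog can be drained with SQS's native redrive API (StartMessageMoveTask), which moves messages from a dead-letter queue back to its source queue. A sketch of the request arguments; the redrive rate is an assumption, chosen low so the redriven traffic does not re-trigger the throttle alarms:

```python
def dlq_redrive_params(dlq_arn: str, rate_per_second: int = 20) -> dict:
    """StartMessageMoveTask arguments to drain a dead-letter queue back
    to its source queue (the default destination when none is given).

    rate_per_second is an assumption: keep it below what the clamped
    Lambda can absorb, or the backlog simply re-enters the DLQ.
    """
    return {
        "SourceArn": dlq_arn,
        "MaxNumberOfMessagesPerSecond": rate_per_second,
    }

# Started with: boto3.client("sqs").start_message_move_task(**dlq_redrive_params(dlq_arn))
```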
Runbook — Recommended actions
Step 3 of 4
⚙ AUTO-REMEDIATION ENGINE
Step 3 recommendations — posted by Systems Manager Automation
These recommendations were generated by the lambda-timeout-cascade-v3
automation runbook based on real-time metric analysis. All actions target account
123456789012 in us-east-1. Review carefully before executing: use the
Execute actions button above to apply all steps, or execute them individually
via the AWS CLI or console.
Notification history
4 contacts notified
| Contact | Status | Details |
|---|---|---|
| AM | Acknowledged | On-call engineer paged and acknowledged at 09:22 UTC (4 min 18 s). Joined incident channel. |
| # | Sent | Incident notification posted to channel. 14 team members currently active in the thread. |
| ES | Sent | Engineering leadership distribution list (12 recipients) notified. |
| JK | Acknowledged | Escalation to VP Engineering confirmed via SMS at 09:31 UTC. |