Incident overview
Incident start: Mar 12, 2025, 09:14:07 UTC
Last updated: Mar 12, 2025, 11:02:41 UTC
Incident owner:
AWS account: 123456789012 (prod-platform)
Region: us-east-1 (N. Virginia)
Affected services: Lambda, API Gateway, DynamoDB, SQS
Customer impact: customer-facing API degraded (~78% error rate)
Active runbook:
Incident summary
Lambda function prod-api-handler-v2 began timing out at 09:14 UTC due to a dependency bottleneck in the user-profile-table-prod DynamoDB table. Unindexed table scans introduced by the 08:55 UTC deployment caused ProvisionedThroughputExceededException rates to climb to 94%, exhausting read capacity within minutes. The timeouts triggered exponential retry storms across all 6 invocation shards, consuming the full reserved concurrency pool (400/400) by 09:17. API Gateway stage prod-api-v2-gateway/prod began returning 502/504 errors to all downstream consumers. CloudWatch alarms fired on Lambda/Throttles, Lambda/Errors, and ApiGateway/5XXError. Runbook steps 1–2 applied; step 3 pending execution.
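The retry-storm arithmetic in the summary can be sketched. Assuming each failed attempt is retried up to twice (a common client/SDK default; the retry count is an assumption, not taken from the incident record), the expected number of downstream attempts per client request grows quickly with the error rate:

```python
def amplification(error_rate: float, max_retries: int) -> float:
    """Expected downstream attempts per client request when each
    failed attempt is retried up to max_retries times."""
    attempts = 0.0
    p_reach = 1.0  # probability that the k-th attempt happens at all
    for _ in range(max_retries + 1):
        attempts += p_reach
        p_reach *= error_rate  # the next attempt only happens on failure
    return attempts

# At the observed 94% failure rate with 2 retries, each request costs
# ~2.82 downstream attempts, so effective load on Lambda and DynamoDB
# nearly triples exactly while capacity is already exhausted.
print(round(amplification(0.94, 2), 2))
```

This is why Step 2 below throttles concurrency rather than raising it: with retries amplifying load, adding capacity alone chases a moving target.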
Triggered alarms (4 in ALARM state)
Alarm name | State | Namespace | Metric | Threshold | Current value
prod-api-handler-v2-errors-critical | ALARM | AWS/Lambda | Errors | > 50 / 5 min | 1,847
prod-api-handler-v2-throttles | ALARM | AWS/Lambda | Throttles | > 100 / 5 min | 2,219
prod-api-gw-5xx-storm | ALARM | AWS/ApiGateway | 5XXError rate | > 1% / 1 min | 78.4%
dynamodb-user-profile-read-latency | ALARM | AWS/DynamoDB | SuccessfulRequestLatency (p99) | > 200 ms | 4,312 ms
Incident timeline
2025-03-12 09:14:07 UTC
Incident created — alarm threshold breached
CloudWatch alarm prod-api-handler-v2-errors-critical transitioned to ALARM. Incident auto-created via response plan prod-api-critical-response. Impact: customer-facing API gateway experiencing cascading 5xx errors.
2025-03-12 09:17:44 UTC
Escalation triggered — on-call paged via PagerDuty
Escalation plan prod-platform-oncall-tier1 activated. On-call engineer alex.morgan@company.com paged. Acknowledged at 09:22 UTC (4 min 18 s response time).
2025-03-12 09:29:03 UTC
Root cause identified — DynamoDB hot partition
DynamoDB table user-profile-table-prod hot partition detected via X-Ray trace analysis. ProvisionedThroughputExceededException rate: 94%. Root cause: unindexed Scan operation introduced in commit d4f9a2b deployed at 08:55 UTC draining read capacity (baseline: 1,000 RCU).
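To see why a single unindexed Scan can drain the table, the read-unit math can be sketched. This is a rough per-item approximation, assuming 1 RCU per 4 KB strongly consistent read (0.5 RCU eventually consistent) and a hypothetical table of 2 million 1 KB items — the table's actual item count and item size are not in the incident record:

```python
import math

def scan_rcu(item_count: int, avg_item_kb: float, consistent: bool = False) -> float:
    """Approximate read capacity units consumed by one full-table Scan.
    DynamoDB charges per 4 KB read unit; eventually consistent reads
    cost half a unit."""
    units = item_count * math.ceil(avg_item_kb / 4)
    return units if consistent else units / 2

# Hypothetical table: 2M items of 1 KB each. One eventually consistent
# full scan consumes ~1,000,000 read units -- spread over even a 60 s
# scan, that is ~16,600 RCU/s against a table provisioned at 1,000 RCU.
print(int(scan_rcu(2_000_000, 1.0)))
```

Under these assumptions even the scaled-up 5,000 RCU cannot absorb repeated scans, which is why the fix must also remove the Scan itself (the commit d4f9a2b regression), not just add capacity.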
2025-03-12 09:41:18 UTC
Runbook Step 1 complete — DynamoDB capacity scaled
Read capacity units scaled from 1,000 → 5,000 via SSM Automation (AWS-ScaleDynamoDBTableCapacity). DynamoDB p99 latency improving: 4,312 ms → 2,140 ms. Lambda retry storm continuing; concurrency pool still saturated.
2025-03-12 10:05:52 UTC
Runbook Step 2 complete — Lambda reserved concurrency reduced
Reserved concurrency for prod-api-handler-v2 temporarily reduced 400 → 50 to throttle retry load and allow DynamoDB to drain. Error rate responding: 78.4% → 34.1%. SQS dead-letter queue backlog: 14,892 messages. Step 3 automated remediation output now available below.
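The concurrency reduction works as load shedding: by Little's law, a Lambda function's maximum steady-state invocation rate is its concurrency divided by its average invocation duration. A minimal sketch — the 2 s average duration is an illustrative assumption, not a measured value from this incident:

```python
def max_invocations_per_sec(reserved_concurrency: int, avg_duration_s: float) -> float:
    """Little's law: throughput = concurrency / time-in-system."""
    return reserved_concurrency / avg_duration_s

# With concurrency capped at 50 and invocations averaging ~2 s while the
# table recovers (assumed), downstream pressure is held to ~25 req/s,
# versus ~200 req/s at the original 400 if durations stayed the same.
print(max_invocations_per_sec(50, 2.0))
```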
2025-03-12 11:01:27 UTC
Auto-remediation runbook posted Step 3 recommendations
AWS Systems Manager Automation lambda-timeout-cascade-v3 completed analysis of current metric state and posted Step 3 recommended actions. Actions pending engineer execution.
Involved resources
Type | Resource name / ARN | Region | Current status
Lambda | arn:aws:lambda:us-east-1:123456789012:function:prod-api-handler-v2 | us-east-1 | Throttled (reserved concurrency 50, down from 400)
API Gateway | arn:aws:apigateway:us-east-1::/restapis/ab3c9dxyzq/stages/prod | us-east-1 | Degraded (34.1% 5xx)
DynamoDB | arn:aws:dynamodb:us-east-1:123456789012:table/user-profile-table-prod | us-east-1 | Hot partition recovering (p99: 2,140 ms)
SQS | arn:aws:sqs:us-east-1:123456789012:prod-api-retry-queue | us-east-1 | Backlogged (14,892 messages in DLQ)
CloudWatch | arn:aws:cloudwatch:us-east-1:123456789012:alarm:prod-api-handler-v2-errors-critical | us-east-1 | IN ALARM
Runbook — Recommended actions (Step 3 of 4)
Automation document: lambda-timeout-cascade-v3
Generated at: Mar 12, 2025, 11:01:27 UTC
Step 3 recommendations — posted by Systems Manager Automation
ℹ️ These recommendations were generated by the lambda-timeout-cascade-v3 automation runbook from real-time metric analysis. All actions target account 123456789012 in us-east-1. Review carefully before executing; apply the steps via the AWS CLI or console.
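When Step 3 redrives the 14,892-message DLQ backlog, the replay rate should be paced against the remaining read-capacity headroom so the table is not re-saturated. A back-of-envelope sketch — the 3,000 RCU headroom and 10 RCU-per-message read cost are illustrative assumptions, not figures from the runbook output:

```python
import math

def redrive_plan(backlog: int, rcu_headroom: float, rcu_per_message: float):
    """Safe redrive rate (msg/s) and estimated seconds to drain the backlog."""
    rate = rcu_headroom / rcu_per_message
    drain_s = math.ceil(backlog / rate)
    return rate, drain_s

# Assumptions: ~3,000 RCU of the new 5,000 RCU provision is spare, and
# each replayed message drives ~10 RCU of profile-table reads.
rate, drain_s = redrive_plan(14_892, 3_000, 10)
print(int(rate), drain_s)
```

Under these numbers a ~300 msg/s redrive clears the queue in under a minute; with less headroom the same function argues for a slower, batched replay.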
Notification history (4 contacts notified)
Contact | Channel | Time | Status
alex.morgan@company.com | PagerDuty | 09:17:44 UTC | Acknowledged — on-call engineer paged; acknowledged at 09:22 UTC (4 min 18 s) and joined the incident channel.
#incident-prod-critical | Slack | 09:17:50 UTC | Sent — incident notification posted to the channel; 14 team members currently active in the thread.
engineering-leads@company.com | Email | 09:18:02 UTC | Sent — engineering leadership distribution list (12 recipients) notified.
j.kim@company.com (VP Engineering) | SMS + Email | 09:25:00 UTC | Acknowledged — escalation to VP Engineering confirmed; acknowledged via SMS at 09:31 UTC.