Incident overview
Incident start: Mar 12, 2025, 09:14:07 UTC
Last updated: Mar 12, 2025, 11:02:41 UTC
Incident owner:
AWS account: 123456789012 (prod-platform)
Region: us-east-1 (N. Virginia)
Affected services: Lambda, API Gateway, DynamoDB, SQS
Customer impact: customer-facing API degraded (~78% error rate)
Active runbook:
Incident summary
Lambda function prod-api-handler-v2 began timing out at 09:14 UTC due to a dependency bottleneck in the user-profile-table-prod DynamoDB table. Unindexed table scans introduced by the 08:55 UTC deployment caused ProvisionedThroughputExceededException rates to climb to 94%, exhausting read capacity within minutes. The timeouts triggered exponential retry storms across all 6 invocation shards, consuming the full reserved concurrency pool (400/400) by 09:17. API Gateway stage prod-api-v2-gateway/prod began returning 502/504 errors to all downstream consumers. CloudWatch alarms fired on Lambda/Throttles, Lambda/Errors, and ApiGateway/5XXError. Runbook steps 1–2 applied; step 3 pending execution.
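The retry-storm arithmetic in the summary can be sketched. Assuming each failed attempt is retried up to twice (a common client/SDK default; the retry count is an assumption, not taken from the incident record), the expected number of downstream attempts per client request grows quickly with the error rate:

```python
def amplification(error_rate: float, max_retries: int) -> float:
    """Expected downstream attempts per client request when each
    failed attempt is retried up to max_retries times."""
    attempts = 0.0
    p_reach = 1.0  # probability that the k-th attempt happens at all
    for _ in range(max_retries + 1):
        attempts += p_reach
        p_reach *= error_rate  # the next attempt only happens on failure
    return attempts

# At the observed 94% failure rate with 2 retries, each request costs
# ~2.82 downstream attempts, so effective load on Lambda and DynamoDB
# nearly triples exactly while capacity is already exhausted.
print(round(amplification(0.94, 2), 2))
```

This is why Step 2 below throttles concurrency rather than raising it: with retries amplifying load, adding capacity alone chases a moving target.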
Triggered alarms (4 in ALARM state)
Alarm name | State | Namespace | Metric | Threshold | Current value
prod-api-handler-v2-errors-critical | ALARM | AWS/Lambda | Errors | > 50 / 5 min | 1,847
prod-api-handler-v2-throttles | ALARM | AWS/Lambda | Throttles | > 100 / 5 min | 2,219
prod-api-gw-5xx-storm | ALARM | AWS/ApiGateway | 5XXError rate | > 1% / 1 min | 78.4%
dynamodb-user-profile-read-latency | ALARM | AWS/DynamoDB | SuccessfulRequestLatency (p99) | > 200 ms | 4,312 ms
Incident timeline
2025-03-12 09:14:07 UTC
Incident created — alarm threshold breached
CloudWatch alarm prod-api-handler-v2-errors-critical transitioned to ALARM. Incident auto-created via response plan prod-api-critical-response. Impact: customer-facing API gateway experiencing cascading 5xx errors.
2025-03-12 09:17:44 UTC
Escalation triggered — on-call paged via PagerDuty
Escalation plan prod-platform-oncall-tier1 activated. On-call engineer alex.morgan@company.com paged. Acknowledged at 09:22 UTC (4 min 18 s response time).
2025-03-12 09:29:03 UTC
Root cause identified — DynamoDB hot partition
DynamoDB table user-profile-table-prod hot partition detected via X-Ray trace analysis. ProvisionedThroughputExceededException rate: 94%. Root cause: unindexed Scan operation introduced in commit d4f9a2b deployed at 08:55 UTC draining read capacity (baseline: 1,000 RCU).
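To see why a single unindexed Scan can drain the table, the read-unit math can be sketched. This is a rough per-item approximation, assuming 1 RCU per 4 KB strongly consistent read (0.5 RCU eventually consistent) and a hypothetical table of 2 million 1 KB items — the table's actual item count and item size are not in the incident record:

```python
import math

def scan_rcu(item_count: int, avg_item_kb: float, consistent: bool = False) -> float:
    """Approximate read capacity units consumed by one full-table Scan.
    DynamoDB charges per 4 KB read unit; eventually consistent reads
    cost half a unit."""
    units = item_count * math.ceil(avg_item_kb / 4)
    return units if consistent else units / 2

# Hypothetical table: 2M items of 1 KB each. One eventually consistent
# full scan consumes ~1,000,000 read units -- spread over even a 60 s
# scan, that is ~16,600 RCU/s against a table provisioned at 1,000 RCU.
print(int(scan_rcu(2_000_000, 1.0)))
```

Under these assumptions even the scaled-up 5,000 RCU cannot absorb repeated scans, which is why the fix must also remove the Scan itself (the commit d4f9a2b regression), not just add capacity.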
2025-03-12 09:41:18 UTC
Runbook Step 1 complete — DynamoDB capacity scaled
Read capacity units scaled from 1,000 → 5,000 via SSM Automation (AWS-ScaleDynamoDBTableCapacity). DynamoDB p99 latency improving: 4,312 ms → 2,140 ms. Lambda retry storm continuing; concurrency pool still saturated.
2025-03-12 10:05:52 UTC
Runbook Step 2 complete — Lambda reserved concurrency reduced
Reserved concurrency for prod-api-handler-v2 temporarily reduced 400 → 50 to throttle retry load and allow DynamoDB to drain. Error rate responding: 78.4% → 34.1%. SQS dead-letter queue backlog: 14,892 messages. Step 3 automated remediation output now available below.
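The concurrency reduction works as load shedding: by Little's law, a Lambda function's maximum steady-state invocation rate is its concurrency divided by its average invocation duration. A minimal sketch — the 2 s average duration is an illustrative assumption, not a measured value from this incident:

```python
def max_invocations_per_sec(reserved_concurrency: int, avg_duration_s: float) -> float:
    """Little's law: throughput = concurrency / time-in-system."""
    return reserved_concurrency / avg_duration_s

# With concurrency capped at 50 and invocations averaging ~2 s while the
# table recovers (assumed), downstream pressure is held to ~25 req/s,
# versus ~200 req/s at the original 400 if durations stayed the same.
print(max_invocations_per_sec(50, 2.0))
```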
2025-03-12 11:01:27 UTC
Auto-remediation runbook posted Step 3 recommendations
AWS Systems Manager Automation lambda-timeout-cascade-v3 completed analysis of current metric state and posted Step 3 recommended actions. Actions pending engineer execution.
Involved resources
Type | Resource name / ARN | Region | Current status
Lambda | arn:aws:lambda:us-east-1:123456789012:function:prod-api-handler-v2 | us-east-1 | Throttled (reserved concurrency 50, down from 400)
API Gateway | arn:aws:apigateway:us-east-1::/restapis/ab3c9dxyzq/stages/prod | us-east-1 | Degraded (34.1% 5xx)
DynamoDB | arn:aws:dynamodb:us-east-1:123456789012:table/user-profile-table-prod | us-east-1 | Hot partition recovering (p99: 2,140 ms)
SQS | arn:aws:sqs:us-east-1:123456789012:prod-api-retry-queue | us-east-1 | Backlogged (14,892 messages in DLQ)
CloudWatch | arn:aws:cloudwatch:us-east-1:123456789012:alarm:prod-api-handler-v2-errors-critical | us-east-1 | IN ALARM
Runbook — Recommended actions (Step 3 of 4)
Automation document: lambda-timeout-cascade-v3
Generated at: Mar 12, 2025, 11:01:27 UTC
Step 3 recommendations — posted by Systems Manager Automation
ℹ️ These recommendations were generated by the lambda-timeout-cascade-v3 automation runbook from real-time metric analysis. All actions target account 123456789012 in us-east-1. Review carefully before executing; apply the steps via the AWS CLI or console.
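When Step 3 redrives the 14,892-message DLQ backlog, the replay rate should be paced against the remaining read-capacity headroom so the table is not re-saturated. A back-of-envelope sketch — the 3,000 RCU headroom and 10 RCU-per-message read cost are illustrative assumptions, not figures from the runbook output:

```python
import math

def redrive_plan(backlog: int, rcu_headroom: float, rcu_per_message: float):
    """Safe redrive rate (msg/s) and estimated seconds to drain the backlog."""
    rate = rcu_headroom / rcu_per_message
    drain_s = math.ceil(backlog / rate)
    return rate, drain_s

# Assumptions: ~3,000 RCU of the new 5,000 RCU provision is spare, and
# each replayed message drives ~10 RCU of profile-table reads.
rate, drain_s = redrive_plan(14_892, 3_000, 10)
print(int(rate), drain_s)
```

Under these numbers a ~300 msg/s redrive clears the queue in under a minute; with less headroom the same function argues for a slower, batched replay.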
Notification history (4 contacts notified)
Contact | Channel | Time | Status
alex.morgan@company.com | PagerDuty | 09:17:44 UTC | Acknowledged — on-call engineer paged; acknowledged at 09:22 UTC (4 min 18 s) and joined the incident channel.
#incident-prod-critical | Slack | 09:17:50 UTC | Sent — incident notification posted to the channel; 14 team members currently active in the thread.
engineering-leads@company.com | Email | 09:18:02 UTC | Sent — engineering leadership distribution list (12 recipients) notified.
j.kim@company.com (VP Engineering) | SMS + Email | 09:25:00 UTC | Acknowledged — escalation to VP Engineering confirmed; acknowledged via SMS at 09:31 UTC.