
How to Run Incident Post-Mortems That Actually Prevent Repeat Failures


The Post-Mortem Problem

Every engineering organization runs post-mortems. Very few do them well.

The typical post-mortem cycle looks like this:

  1. Major incident occurs
  2. Team scrambles to resolve it
  3. Someone schedules a post-mortem meeting for the following week
  4. The meeting happens (sometimes)
  5. A document is written with "root cause" and "action items"
  6. The document is filed in Google Drive or Confluence
  7. The action items never get done
  8. The same incident happens again three months later
  9. During the second post-mortem, someone finds the first post-mortem document
  10. Everyone is embarrassed

This isn't a discipline problem. It's a structural problem. Post-mortems fail because they're disconnected from the incidents they analyze and the systems that track follow-through.

Why Most Post-Mortems Fail

The Document Graveyard

Post-mortem documents written in Google Docs or Confluence are effectively write-once files. They get shared in a Slack channel, a few people read them, and then they join thousands of other documents in a shared drive nobody browses.

The information is technically accessible but practically invisible. When a similar incident occurs months later, nobody thinks to search for the old post-mortem.

The Action Item Black Hole

Action items from post-mortems compete with feature work, bug fixes, and other priorities. Without a forcing function — a system that surfaces incomplete action items and reminds owners — they quietly drop off the priority list.

"Improve monitoring for database connection pools" sounds important in the post-mortem meeting. Two weeks later, it's item #47 on a backlog nobody reviews.

The Blame Problem

Despite widespread agreement that post-mortems should be "blameless," many organizations still conduct them in ways that make individuals feel defensive. When people feel like they're being evaluated, they provide less information, not more.

The result: post-mortems that describe what happened but not why, because the "why" might point to a person's decision rather than a system's failure.

Anatomy of a Useful Post-Mortem

A post-mortem that actually prevents repeat failures has five sections, each serving a specific purpose.

1. Incident Summary

A brief, factual description of what happened:

  • What service(s) were affected
  • Duration of impact
  • User/customer impact (specific numbers if possible)
  • Severity level
  • Who was involved in the response

Keep this to 3-5 sentences. The summary should be understandable by someone unfamiliar with the system.
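The summary fields above map naturally onto a structured record. A minimal sketch (the class name and field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass

# Hypothetical structure capturing the incident summary fields listed above.
@dataclass
class IncidentSummary:
    services_affected: list[str]
    duration_minutes: int
    customer_impact: str      # specific numbers where possible
    severity: str             # e.g. "SEV-2"
    responders: list[str]
    description: str          # the 3-5 sentence plain-language summary

summary = IncidentSummary(
    services_affected=["checkout-api"],
    duration_minutes=47,
    customer_impact="~12% of checkout requests returned 500s",
    severity="SEV-2",
    responders=["Jane", "Bob"],
    description=(
        "Checkout API returned errors for 47 minutes after the database "
        "connection pool was exhausted by a background job."
    ),
)
```

Storing the summary as data rather than free text makes the metrics in the final section trivially computable later.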

2. Timeline

A chronological record of events, including:

  • When the issue started (detected vs. actual start)
  • When it was detected (and by what — monitoring, customer report, Slack message)
  • Key investigation steps and pivots
  • When the root cause was identified
  • When the fix was implemented
  • When service was fully restored
  • When the incident was officially closed

The timeline should come from your incident management tool's activity log, not from memory. Reconstructing timelines from memory introduces bias and inaccuracy.
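Once the timeline comes from a tool's event log rather than memory, key latencies fall out mechanically. A sketch, assuming a hypothetical export of `(timestamp, event)` pairs:

```python
from datetime import datetime

# Hypothetical activity-log export from an incident management tool.
events = [
    ("2024-05-01T14:02:00", "actual_start"),           # backfilled after RCA
    ("2024-05-01T14:11:00", "detected"),               # monitoring alert fired
    ("2024-05-01T14:40:00", "root_cause_identified"),
    ("2024-05-01T14:49:00", "fix_deployed"),
    ("2024-05-01T14:58:00", "service_restored"),
]

# Index events by name for easy interval arithmetic.
ts = {name: datetime.fromisoformat(t) for t, name in events}

detection_lag = (ts["detected"] - ts["actual_start"]).total_seconds() / 60
time_to_restore = (ts["service_restored"] - ts["detected"]).total_seconds() / 60

print(f"Detection lag: {detection_lag:.0f} min")      # 9 min
print(f"Time to restore: {time_to_restore:.0f} min")  # 47 min
```

The gap between actual start and detection is often the most actionable number in the whole timeline: it tells you whether the fix belongs in monitoring or in the system itself.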

3. Root Cause Analysis (RCA)

The RCA is the core of the post-mortem. It answers not just "what broke" but "why did it break" and "why didn't we catch it sooner."

A useful RCA technique is the "5 Whys":

  • Why did the service go down? The database connection pool was exhausted.
  • Why was the pool exhausted? A background job was holding connections open without releasing them.
  • Why wasn't the job releasing connections? The connection release logic had a bug in the error handling path.
  • Why wasn't the bug caught? The error path wasn't covered by integration tests.
  • Why wasn't it covered by tests? The testing guidelines don't require error path coverage for background jobs.

The root cause isn't "the database went down." It's "our testing guidelines have a gap for background job error handling." The first is a symptom. The second is actionable.
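The chain above can be captured as data so the RCA stays machine-readable alongside the incident. A minimal sketch of the same example:

```python
# The "5 Whys" chain from the example above, as (question, answer) pairs.
# By construction, the final answer is the actionable root cause.
five_whys = [
    ("Why did the service go down?",
     "The database connection pool was exhausted."),
    ("Why was the pool exhausted?",
     "A background job held connections open without releasing them."),
    ("Why wasn't the job releasing connections?",
     "The connection release logic had a bug in the error handling path."),
    ("Why wasn't the bug caught?",
     "The error path wasn't covered by integration tests."),
    ("Why wasn't it covered by tests?",
     "The testing guidelines don't require error path coverage "
     "for background jobs."),
]

root_cause = five_whys[-1][1]
print(root_cause)
```

Note that each answer becomes the subject of the next question; if a chain stalls before reaching a process or tooling gap, you have likely stopped at a symptom.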

4. Action Items

Each action item should have:

  • A clear, specific description (not vague improvements)
  • An owner (a person, not a team)
  • A due date (within 2-4 weeks — longer timelines rarely get done)
  • A priority level
  • A link to the tracking ticket (Jira, Linear, GitHub Issue)

Good action items:

Action                                                     | Owner | Due Date | Status
Add connection release test for background job error paths | Jane  | 2 weeks  | Open
Add connection pool exhaustion alert (threshold: 80%)      | Bob   | 1 week   | Open
Update testing guidelines to require error path coverage   | Alice | 3 weeks  | Open
Add connection pool metrics to Grafana dashboard           | Dave  | 1 week   | Open

Bad action items:

  • "Improve monitoring" (too vague)
  • "Be more careful with database connections" (not actionable)
  • "Look into better testing" (no owner, no deadline)
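The good/bad distinction above can even be enforced mechanically. A sketch of action-item validation (the class, rules, and thresholds are illustrative assumptions, not a real tool's API):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ActionItem:
    description: str
    owner: str        # a person, not a team
    due: date
    ticket_url: str   # link to Jira / Linear / GitHub Issue

def validate(item: ActionItem, today: date) -> list[str]:
    """Return a list of problems; an empty list means the item is acceptable."""
    problems = []
    if len(item.description.split()) < 4:
        problems.append("description too vague")
    if not item.owner:
        problems.append("no owner")
    if item.due > today + timedelta(weeks=4):
        problems.append("due date beyond 4 weeks rarely gets done")
    if not item.ticket_url:
        problems.append("no tracking ticket")
    return problems

# "Improve monitoring" fails every check.
bad = ActionItem("Improve monitoring", "", date(2024, 9, 1), "")
print(validate(bad, today=date(2024, 5, 1)))
```

Running a check like this before the post-mortem doc is closed is a cheap forcing function: vague items get rejected while the incident is still fresh.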

5. Lessons Learned

What did the team learn that applies beyond this specific incident? This section captures systemic insights:

  • Process gaps ("We didn't have a way to quickly identify which background jobs were running")
  • Tool gaps ("Our monitoring didn't alert on connection pool usage")
  • Knowledge gaps ("The team didn't know the background job framework's connection behavior")
  • Communication gaps ("The customer support team wasn't notified until 30 minutes into the incident")

Attaching RCAs to Incidents

The biggest improvement you can make to your post-mortem process is linking the RCA directly to the incident record in your incident management system.

When RCAs live inside the incident:

  • Future incidents from the same source automatically surface related RCAs
  • Action item completion is tracked alongside the incident
  • Recurring incidents are immediately obvious ("this is the third time this source has triggered an alert, and none of the previous action items were completed")
  • The RCA is discoverable through the alert timeline, not buried in a document management system

This is fundamentally different from writing a Google Doc and linking to it. The RCA becomes part of the incident's permanent record, visible to anyone who looks at that alert source in the future.

Building Blameless Culture

Blameless post-mortems aren't about avoiding accountability. They're about creating an environment where people provide complete information.

Practical guidelines for blameless post-mortems:

Focus on systems, not people. Instead of "Alice deployed without testing," write "the deployment process allowed changes without passing integration tests."

Assume good intent. Everyone involved was trying to do the right thing with the information they had. If someone made a decision that contributed to the incident, ask what information or tooling would have led to a different decision.

No spectators. Everyone in the post-mortem should have context. Don't invite people just for visibility — share the write-up instead.

Timebox strictly. 45-60 minutes maximum. If you can't cover everything, schedule a follow-up for specific technical deep-dives.

Assign a facilitator. Someone who wasn't directly involved in the incident should facilitate. This prevents the discussion from becoming a defense of past decisions.

From Post-Mortem to Prevention

The real measure of a post-mortem isn't the quality of the document. It's whether the same incident happens again.

Track these metrics:

  • Action item completion rate (target: 90%+ within 4 weeks)
  • Repeat incident rate (same root cause within 6 months)
  • Time from incident close to post-mortem completion (target: within 5 business days)
  • Number of action items per post-mortem (3-5 is ideal; more suggests scope creep)
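If post-mortems are stored as structured records, these metrics are a few lines of code. A sketch under the assumption that each record carries a root-cause label and action-item counts (the field names are hypothetical):

```python
# Hypothetical post-mortem records: root cause label, total action items,
# and how many were completed within 4 weeks.
postmortems = [
    {"root_cause": "pool-exhaustion", "items_total": 4, "items_done_4w": 4},
    {"root_cause": "bad-deploy",      "items_total": 5, "items_done_4w": 3},
    {"root_cause": "pool-exhaustion", "items_total": 3, "items_done_4w": 2},
]

total = sum(p["items_total"] for p in postmortems)
done = sum(p["items_done_4w"] for p in postmortems)
completion_rate = done / total

# A root cause that appears more than once is a repeat incident.
causes = [p["root_cause"] for p in postmortems]
repeats = len(causes) - len(set(causes))

print(f"Action item completion rate: {completion_rate:.0%}")  # 75%
print(f"Repeat incidents: {repeats}")                          # 1
```

In this example the completion rate (75%) is below the 90% target, and "pool-exhaustion" recurring once is exactly the signal that the first post-mortem's action items did not land.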

Close the Loop

OpShift includes RCA tracking attached directly to alerts and incidents. When you write a root cause analysis, it lives with the incident — not in a separate document. Action items are tracked, and recurring incidents from the same source surface previous RCAs automatically.

Combined with alert grouping, occurrence tracking, and activity timelines, OpShift gives your team a complete picture of each incident's lifecycle. $14/month for up to 50 users. No per-seat pricing. Start at opshift.io.
