
How to Run Incident Post-Mortems That Actually Prevent Repeat Failures


The Post-Mortem Problem

Every engineering organization runs post-mortems. Very few do them well.

The typical post-mortem cycle looks like this:

  1. Major incident occurs
  2. Team scrambles to resolve it
  3. Someone schedules a post-mortem meeting for the following week
  4. The meeting happens (sometimes)
  5. A document is written with "root cause" and "action items"
  6. The document is filed in Google Drive or Confluence
  7. The action items never get done
  8. The same incident happens again three months later
  9. During the second post-mortem, someone finds the first post-mortem document
  10. Everyone is embarrassed

This isn't a discipline problem. It's a structural problem. Post-mortems fail because they're disconnected from the incidents they analyze and the systems that track follow-through.

Why Most Post-Mortems Fail

The Document Graveyard

Post-mortem documents written in Google Docs or Confluence are effectively write-once files. They get shared in a Slack channel, a few people read them, and then they join thousands of other documents in a shared drive nobody browses.

The information is technically accessible but practically invisible. When a similar incident occurs months later, nobody thinks to search for the old post-mortem.

The Action Item Black Hole

Action items from post-mortems compete with feature work, bug fixes, and other priorities. Without a forcing function — a system that surfaces incomplete action items and reminds owners — they quietly drop off the priority list.

"Improve monitoring for database connection pools" sounds important in the post-mortem meeting. Two weeks later, it's item #47 on a backlog nobody reviews.

The Blame Problem

Despite widespread agreement that post-mortems should be "blameless," many organizations still conduct them in ways that make individuals feel defensive. When people feel like they're being evaluated, they provide less information, not more.

The result: post-mortems that describe what happened but not why, because the "why" might point to a person's decision rather than a system's failure.

Anatomy of a Useful Post-Mortem

A post-mortem that actually prevents repeat failures has five sections, each serving a specific purpose.

1. Incident Summary

A brief, factual description of what happened:

  • What service(s) were affected
  • Duration of impact
  • User/customer impact (specific numbers if possible)
  • Severity level
  • Who was involved in the response

Keep this to 3-5 sentences. The summary should be understandable by someone unfamiliar with the system.
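The summary fields above map naturally onto a structured record. A minimal sketch (the class name and field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass

# Hypothetical structure capturing the incident summary fields listed above.
@dataclass
class IncidentSummary:
    services_affected: list[str]
    duration_minutes: int
    customer_impact: str      # specific numbers where possible
    severity: str             # e.g. "SEV-2"
    responders: list[str]
    description: str          # the 3-5 sentence plain-language summary

summary = IncidentSummary(
    services_affected=["checkout-api"],
    duration_minutes=47,
    customer_impact="~12% of checkout requests returned 500s",
    severity="SEV-2",
    responders=["Jane", "Bob"],
    description=(
        "Checkout API returned errors for 47 minutes after the database "
        "connection pool was exhausted by a background job."
    ),
)
```

Storing the summary as data rather than free text makes the metrics in the final section trivially computable later.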

2. Timeline

A chronological record of events, including:

  • When the issue started (detected vs. actual start)
  • When it was detected (and by what — monitoring, customer report, Slack message)
  • Key investigation steps and pivots
  • When the root cause was identified
  • When the fix was implemented
  • When service was fully restored
  • When the incident was officially closed

The timeline should come from your incident management tool's activity log, not from memory. Reconstructing timelines from memory introduces bias and inaccuracy.
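Once the timeline comes from a tool's event log rather than memory, key latencies fall out mechanically. A sketch, assuming a hypothetical export of `(timestamp, event)` pairs:

```python
from datetime import datetime

# Hypothetical activity-log export from an incident management tool.
events = [
    ("2024-05-01T14:02:00", "actual_start"),           # backfilled after RCA
    ("2024-05-01T14:11:00", "detected"),               # monitoring alert fired
    ("2024-05-01T14:40:00", "root_cause_identified"),
    ("2024-05-01T14:49:00", "fix_deployed"),
    ("2024-05-01T14:58:00", "service_restored"),
]

# Index events by name for easy interval arithmetic.
ts = {name: datetime.fromisoformat(t) for t, name in events}

detection_lag = (ts["detected"] - ts["actual_start"]).total_seconds() / 60
time_to_restore = (ts["service_restored"] - ts["detected"]).total_seconds() / 60

print(f"Detection lag: {detection_lag:.0f} min")      # 9 min
print(f"Time to restore: {time_to_restore:.0f} min")  # 47 min
```

The gap between actual start and detection is often the most actionable number in the whole timeline: it tells you whether the fix belongs in monitoring or in the system itself.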

3. Root Cause Analysis (RCA)

The RCA is the core of the post-mortem. It answers not just "what broke" but "why did it break" and "why didn't we catch it sooner."

A useful RCA technique is the "5 Whys":

  • Why did the service go down? The database connection pool was exhausted.
  • Why was the pool exhausted? A background job was holding connections open without releasing them.
  • Why wasn't the job releasing connections? The connection release logic had a bug in the error handling path.
  • Why wasn't the bug caught? The error path wasn't covered by integration tests.
  • Why wasn't it covered by tests? The testing guidelines don't require error path coverage for background jobs.

The root cause isn't "the database went down." It's "our testing guidelines have a gap for background job error handling." The first is a symptom. The second is actionable.
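The chain above can be captured as data so the RCA stays machine-readable alongside the incident. A minimal sketch of the same example:

```python
# The "5 Whys" chain from the example above, as (question, answer) pairs.
# By construction, the final answer is the actionable root cause.
five_whys = [
    ("Why did the service go down?",
     "The database connection pool was exhausted."),
    ("Why was the pool exhausted?",
     "A background job held connections open without releasing them."),
    ("Why wasn't the job releasing connections?",
     "The connection release logic had a bug in the error handling path."),
    ("Why wasn't the bug caught?",
     "The error path wasn't covered by integration tests."),
    ("Why wasn't it covered by tests?",
     "The testing guidelines don't require error path coverage "
     "for background jobs."),
]

root_cause = five_whys[-1][1]
print(root_cause)
```

Note that each answer becomes the subject of the next question; if a chain stalls before reaching a process or tooling gap, you have likely stopped at a symptom.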

4. Action Items

Each action item should have:

  • A clear, specific description (not vague improvements)
  • An owner (a person, not a team)
  • A due date (within 2-4 weeks — longer timelines rarely get done)
  • A priority level
  • A link to the tracking ticket (Jira, Linear, GitHub Issue)

Good action items:

Action                                                     | Owner | Due Date | Status
Add connection release test for background job error paths | Jane  | 2 weeks  | Open
Add connection pool exhaustion alert (threshold: 80%)      | Bob   | 1 week   | Open
Update testing guidelines to require error path coverage   | Alice | 3 weeks  | Open
Add connection pool metrics to Grafana dashboard           | Dave  | 1 week   | Open

Bad action items:

  • "Improve monitoring" (too vague)
  • "Be more careful with database connections" (not actionable)
  • "Look into better testing" (no owner, no deadline)
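The good/bad distinction above can even be enforced mechanically. A sketch of action-item validation (the class, rules, and thresholds are illustrative assumptions, not a real tool's API):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ActionItem:
    description: str
    owner: str        # a person, not a team
    due: date
    ticket_url: str   # link to Jira / Linear / GitHub Issue

def validate(item: ActionItem, today: date) -> list[str]:
    """Return a list of problems; an empty list means the item is acceptable."""
    problems = []
    if len(item.description.split()) < 4:
        problems.append("description too vague")
    if not item.owner:
        problems.append("no owner")
    if item.due > today + timedelta(weeks=4):
        problems.append("due date beyond 4 weeks rarely gets done")
    if not item.ticket_url:
        problems.append("no tracking ticket")
    return problems

# "Improve monitoring" fails every check.
bad = ActionItem("Improve monitoring", "", date(2024, 9, 1), "")
print(validate(bad, today=date(2024, 5, 1)))
```

Running a check like this before the post-mortem doc is closed is a cheap forcing function: vague items get rejected while the incident is still fresh.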

5. Lessons Learned

What did the team learn that applies beyond this specific incident? This section captures systemic insights:

  • Process gaps ("We didn't have a way to quickly identify which background jobs were running")
  • Tool gaps ("Our monitoring didn't alert on connection pool usage")
  • Knowledge gaps ("The team didn't know the background job framework's connection behavior")
  • Communication gaps ("The customer support team wasn't notified until 30 minutes into the incident")

Attaching RCAs to Incidents

The biggest improvement you can make to your post-mortem process is linking the RCA directly to the incident record in your incident management system.

When RCAs live inside the incident:

  • Future incidents from the same source automatically surface related RCAs
  • Action item completion is tracked alongside the incident
  • Recurring incidents are immediately obvious ("this is the third time this source has triggered an alert, and none of the previous action items were completed")
  • The RCA is discoverable through the alert timeline, not buried in a document management system

This is fundamentally different from writing a Google Doc and linking to it. The RCA becomes part of the incident's permanent record, visible to anyone who looks at that alert source in the future.

Building Blameless Culture

Blameless post-mortems aren't about avoiding accountability. They're about creating an environment where people provide complete information.

Practical guidelines for blameless post-mortems:

Focus on systems, not people. Instead of "Alice deployed without testing," write "the deployment process allowed changes without passing integration tests."

Assume good intent. Everyone involved was trying to do the right thing with the information they had. If someone made a decision that contributed to the incident, ask what information or tooling would have led to a different decision.

No spectators. Everyone in the post-mortem should have context. Don't invite people just for visibility — share the write-up instead.

Timebox strictly. 45-60 minutes maximum. If you can't cover everything, schedule a follow-up for specific technical deep-dives.

Assign a facilitator. Someone who wasn't directly involved in the incident should facilitate. This prevents the discussion from becoming a defense of past decisions.

From Post-Mortem to Prevention

The real measure of a post-mortem isn't the quality of the document. It's whether the same incident happens again.

Track these metrics:

  • Action item completion rate (target: 90%+ within 4 weeks)
  • Repeat incident rate (same root cause within 6 months)
  • Time from incident close to post-mortem completion (target: within 5 business days)
  • Number of action items per post-mortem (3-5 is ideal; more suggests scope creep)
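If post-mortems are stored as structured records, these metrics are a few lines of code. A sketch under the assumption that each record carries a root-cause label and action-item counts (the field names are hypothetical):

```python
# Hypothetical post-mortem records: root cause label, total action items,
# and how many were completed within 4 weeks.
postmortems = [
    {"root_cause": "pool-exhaustion", "items_total": 4, "items_done_4w": 4},
    {"root_cause": "bad-deploy",      "items_total": 5, "items_done_4w": 3},
    {"root_cause": "pool-exhaustion", "items_total": 3, "items_done_4w": 2},
]

total = sum(p["items_total"] for p in postmortems)
done = sum(p["items_done_4w"] for p in postmortems)
completion_rate = done / total

# A root cause that appears more than once is a repeat incident.
causes = [p["root_cause"] for p in postmortems]
repeats = len(causes) - len(set(causes))

print(f"Action item completion rate: {completion_rate:.0%}")  # 75%
print(f"Repeat incidents: {repeats}")                          # 1
```

In this example the completion rate (75%) is below the 90% target, and "pool-exhaustion" recurring once is exactly the signal that the first post-mortem's action items did not land.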

Close the Loop

OpShift includes RCA tracking attached directly to alerts and incidents. When you write a root cause analysis, it lives with the incident — not in a separate document. Action items are tracked, and recurring incidents from the same source surface previous RCAs automatically.

Combined with alert grouping, occurrence tracking, and activity timelines, OpShift gives your team a complete picture of each incident's lifecycle. $14/month for up to 50 users. No per-seat pricing. Start at opshift.io.
