A post-mortem that produces “we’ll be more careful next time” is a post-mortem that will produce the same incident again. The output that matters is a specific technical prevention measure that makes the same incident impossible — not just less likely.
The Anchor contact form outage in month 2 was the first failure-mode review. The instinct was to write “be more careful when deploying PHP changes” as the prevention measure. That was rejected. The actual prevention measure was a synthetic form check running every 15 minutes that would catch this category of failure in under 20 minutes — regardless of how careful the next PHP deploy is.
A PHP change introduced an edge case that caused the AJAX contact form endpoint to return an error for all requests. The site remained up and returned HTTP 200 on the homepage. Standard uptime monitoring didn’t fire. The owner found out when a customer called. The fix took 4 minutes. The exposure window was 3 hours. The post-mortem produced 3 prevention measures, all implemented within 2 weeks.
Narrative accounts produce “be more careful.” Structured templates produce specific prevention measures. The template forces the question: what technical change would make this impossible next time?
# Incident Post-Mortem: [Short title]
**Date:** [YYYY-MM-DD]
**Duration:** [How long the incident lasted]
**Impact:** [What users/systems were affected and how]
**Severity:** P1 (site down) / P2 (feature broken) / P3 (degraded)
## Timeline (UTC)
[HH:MM] What happened
[HH:MM] When it was detected
[HH:MM] When the fix was deployed
[HH:MM] When the system was verified restored
## Root cause
One sentence. Not "code bug" — the specific condition that caused the failure.
Example: "OPCache was not flushed after a PHP file deploy; a function signature
change was not reflected in compiled bytecode, causing a fatal error on every
AJAX request to the contact endpoint."
## Contributing factors
What made this failure possible or harder to catch:
- No synthetic form check (would have caught this in < 20 min)
- OPCache flush not in deploy checklist (would have prevented this)
- No PHP error log alerting (would have caught this in < 30 min)
## Prevention measures
Each measure must be specific and testable. "Be more careful" is not a measure.
| Measure | Owner | Due | Status |
|--------------------------------------------------|-------|------------|--------|
| Add synthetic form check (15-min cron) | dev | 2026-03-15 | DONE |
| Add OPCache flush to deploy checklist | dev | 2026-03-10 | DONE |
| Add PHP error log scan + alert (30-min cron) | dev | 2026-03-20 | DONE |
## What we won't change
Explicit list of things that looked like problems but aren't worth fixing.
Prevents scope creep in the remediation work.
A prevention measure is specific enough to be implemented and tested. The test: can you check in 6 months whether the measure would have caught the original incident?
BAD: "Review PHP changes more carefully before deploying"
GOOD: "PHP syntax check (`php -l`) on all modified files as pre-deploy checklist item"
BAD: "Improve monitoring"
GOOD: "Synthetic form submission check running every 15 minutes;
alert on non-200 or success:false"
BAD: "Be more aware of OPCache"
GOOD: "opcache_reset() one-shot script in deploy sequence,
verified by curl before deletion"
The completed post-mortem goes in the repository — not in a Notion database, not in a Google Doc. In the repository, version-controlled, findable by any future developer doing a grep:
docs/
post-mortems/
2026-02-contact-form-outage.md
2026-04-inventory-sort-regression.md
2026-07-email-sequence-wrong-sender.md
The prevention measure is where all the value is. Everything else in the post-mortem is context. A post-mortem without specific, implemented prevention measures is a documentation exercise. The measure converts an incident from “that happened once” to “that category of failure is now actively guarded against.”
The “what we won’t change” section prevents the psychological pressure to fix everything that looked bad during the incident. Not everything that looked bad is worth fixing. Scoping the remediation clearly lets the critical measures get implemented without being diluted by low-ROI cleanup work.
3 post-mortems in 14 months (all P2 or higher). 7 prevention measures total, all implemented. 6 of the 7 covered failure categories were subsequently detected early — by the prevention measures themselves — before reaching production. The synthetic form check fired twice in 14 months: both times on PHP changes that would have caused the same 3-hour outage window as the original contact form incident.
docs/post-mortems/ is permanent and searchable by grep.Every lesson stays free — no account, no paywall, no email gate, ever. But if you’d rather have this system standing on your business than wire all 48 lessons yourself, leave your email. We’ll send you a direct line to a build — and you’ll be first to hear when we add new tools to the curriculum.
None of this gates a single lesson. The curriculum was free before you got here and it stays that way.
You came here to understand the system, and now you do. If you’d rather have it standing on your business than spend the next three months wiring it yourself, GAP Concierge is the same architecture from these lessons — a white-label AI agent that knows your catalog and captures your leads — set up for you, from $97/mo.
See GAP Concierge →