Lesson 11.3: Failure-Mode Reviews

The Anchor contact form outage in month 2 was the first failure-mode review. The instinct was to write “be more careful when deploying PHP changes” as the prevention measure. That was rejected. The actual prevention measure was a synthetic form check running every 15 minutes that would catch this category of failure in under 20 minutes — regardless of how careful the next PHP deploy is.

The situation

A PHP change introduced an edge case that caused the AJAX contact form endpoint to return an error for all requests. The site remained up and returned HTTP 200 on the homepage. Standard uptime monitoring didn’t fire. The owner found out when a customer called. The fix took 4 minutes. The exposure window was 3 hours. The post-mortem produced 3 prevention measures, all implemented within 2 weeks.

What I did

The failure-mode review template

Narrative accounts produce “be more careful.” Structured templates produce specific prevention measures. The template forces the question: what technical change would make this impossible next time?

Markdown — post-mortem template

# Incident Post-Mortem: [Short title]

**Date:** [YYYY-MM-DD]
**Duration:** [How long the incident lasted]
**Impact:** [What users/systems were affected and how]
**Severity:** P1 (site down) / P2 (feature broken) / P3 (degraded)

## Timeline (UTC)

[HH:MM] What happened
[HH:MM] When it was detected
[HH:MM] When the fix was deployed
[HH:MM] When the system was verified restored

## Root cause

One sentence. Not "code bug" — the specific condition that caused the failure.

Example: "OPCache was not flushed after a PHP file deploy; a function signature
change was not reflected in compiled bytecode, causing a fatal error on every
AJAX request to the contact endpoint."

## Contributing factors

What made this failure possible or harder to catch:
- No synthetic form check (would have caught this in < 20 min)
- OPCache flush not in deploy checklist (would have prevented this)
- No PHP error log alerting (would have caught this in < 30 min)

## Prevention measures

Each measure must be specific and testable. "Be more careful" is not a measure.

| Measure                                          | Owner | Due        | Status |
|--------------------------------------------------|-------|------------|--------|
| Add synthetic form check (15-min cron)           | dev   | 2026-03-15 | DONE   |
| Add OPCache flush to deploy checklist            | dev   | 2026-03-10 | DONE   |
| Add PHP error log scan + alert (30-min cron)     | dev   | 2026-03-20 | DONE   |

## What we won't change

Explicit list of things that looked like problems but aren't worth fixing.
Prevents scope creep in the remediation work.

Prevention measure quality standard

A prevention measure is specific enough to be implemented and tested. The test: can you check in 6 months whether the measure would have caught the original incident?

Quality standard — bad vs good

BAD:  "Review PHP changes more carefully before deploying"
GOOD: "PHP syntax check (`php -l`) on all modified files as pre-deploy checklist item"

BAD:  "Improve monitoring"
GOOD: "Synthetic form submission check running every 15 minutes;
      alert on non-200 or success:false"

BAD:  "Be more aware of OPCache"
GOOD: "opcache_reset() one-shot script in deploy sequence,
      verified by curl before deletion"

Filing the review

The completed post-mortem goes in the repository — not in a Notion database, not in a Google Doc. In the repository, version-controlled, findable by any future developer doing a grep:

Directory layout

docs/
  post-mortems/
    2026-02-contact-form-outage.md
    2026-04-inventory-sort-regression.md
    2026-07-email-sequence-wrong-sender.md

Why it matters

The prevention measure is where all the value is. Everything else in the post-mortem is context. A post-mortem without specific, implemented prevention measures is a documentation exercise. The measure converts an incident from “that happened once” to “that category of failure is now actively guarded against.”

The “what we won’t change” section prevents the psychological pressure to fix everything that looked bad during the incident. Not everything that looked bad is worth fixing. Scoping the remediation clearly lets the critical measures get implemented without being diluted by low-ROI cleanup work.

The Anchor build

3 post-mortems in 14 months (all P2 or higher). 7 prevention measures total, all implemented. 6 of the 7 covered failure categories were subsequently detected early — by the prevention measures themselves — before reaching production. The synthetic form check fired twice in 14 months: both times on PHP changes that would have caused the same 3-hour outage window as the original contact form incident.

Do this, not that

Require specific prevention measures, not behavioral commitments. “Be more careful” expires at the end of sprint. A synthetic check runs every 15 minutes indefinitely.
Do the post-mortem within 48 hours of the incident. The contributing factors are clearest while the incident is recent. A post-mortem 2 weeks later is an exercise in archaeology.
File post-mortems in the repository, not in project management tools. Jira tickets get closed. Notion pages drift. A file in docs/post-mortems/ is permanent and searchable by grep.
Include the “what we won’t change” section. Without it, every post-mortem expands into a refactor session. Explicit scope boundaries keep the remediation focused on what actually matters.

← Previous Lesson 11.2 — Refactor vs Ship Decisions Next → Lesson 11.4 — Billing Transparency

When you’re ready to build

The lessons are yours. When you want it built, we’re here.

Every lesson stays free — no account, no paywall, no email gate, ever. But if you’d rather have this system standing on your business than wire all 48 lessons yourself, leave your email. We’ll send you a direct line to a build — and you’ll be first to hear when we add new tools to the curriculum.

None of this gates a single lesson. The curriculum was free before you got here and it stays that way.

Done learning how it’s built? We’ll build it.

You came here to understand the system, and now you do. If you’d rather have it standing on your business than spend the next three months wiring it yourself, GAP Concierge is the same architecture from these lessons — a white-label AI agent that knows your catalog and captures your leads — set up for you, from $97/mo.

See GAP Concierge →