How a Kubernetes 1.24 upgrade broke Reddit for > 5h

In a previous role as a senior infrastructure architect, one of my responsibilities was to review and approve post-incident reports, and I've come to appreciate how valuable they can be for improving future reliability.

Nothing motivates positive change like the pain of an unplanned outage which, when you dig deep enough, could have been avoided entirely had you made different choices in the past.

To keep myself sharp in this role, I would pick public post-mortems and analyze them for lessons and ideas that my team could use without having to make the same mistakes first. I published some of these reviews on my blog, but since I transitioned to consulting and away from SRE-focused roles, I haven't kept up with my reading.

This week I read "You Broke Reddit: The Pi-Day Outage" and decided to revive my old habit of reviewing and commenting on nice, juicy outage reports.

Let's get into it...