Funky Penguin's Geek Cookblog

How a Kubernetes 1.24 upgrade broke Reddit for > 5h

In a previous role as a senior infrastructure architect, one of my responsibilities was to review and approve post-incident reports, and I've come to appreciate how valuable they can be for improving future reliability.

Nothing motivates positive change like the pain of an unplanned outage, which, when you dig deep enough, could have been entirely avoided had you made different choices in the past.

To keep myself sharp in this role, I would pick public post-mortems and analyze them for lessons and ideas that my team could use, without having to make the same mistakes first. I published some of these reviews on my blog, but since I've transitioned to consulting and away from SRE-focused roles, I haven't kept up with my reading.

This week I read You Broke Reddit: The Pi-Day Outage, and decided to revive my old habit of reviewing and commenting on nice, juicy outage reports.

Let's get into it...

When helm says "no" (failed to delete release)

My beloved "Penguin Patrol" bot, which I use to give GitHub / Patreon / Ko-Fi supporters access to the premix repo, was deployed on a Kube 1.19 DigitalOcean cluster three years ago. At the time, the Ingress API was at v1beta1.

Fast-forward to today, several Kubernetes minor version upgrades later (the cluster is on 1.23 currently, and Ingress is at v1), and I discovered that I was unable to upgrade the chart, since helm complained that the previous release referred to deprecated APIs.

Worse, helm wouldn't let me delete and re-install the release - because of those damned deprecated APIs!
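To make the failure mode concrete, here is a minimal sketch (not the fix from the full post) that scans rendered manifest text for Ingress objects still using the API versions removed in Kubernetes 1.22. The function name, the sample manifest, and the `penguin-patrol` resource name are all hypothetical, and the plain-text matching is a simplification that assumes unquoted `apiVersion` and `kind` values.

```python
import re

# Ingress API versions that stopped being served in Kubernetes 1.22.
REMOVED_INGRESS_APIS = {"extensions/v1beta1", "networking.k8s.io/v1beta1"}


def find_removed_ingress_apis(manifest: str) -> list[str]:
    """Return removed apiVersions used by Ingress documents in multi-document YAML text."""
    hits = []
    # Split on YAML document separators; a simple text scan, not a YAML parser.
    for doc in re.split(r"^---\s*$", manifest, flags=re.MULTILINE):
        api = re.search(r"^apiVersion:\s*(\S+)", doc, flags=re.MULTILINE)
        kind = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if api and kind and kind.group(1) == "Ingress" and api.group(1) in REMOVED_INGRESS_APIS:
            hits.append(api.group(1))
    return hits


# Hypothetical manifest, standing in for what an old release might have stored.
SAMPLE = """\
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: penguin-patrol
---
apiVersion: v1
kind: Service
metadata:
  name: penguin-patrol
"""

print(find_removed_ingress_apis(SAMPLE))
```

In practice, the text to scan could come from something like `helm get manifest` for the affected release, which prints the manifest helm recorded at install time.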

Here's how I fixed it...

Added recipe for Nomie (swarm)

Do you wish you had a chart showing your exercise, weight, or pooping 💩 trends over the past month? Nomie is a beautiful life/self-tracking app, an 8-year labor of love from developer Brandon Corbin.

Brandon has recently shut down the commercially hosted version of Nomie, but open-sourced all the code, so one of the geekier alternatives, buoyed by the still-passionate community of users, is to run your own Nomie instance...

That time when a Proxmox upgrade silently capped my MTU

I feed and water several Proxmox clusters, one of which was recently upgraded to PVE 7.3. This cluster runs VMs used to build a CI instance of a bare-metal Kubernetes cluster I support. Every day the CI cluster is automatically destroyed and rebuilt, to give assurance that our recent changes haven't introduced a failure which would prevent a re-install.

Since the PVE 7.3 upgrade, the CI cluster has been failing to build, because the out-of-cluster Vault instance we use to secure etcd secrets failed to sync. After much debugging, I'd like to present a variation of a famous haiku to summarize the problem:

It's not MTU!
There's no way it's MTU!
It was MTU.

Here's how it went down...

Made changes to your CoreDNS deployment / images? You may find kubeadm uncooperative...

Are you trying to join a new control-plane node to a kubeadm-installed cluster, and seeing an error like this?

start version '8916c89e1538ea3941b58847e448a2c6d940c01b8e716b20423d2d8b189d3972' not supported
unable to get list of changes to the configuration.
k8s.io/kubernetes/cmd/kubeadm/app/phases/addons/dns.isCoreDNSConfigMapMigrationRequired

You've changed your CoreDNS deployment, haven't you? You're using a custom image, or an image digest, or you're using an admission webhook to mutate pods upon recreation?
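The long hex string in that error looks like an image digest being mistaken for a version. A rough sketch of how that could happen, assuming the version is taken to be whatever follows the last `:` in the image reference (this is an illustration, not kubeadm's actual code, and the image repository shown is an assumption):

```python
def naive_version_from_image(image: str) -> str:
    """Treat whatever follows the last ':' as the version, minus a leading 'v'."""
    return image.rsplit(":", 1)[-1].removeprefix("v")


# A conventionally tagged image yields a usable version string.
tagged = "registry.k8s.io/coredns/coredns:v1.8.6"

# A digest-pinned image yields the digest instead (hex taken from the error above).
pinned = (
    "registry.k8s.io/coredns/coredns@sha256:"
    "8916c89e1538ea3941b58847e448a2c6d940c01b8e716b20423d2d8b189d3972"
)

print(naive_version_from_image(tagged))
print(naive_version_from_image(pinned))
```

With the tagged image the result is a semantic version; with the digest-pinned one it is the 64-character hash, which matches the "start version" reported in the error.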

Here's what it means, and how to work around it...

Added recipe for Invidious (swarm)

Are you tired of second-guessing the YouTube links your friends send you, afraid that you'll forever see weird videos recommended to you as a result? I found myself avoiding unknown links for this reason, and so deployed an instance of Invidious to act as a private, non-tracking frontend to YouTube...