That time when a Proxmox upgrade silently capped my MTU
I feed and water several Proxmox clusters, one of which was recently upgraded to PVE 7.3. This cluster runs VMs used to build a CI instance of a bare-metal Kubernetes cluster I support. Every day the CI cluster is automatically destroyed and rebuilt, to give assurance that our recent changes haven't introduced a failure which would prevent a re-install.
Since the PVE 7.3 upgrade, the CI cluster has been failing to build, because the out-of-cluster Vault instance we use to secure etcd secrets failed to sync. After much debugging, I'd like to present a variation of a famous haiku to summarize the problem:
It's not MTU!
There's no way it's MTU!
It was MTU.
Here's how it went down...
Vault fails to sync
We're using HashiCorp Vault in HA mode with integrated (raft) storage, and AWS KMS auto-unsealing. All you have to do is initialize Vault on your first node, and then include something like this in the other nodes:
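A minimal sketch of the relevant stanza (the leader addresses match the CI nodes you'll see in the logs below; the data path, node_id, and the KMS region/key ID are placeholders for whatever your environment uses):

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-2"

  retry_join {
    leader_api_addr = "https://192.168.20.11:8200"
  }
  retry_join {
    leader_api_addr = "https://192.168.20.12:8200"
  }
  retry_join {
    leader_api_addr = "https://192.168.20.13:8200"
  }
}

seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "<your-kms-key-id>"
}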
When Vault starts up, it'll look for the leaders listed in the retry_join config, attempt to connect to them, and use raft magic to unseal itself and join the raft.
On our victim cluster, instead of happily joining the raft, the other nodes were logging messages like this:
Feb 22 00:38:04 plum vault[32791]: 2023-02-22T00:38:04.126Z [INFO]  core: stored unseal keys supported, attempting fetch
Feb 22 00:38:04 plum vault[32791]: 2023-02-22T00:38:04.126Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
Feb 22 00:38:04 plum vault[32791]: 2023-02-22T00:38:04.648Z [ERROR] core: failed to retry join raft cluster: retry=2s err="failed to send answer to raft leader node: context deadline exceeded"
Feb 22 00:38:06 plum vault[32791]: 2023-02-22T00:38:06.648Z [INFO]  core: security barrier not initialized
Feb 22 00:38:06 plum vault[32791]: 2023-02-22T00:38:06.655Z [INFO]  core: attempting to join possible raft leader node: leader_addr=https://192.168.20.11:8200
Feb 22 00:38:06 plum vault[32791]: 2023-02-22T00:38:06.655Z [INFO]  core: attempting to join possible raft leader node: leader_addr=https://192.168.20.13:8200
Feb 22 00:38:06 plum vault[32791]: 2023-02-22T00:38:06.655Z [INFO]  core: attempting to join possible raft leader node: leader_addr=https://192.168.20.12:8200
Feb 22 00:38:06 plum vault[32791]: 2023-02-22T00:38:06.657Z [ERROR] core: failed to get raft challenge: leader_addr=https://192.168.20.13:8200
Feb 22 00:38:06 plum vault[32791]: error=
Feb 22 00:38:06 plum vault[32791]: | error during raft bootstrap init call: Error making API request.
Feb 22 00:38:06 plum vault[32791]: |
Feb 22 00:38:06 plum vault[32791]: | URL: PUT https://192.168.20.13:8200/v1/sys/storage/raft/bootstrap/challenge
Feb 22 00:38:06 plum vault[32791]: | Code: 503. Errors:
Feb 22 00:38:06 plum vault[32791]: |
Feb 22 00:38:06 plum vault[32791]: | * Vault is sealed
Feb 22 00:38:06 plum vault[32791]:
Note
In hindsight, the context deadline exceeded was a clue, but it was hidden in the noise of multiple nodes failing to join each other.
I discovered that an identical CI cluster on a non-upgraded Proxmox host didn't exhibit the error.
After exhausting all the conventional possibilities (Vault CLI version mismatch, SSL issues, DNS), I decided to check whether MTU was still working (this cluster had worked fine until recently).
Using ping -M do -s 4000 <target>, I was surprised to discover that my CI VMs could not ping each other with unfragmented, large payloads. I checked the working cluster - in that environment, I could pass large ping payloads.
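For illustration, the failing test looked something like this, run between two of the CI VMs from the logs above (output abridged and representative):

ping -M do -s 4000 192.168.20.13
PING 192.168.20.13 (192.168.20.13) 4000(4028) bytes of data.
^C
--- 192.168.20.13 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3065ms

With -M do the Don't Fragment bit is set, so a 4000-byte payload either arrives in one piece or not at all - on the broken cluster, it was "not at all".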
It's MTU, right?
"Ha! The VMs MTUs must be set wrong!" - me
Nope. Checked that:
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: no
      optional: true
      mtu: 8894
    ens19:
      dhcp4: no
      optional: true
      mtu: 8894
  bonds:
    cluster:
      mtu: 8894
      <snip the irrelevant stuff>
Proxmox upgrade broke MTU?
"OK, so maybe the proxmox upgrade has removed the MTU we set on the bridge." - also me
Aha, what is going on here? Every interface seems to have been set to an MTU of 1500.
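One quick way to see this on the hypervisor is to list the VMs' tap interfaces and their MTUs; something like the following (the tap names and VM IDs here are illustrative):

ip -o link show | grep tap | awk '{print $2, $4, $5}'
tap101i0: mtu 1500
tap101i1: mtu 1500
tap102i0: mtu 1500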
In PVE 7.3, we now have the option to set the MTU of each network interface. This defaults to 1500, but by setting it to the magic number of 1, the interface MTU will align with the MTU of its bridge.
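On the CLI, the equivalent knob appears to be the mtu= option on the VM's network device; a sketch, assuming a VM ID of 101 and placeholder MAC/bridge values:

# either pin an explicit MTU on the NIC...
qm set 101 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr1,mtu=9000
# ...or use the magic value of 1 to inherit the bridge's MTU
qm set 101 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr1,mtu=1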
So we should be able to just set each interface's MTU to 9000, right?
Well no. Not if we're using VLANs (which, of course, we are, in a multi-networked replica of a real cluster, complete with multi-homed virtual firewalls!).
It seems that the addition of MTU settings to Proxmox VirtIO NICs via the UI has broken the way that MTU is set on the tap interfaces on the Proxmox hypervisor.
If you use a VLAN tag, your MTU is fixed at 1500, regardless of what you set. If you don't use a VLAN tag, your MTU is set to the MTU of your bridge (the old behaviour), regardless of what you set.
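Roughly what that looked like on the hypervisor, comparing a tagged NIC with an untagged one (the tap names and the bridge's 9000 MTU are illustrative):

# VM NIC with a VLAN tag: capped at 1500, whatever mtu= is set to
ip link show tap101i0 | grep -o 'mtu [0-9]*'
mtu 1500
# VM NIC without a VLAN tag: inherits the bridge MTU (the old behaviour)
ip link show tap102i0 | grep -o 'mtu [0-9]*'
mtu 9000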
I was hoping to be able to post a workaround here, a method to allow us to continue to use jumbo frames in our VLAN-aware CI cluster environment. Unfortunately, I've not been able to find a way to make this work, so until the bug is fixed, I've had to revert my CI clusters to a 1500 MTU, which now represents a minor deviation from our production clusters :(
Summary
MTU issues cause mysterious failures in mysterious ways. Always test your MTU using ping -M do -s <a big number> <target>, to ensure that you can actually pass larger-than-normal packets!