So you've got multiple nodes in your Kubernetes cluster, you throw a bunch of workloads in there, and Kubernetes schedules the workloads onto the nodes, making sensible choices based on load, affinity, etc.
Note that this scheduling only happens when a pod is created. Once a pod has been scheduled to a node, Kubernetes won't take it away from that node. This can result in "sub-optimal" node loading, especially if you're elastically expanding your nodes themselves, or working through rolling updates.
Descheduler is used to rebalance clusters by evicting pods that can potentially be scheduled on better nodes.
Here are some reasons you might need to rebalance your cluster:
Some nodes are under- or over-utilized.
The original scheduling decision no longer holds true, because taints or labels have been added to or removed from nodes, and pod/node affinity requirements are no longer satisfied.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
Descheduler works by "kicking out" (evicting) certain pods, based on a policy you feed it, depending on what you want to achieve. (You may want to converge as many pods as possible onto as few nodes as possible, or to distribute load more evenly across a static set of nodes.)
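For illustration, here's a minimal sketch of what such a policy can look like, using descheduler's v1alpha1 policy format. The choice of strategies here is just an example, not a recommendation:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  RemoveDuplicates:
    enabled: true # spread duplicate replicas of the same ReplicaSet across nodes
  RemovePodsViolatingNodeAffinity:
    enabled: true
    params:
      nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
```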
We need a namespace to deploy our HelmRelease and associated YAMLs into. Per the flux design, I create this example YAML in my flux repo at /bootstrap/namespaces/namespace-descheduler.yaml:
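A minimal namespace YAML looks something like this:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: descheduler
```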
We're going to install the Descheduler helm chart from the descheduler repository, so I create the following in my flux repo (assuming it doesn't already exist):
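A HelmRepository pointing at the upstream chart repo might look like this sketch (the kubernetes-sigs URL is where the chart is published at the time of writing):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: descheduler
  namespace: flux-system
spec:
  interval: 15m
  url: https://kubernetes-sigs.github.io/descheduler/
```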
Now that the "global" elements of this deployment (just the HelmRepository in this case) have been defined, we do some "flux-ception", and go one layer deeper, adding another Kustomization, telling flux to deploy any YAMLs found in the repo at /descheduler/. I create this example Kustomization in my flux repo:
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: descheduler
  namespace: flux-system
spec:
  interval: 30m
  path: ./descheduler
  prune: true # remove any elements later removed from the above path
  timeout: 10m # if not set, this defaults to interval duration, which is 1h
  sourceRef:
    kind: GitRepository
    name: flux-system
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2beta1
      kind: HelmRelease
      name: descheduler
      namespace: descheduler
```
Descheduler HelmRelease
Lastly, having set the scene above, we define the HelmRelease which will actually deploy descheduler into the cluster. We start with a basic HelmRelease YAML, like this example:
/descheduler/helmrelease-descheduler.yaml
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: descheduler
  namespace: descheduler
spec:
  chart:
    spec:
      chart: descheduler
      version: 0.27.x # auto-update to semver bugfixes only
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: flux-system
  interval: 15m
  timeout: 5m
  releaseName: descheduler
  values: # paste contents of upstream values.yaml below, indented 4 spaces
```
If we deploy this HelmRelease as-is, we'll inherit every default from the upstream Descheduler helm chart. That's rarely what we want, so my preference is to take the entire contents of the Descheduler helm chart's values.yaml, and paste it (indented) under the values key. This way I can make my own changes in the context of the entire values.yaml, rather than cherry-picking just the items I want to change, which makes future chart upgrades simpler.
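To illustrate the nesting, the top of the pasted values ends up looking something like this (the kind and schedule keys shown are the upstream defaults at the time of writing):

```yaml
spec:
  # ... chart spec as above ...
  values:
    kind: CronJob # upstream default
    schedule: "*/2 * * * *"
```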
Why not put values in a separate ConfigMap?
Didn't you previously advise to put helm chart values into a separate ConfigMap?
Yes, I did. And in practice, I've changed my mind.
Why? Because having the helm values directly in the HelmRelease offers the following advantages:
If you use the YAML extension in VSCode, you'll see a full path to the YAML elements, which can make grokking complex charts easier.
When flux detects a change to a value in a HelmRelease, this forces an immediate reconciliation of the HelmRelease, as opposed to the ConfigMap solution, which requires waiting on the next scheduled reconciliation.
Renovate can parse HelmRelease YAMLs and create PRs when they contain docker image references which can be updated.
In practice, adapting a HelmRelease to match upstream chart changes is no different to adapting a ConfigMap, and so there's no real benefit to splitting the chart values into a separate ConfigMap, IMO.
Then work your way through the values you pasted, and change any which are specific to your configuration.
Install Descheduler!
Commit the changes to your flux repository, and either wait for the reconciliation interval, or force a reconciliation using flux reconcile source git flux-system. You should see the kustomization appear...
The following sections detail suggested changes to the values pasted into /descheduler/helmrelease-descheduler.yaml from the Descheduler helm chart's values.yaml. The values are already indented correctly to be copied, pasted into the HelmRelease, and adjusted as necessary.
Tip
At the time of writing, the chart runs descheduler as a CronJob by default, and ships a default policy under the deschedulerPolicy key of its values.yaml. If you'd prefer continuous descheduling, the chart also supports kind: Deployment.
Configure your descheduler policy
Work through the deschedulerPolicy section of the values, enabling or disabling strategies to suit your cluster (see the sketch below).
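As a hedged sketch (assuming you want continuous descheduling and this particular strategy; the thresholds are illustrative, not recommendations), the relevant values tweaks might look like:

```yaml
kind: Deployment # chart default is CronJob; Deployment deschedules continuously
deschedulerPolicy:
  strategies:
    LowNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds: # nodes below ALL of these are under-utilized, candidates to receive pods
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds: # nodes above ANY of these are over-utilized, candidates for eviction
            cpu: 50
            memory: 50
            pods: 50
```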
Summary
What have we achieved? We've got descheduler running in the cluster, evicting pods per our chosen policy, so that the scheduler can place them onto more suitable nodes.
Created:
descheduler running and ready to "deschedulerate"!
Yes, the lower-case thing bothers me too. That's how the official docs do it though, so I'm following suit. ↩