So you've got multiple nodes in your Kubernetes cluster, you throw a bunch of workloads in there, and Kubernetes schedules the workloads onto the nodes, making sensible choices based on load, affinity, etc.
Note that this scheduling only happens when a pod is created. Once a pod has been scheduled to a node, Kubernetes won't take it away from that node. This can result in "sub-optimal" node loading, especially if you're elastically expanding your nodes themselves, or working through rolling updates.
Descheduler is used to rebalance clusters by evicting pods that can potentially be scheduled on better nodes.
Here are some reasons you might need to rebalance your cluster:
Some nodes are under- or over-utilized.
The original scheduling decision no longer holds true, because taints or labels have been added to or removed from nodes, and pod/node affinity requirements are no longer satisfied.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
Descheduler works by "kicking out" (evicting) certain pods, based on a policy you feed it, depending on what you want to achieve. (You may want to converge as many pods as possible onto as few nodes as possible, or to distribute load more evenly across a static set of nodes.)
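For illustration, here's a minimal sketch of what such a policy can look like, using descheduler's v1alpha1 policy format. The choice of strategies here is just an example, not a recommendation:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  RemoveDuplicates:
    enabled: true # spread duplicate replicas of the same ReplicaSet across nodes
  RemovePodsViolatingNodeAffinity:
    enabled: true
    params:
      nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
```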
We need a namespace to deploy our HelmRelease and associated YAMLs into. Per the flux design, I create this example YAML in my flux repo at /bootstrap/namespaces/namespace-descheduler.yaml:
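A minimal namespace YAML looks something like this:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: descheduler
```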
We're going to install the Descheduler helm chart from the descheduler repository, so I create the following in my flux repo (assuming it doesn't already exist):
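A HelmRepository pointing at the upstream chart repo might look like this sketch (the kubernetes-sigs URL is where the chart is published at the time of writing):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: descheduler
  namespace: flux-system
spec:
  interval: 15m
  url: https://kubernetes-sigs.github.io/descheduler/
```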
Now that the "global" elements of this deployment (just the HelmRepository in this case) have been defined, we do some "flux-ception", and go one layer deeper, adding another Kustomization, telling flux to deploy any YAMLs found in the repo at /descheduler/. I create this example Kustomization in my flux repo:
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: descheduler
  namespace: flux-system
spec:
  interval: 30m
  path: ./descheduler
  prune: true # remove any elements later removed from the above path
  timeout: 10m # if not set, this defaults to interval duration, which is 1h
  sourceRef:
    kind: GitRepository
    name: flux-system
  healthChecks:
    - apiVersion: helm.toolkit.fluxcd.io/v2beta1
      kind: HelmRelease
      name: descheduler
      namespace: descheduler
```
Descheduler HelmRelease
Lastly, having set the scene above, we define the HelmRelease which will actually deploy descheduler into the cluster. We start with a basic HelmRelease YAML, like this example:
/descheduler/helmrelease-descheduler.yaml
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: descheduler
  namespace: descheduler
spec:
  chart:
    spec:
      chart: descheduler
      version: 0.27.x # auto-update to semver bugfixes only
      sourceRef:
        kind: HelmRepository
        name: descheduler
        namespace: flux-system
  interval: 15m
  timeout: 5m
  releaseName: descheduler
  values: # paste contents of upstream values.yaml below, indented 4 spaces
```
If we deploy this HelmRelease as-is, we'll inherit every default from the upstream Descheduler helm chart. That's rarely what we want, so my preference is to take the entire contents of the Descheduler helm chart's values.yaml, and paste it (indented) under the values key. This way I can make my own changes in the context of the entire values.yaml, rather than cherry-picking just the items I want to change, which makes future chart upgrades simpler.
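To illustrate the nesting, the top of the pasted values ends up looking something like this (the kind and schedule keys shown are the upstream defaults at the time of writing):

```yaml
spec:
  # ... chart spec as above ...
  values:
    kind: CronJob # upstream default
    schedule: "*/2 * * * *"
```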
Why not put values in a separate ConfigMap?
Didn't you previously advise to put helm chart values into a separate ConfigMap?
Yes, I did. And in practice, I've changed my mind.
Why? Because having the helm values directly in the HelmRelease offers the following advantages:
If you use the YAML extension in VSCode, you'll see a full path to the YAML elements, which can make grokking complex charts easier.
When flux detects a change to a value in a HelmRelease, this forces an immediate reconciliation of the HelmRelease, as opposed to the ConfigMap solution, which requires waiting on the next scheduled reconciliation.
Renovate can parse HelmRelease YAMLs and create PRs when they contain docker image references which can be updated.
In practice, adapting a HelmRelease to match upstream chart changes is no different to adapting a ConfigMap, and so there's no real benefit to splitting the chart values into a separate ConfigMap, IMO.
Then work your way through the values you pasted, and change any which are specific to your configuration.
Install Descheduler!
Commit the changes to your flux repository, and either wait for the reconciliation interval, or force a reconciliation using flux reconcile source git flux-system. You should see the kustomization appear...
The following sections detail suggested changes to the values pasted into /descheduler/helmrelease-descheduler.yaml from the Descheduler helm chart's values.yaml. The values are already indented correctly to be copied, pasted into the HelmRelease, and adjusted as necessary.
Tip
At the time of writing, the chart runs descheduler as a CronJob by default, and ships a default policy under the deschedulerPolicy key of its values.yaml. If you'd prefer continuous descheduling, the chart also supports kind: Deployment.
Configure your descheduler policy
Work through the deschedulerPolicy section of the values, enabling or disabling strategies to suit your cluster (see the sketch below).
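As a hedged sketch (assuming you want continuous descheduling and this particular strategy; the thresholds are illustrative, not recommendations), the relevant values tweaks might look like:

```yaml
kind: Deployment # chart default is CronJob; Deployment deschedules continuously
deschedulerPolicy:
  strategies:
    LowNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds: # nodes below ALL of these are under-utilized, candidates to receive pods
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds: # nodes above ANY of these are over-utilized, candidates for eviction
            cpu: 50
            memory: 50
            pods: 50
```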
Summary
What have we achieved? We've got descheduler running in the cluster, evicting pods per our chosen policy, so that the scheduler can place them onto more suitable nodes.
Created:
descheduler running and ready to "deschedulerate"!
Yes, the lower-case thing bothers me too. That's how the official docs do it though, so I'm following suit. ↩