
Persistent storage in Kubernetes with Rook Ceph / CephFS - Cluster

Ceph is a highly-reliable, scalable network storage platform which uses individual disks across participating nodes to provide fault-tolerant storage.

Rook provides an operator for Ceph, decomposing the 10-year-old, at-times-arcane platform into cloud-native components which are created declaratively, and whose lifecycle is managed by the operator.

In the previous recipe, we deployed the operator. Now, to actually deploy a Ceph cluster, we need to deploy a custom resource (a "CephCluster"), which will instruct the operator on how we'd like our cluster to be deployed.

We'll end up with multiple storageClasses which we can use to allocate storage to pods from either Ceph RBD (block storage) or CephFS (a shared filesystem). In many cases, CephFS is a useful choice, because it can be mounted from more than one pod at the same time, which makes it suitable for apps which need to share access to the same data (NZBGet, Sonarr, and Plex, for example).
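Once the cluster described below is deployed, you can confirm which storageClasses you ended up with, by running:

kubectl get storageclass

With default chart values, expect to see at least ceph-block and ceph-filesystem (the names used in the examples later in this recipe), although the exact set depends on your values.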

Rook Ceph Cluster requirements

Ingredients

Already deployed:

Preparation

Namespace

We already deployed a rook-ceph namespace when deploying the Rook Ceph Operator, so we don't need to create this again 👍 1

HelmRepository

Likewise, we'll install the rook-ceph-cluster helm chart from the same Rook-managed repository as we did the rook-ceph (operator) chart, so we don't need to create a new HelmRepository.
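For reference, the HelmRepository created back in the operator recipe looks something like this (a sketch; your apiVersion and interval may differ depending on how you deployed it):

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: rook-release
  namespace: flux-system
spec:
  interval: 15m
  url: https://charts.rook.io/release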

Kustomization

We do, however, need a separate Kustomization for rook-ceph-cluster, telling flux to deploy any YAMLs found in the repo at /rook-ceph-cluster. I create this example Kustomization in my flux repo:

Why a separate Kustomization if both are needed for rook-ceph?

While technically we could use the same Kustomization to deploy both rook-ceph and rook-ceph-cluster, we'd run into dependency issues. It's simpler and cleaner to deploy rook-ceph first, and then list it as a dependency for rook-ceph-cluster.

/bootstrap/kustomizations/kustomization-rook-ceph-cluster.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: rook-ceph-cluster--rook-ceph
  namespace: flux-system
spec:
  dependsOn: 
  - name: "rook-ceph" # (1)!
  interval: 30m
  path: ./rook-ceph-cluster
  prune: true # remove any elements later removed from the above path
  timeout: 10m # if not set, this defaults to the interval duration (30m above)
  sourceRef:
    kind: GitRepository
    name: flux-system
  1. Note that we use the spec.dependsOn to ensure that this Kustomization is only applied after the rook-ceph operator is deployed and operational. This ensures that the necessary CRDs are in place, and avoids a dry-run error on the reconciliation.
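For the dependency to resolve, the name above must match the Kustomization created in the operator recipe, which would look something like this (a sketch; adjust to match what you actually deployed there):

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: rook-ceph
  namespace: flux-system
spec:
  interval: 30m
  path: ./rook-ceph
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system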

Fast-track your fluxing! 🚀

Is crafting all these YAMLs by hand too much of a PITA?

"Premix" is a git repository, which includes an ansible playbook to auto-create all the necessary files in your flux repository, for each chosen recipe!

Let the machines do the TOIL! 🏋️‍♂️

ConfigMap

Now we're into the app-specific YAMLs. First, we create a ConfigMap, containing the entire contents of the helm chart's values.yaml. Paste the values into a values.yaml key as illustrated below, indented 4 spaces (since they're "encapsulated" within the ConfigMap YAML). I create this example yaml in my flux repo:

/rook-ceph-cluster/configmap-rook-ceph-cluster-helm-chart-value-overrides.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-cluster-helm-chart-value-overrides
  namespace: rook-ceph
data:
  values.yaml: |-  # (1)!
    # <upstream values go here>
  1. Paste in the contents of the upstream values.yaml here, indented 4 spaces, and then change the values you need as illustrated below.
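If you don't have the upstream values.yaml handy, you can pull it straight from the chart repository with helm (a sketch; add --version if you want to pin the same chart version as the HelmRelease below):

helm show values rook-ceph-cluster --repo https://charts.rook.io/release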

Here are some suggested changes to the defaults which you should consider:

toolbox:
  enabled: true # (1)!
monitoring:
  # enabling will also create RBAC rules to allow Operator to create ServiceMonitors
  enabled: true # (2)!
  # whether to create the prometheus rules
  createPrometheusRules: true # (3)!
pspEnable: false # (4)!
ingress:
  dashboard: {} # (5)!
  1. It's useful to have a "toolbox" pod to shell into to run ceph CLI commands
  2. Consider enabling if you already have Prometheus installed
  3. Consider enabling if you already have Prometheus installed, so that the chart also creates PrometheusRules for alerting
  4. PSPs are deprecated, and will be removed in Kubernetes 1.25, at which point leaving this enabled will cause breakage
  5. Customize the ingress configuration for your dashboard (see the sketch after this list)
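For illustration, a dashboard ingress might look something like the sketch below. The hostname and ingress class are placeholders, and you should confirm the exact keys against your chart version's values.yaml (newer charts also accept ingressClassName here):

ingress:
  dashboard:
    annotations:
      kubernetes.io/ingress.class: nginx # placeholder; match your ingress controller
    host:
      name: rook-ceph.example.com # placeholder hostname
      path: "/"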

Further to the above, decide which disks you want to dedicate to Ceph, and add them to the cephClusterSpec section.

The default configuration (below) will cause the operator to use any un-formatted disks found on any of your nodes. If this is what you want to happen, then you don't need to change anything.

cephClusterSpec:
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: true

If you'd rather be a little more selective / declarative about which disks are used in a homogeneous cluster, you could consider using deviceFilter, like this:

cephClusterSpec:
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: false
    deviceFilter: sdc #(1)!
  1. A regex to use to filter target devices found on each node

If your cluster nodes are a little more snowflakey ❄, here's a complex example:

cephClusterSpec:
  storage: # cluster level storage configuration and selection
    useAllNodes: false
    useAllDevices: false
    nodes:
    - name: "teeny-tiny-node"
      deviceFilter: "." #(1)!
    - name: "bigass-node"
      devices:
      - name: "/dev/disk/by-path/pci-0000:01:00.0-sas-exp0x500404201f43b83f-phy11-lun-0" #(2)!
        config:
          metadataDevice: "/dev/osd-metadata/11"
      - name: "nvme0n1" #(3)!
      - name: "nvme1n1"
  1. Match any devices found on this node
  2. Match a very-specific device path, and pair this device with a faster device for OSD metadata
  3. Match devices with simple regex string matches
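Not sure which device names or paths exist on a node? Running something like lsblk on the node itself will list candidate disks, their sizes, and whether they already carry a filesystem (remember from above that, by default, the operator will only consume un-formatted disks):

lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT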

HelmRelease

Finally, having set the scene above, we define the HelmRelease which will actually deploy the Ceph cluster (via the rook-ceph-cluster chart). I save this in my flux repo:

/rook-ceph-cluster/helmrelease-rook-ceph-cluster.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: rook-ceph-cluster
  namespace: rook-ceph
spec:
  chart:
    spec:
      chart: rook-ceph-cluster
      version: 1.9.x
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  interval: 30m
  timeout: 10m
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: -1 # keep trying to remediate
    crds: CreateReplace # Upgrade CRDs on package update
  releaseName: rook-ceph-cluster
  valuesFrom:
  - kind: ConfigMap
    name: rook-ceph-cluster-helm-chart-value-overrides
    valuesKey: values.yaml # (1)!
  1. This is the default, but best to be explicit for clarity

Install Rook Ceph Cluster!

Commit the changes to your flux repository, and either wait for the reconciliation interval, or force a reconciliation using flux reconcile source git flux-system. You should see the kustomization appear...

~  flux get kustomizations rook-ceph-cluster
NAME                READY   MESSAGE                         REVISION        SUSPENDED
rook-ceph-cluster   True    Applied revision: main/345ee5e  main/345ee5e    False
~ 

The helmrelease should be reconciled...

~  flux get helmreleases -n rook-ceph rook-ceph-cluster
NAME                READY   MESSAGE                             REVISION    SUSPENDED
rook-ceph-cluster   True    Release reconciliation succeeded    v1.9.9      False
~ 

And you should have happy rook-ceph operator pods:

~  k get pods -n rook-ceph -l app=rook-ceph-operator
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-7c94b7446d-nwsss   1/1     Running   0          5m14s
~ 

To watch the operator do its magic, you can tail its logs, using:

k logs -n rook-ceph -f -l app=rook-ceph-operator

You can get or describe the status of your cephcluster:

~  k get cephclusters.ceph.rook.io  -n rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE     PHASE   MESSAGE                        HEALTH      EXTERNAL
rook-ceph   /var/lib/rook     3          6d22h   Ready   Cluster created successfully   HEALTH_OK
~ 
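Since we enabled the toolbox in the chart values above, you can also ask Ceph itself how it's feeling, by exec'ing into the toolbox deployment (named rook-ceph-tools, assuming toolbox.enabled is set to true):

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status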

How do I know it's working?

So we have a ceph cluster now, but how do we know we can actually provision volumes?

Create PVCs

Create two ceph-block PVCs (persistent volume claim), by running:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc-1
  labels:
    test: ceph
    funkypenguin-is: a-smartass  
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block
  resources:
    requests:
      storage: 128Mi
EOF

And:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc-2
  labels:
    test: ceph
    funkypenguin-is: a-smartass  
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block
  resources:
    requests:
      storage: 128Mi
EOF

Now create a ceph-filesystem (RWX) PVC, by running:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-filesystem-pvc
  labels:
    test: ceph
    funkypenguin-is: a-smartass  
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ceph-filesystem
  resources:
    requests:
      storage: 128Mi
EOF

Examine the PVCs by running:

kubectl get pvc -l test=ceph

Create Pod

Now create pods to consume the PVCs, by running:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ceph-test-1
  labels:
    test: ceph
    funkypenguin-is: a-smartass  
spec:
  containers:
  - name: volume-test
    image: nginx:stable-alpine
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: ceph-block-is-rwo
      mountPath: /rwo
    - name: ceph-filesystem-is-rwx
      mountPath: /rwx
    ports:
    - containerPort: 80
  volumes:
  - name: ceph-block-is-rwo
    persistentVolumeClaim:
      claimName: ceph-block-pvc-1
  - name: ceph-filesystem-is-rwx
    persistentVolumeClaim:
      claimName: ceph-filesystem-pvc
EOF

And:

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ceph-test-2
  labels:
    test: ceph
    funkypenguin-is: a-smartass
spec:
  containers:
  - name: volume-test
    image: nginx:stable-alpine
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: ceph-block-is-rwo
      mountPath: /rwo
    - name: ceph-filesystem-is-rwx
      mountPath: /rwx
    ports:
    - containerPort: 80
  volumes:
  - name: ceph-block-is-rwo
    persistentVolumeClaim:
      claimName: ceph-block-pvc-2
  - name: ceph-filesystem-is-rwx
    persistentVolumeClaim:
      claimName: ceph-filesystem-pvc     
EOF

Ensure the pods have started successfully (this indicates the PVCs were correctly attached) by running:

kubectl get pod -l test=ceph
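Since both pods mount the same ceph-filesystem PVC at /rwx, you can also prove that RWX sharing really works, by writing a file from one pod and reading it back from the other:

kubectl exec ceph-test-1 -- sh -c 'echo "hello from ceph-test-1" > /rwx/hello'
kubectl exec ceph-test-2 -- cat /rwx/hello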

Clean up

Assuming that the pods are in a Running state, then Rook Ceph is working!

Clean up your mess, little bare-metal-cave-monkey 🐵, by running:

kubectl delete pod -l funkypenguin-is=a-smartass
kubectl delete pvc -l funkypenguin-is=a-smartass #(1)!
  1. Label selectors are powerful!

View Ceph Dashboard

Assuming you have an Ingress Controller set up, and you've either picked a default IngressClass or defined the dashboard ingress appropriately, you should be able to access your Ceph Dashboard at the URL identified by the ingress (this is a good opportunity to check that the ingress deployed correctly):

~  k get ingress -n rook-ceph
NAME                      CLASS   HOSTS                          ADDRESS        PORTS     AGE
rook-ceph-mgr-dashboard   nginx   rook-ceph.batcave.awesome.me   172.16.237.1   80, 443   177d
~ 

The dashboard credentials are automatically generated for you by the operator, and stored in a Kubernetes secret. To retrieve your credentials, run:

kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o \
jsonpath="{['data']['password']}" | base64 --decode && echo

Summary

What have we achieved? We now have a fully deployed Ceph cluster, with storageClasses we can use to provision block (RBD) or shared filesystem (CephFS) storage for our pods!


Created:

  • Ceph cluster has been deployed
  • StorageClasses are available so that the cluster storage can be consumed by your pods
  • Pretty graphs are viewable in the Ceph Dashboard

Chef's notes 📓


  1. Unless you wanted to deploy your cluster components in a separate namespace to the operator, of course! 

Tip your waiter (sponsor) 👏

Did you receive excellent service? Want to compliment the chef? (..and support development of current and future recipes!) Sponsor me on Github / Ko-Fi / Patreon, or see the contribute page for more (free or paid) ways to say thank you! 👏

Employ your chef (engage) 🤝

Is this too much of a geeky PITA? Do you just want results, stat? I do this for a living - I'm a full-time Kubernetes contractor, providing consulting and engineering expertise to businesses needing short-term, short-notice support in the cloud-native space, including AWS/Azure/GKE, Kubernetes, CI/CD and automation.

Learn more about working with me here.

Flirt with waiter (subscribe) 💌

Want to know now when this recipe gets updated, or when future recipes are added? Subscribe to the RSS feed, or leave your email address below, and we'll keep you updated.

Your comments? 💬