Persistent storage in Kubernetes with Rook Ceph / CephFS - Cluster
Ceph is a highly-reliable, scalable network storage platform which uses individual disks across participating nodes to provide fault-tolerant storage.
Rook provides an operator for Ceph, decomposing the 10-year-old, at-times-arcane platform into cloud-native components, created declaratively, with their lifecycle managed by the operator.
In the previous recipe, we deployed the operator. Now, to actually deploy a Ceph cluster, we need to create a custom resource (a "CephCluster"), which will instruct the operator on how we'd like our cluster to be deployed.
We'll end up with multiple storageClasses which we can use to allocate storage to pods from either Ceph RBD (block storage), or CephFS (a mounted filesystem). In many cases, CephFS is a useful choice, because it can be mounted by more than one pod at the same time, which makes it suitable for apps which need to share access to the same data (NZBGet, Sonarr, and Plex, for example).
Rook Ceph Cluster requirements
Ingredients
Already deployed:
- A Kubernetes cluster
- Flux deployment process bootstrapped
- Rook Ceph's Operator
Preparation
Namespace
We already deployed a `rook-ceph` namespace when deploying the Rook Ceph Operator, so we don't need to create this again.[^1]
HelmRepository
Likewise, we'll install the `rook-ceph-cluster` helm chart from the same Rook-managed repository as we did the `rook-ceph` (operator) chart, so we don't need to create a new HelmRepository.
Kustomization
We do, however, need a separate Kustomization for `rook-ceph-cluster`, telling flux to deploy any YAMLs found in the repo at `/rook-ceph-cluster`. I create this example Kustomization in my flux repo:
Why a separate Kustomization if both are needed for rook-ceph?
While technically we could use the same Kustomization to deploy both `rook-ceph` and `rook-ceph-cluster`, we'd run into dependency issues. It's simpler and cleaner to deploy `rook-ceph` first, and then list it as a dependency for `rook-ceph-cluster`.
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: rook-ceph-cluster--rook-ceph
  namespace: flux-system
spec:
  dependsOn:
    - name: "rook-ceph" # (1)!
  interval: 30m
  path: ./rook-ceph-cluster
  prune: true # remove any elements later removed from the above path
  timeout: 10m # if not set, this defaults to the interval duration (30m, above)
  sourceRef:
    kind: GitRepository
    name: flux-system
```
- Note that we use `spec.dependsOn` to ensure that this Kustomization is only applied after the rook-ceph operator is deployed and operational. This ensures that the necessary CRDs are in place, and avoids a dry-run error on reconciliation.
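For reference, the `rook-ceph` name above must match the Kustomization created in the previous (operator) recipe. As a rough sketch (assuming the operator's YAMLs live at `/rook-ceph` in your repo, per that recipe - adjust names/paths to match your own layout), that Kustomization would look something like:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: rook-ceph # must match the dependsOn entry above
  namespace: flux-system
spec:
  interval: 30m
  path: ./rook-ceph # where the operator's YAMLs live
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```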
Fast-track your fluxing! 🚀
Is crafting all these YAMLs by hand too much of a PITA?
"Premix" is a git repository, which includes an ansible playbook to auto-create all the necessary files in your flux repository, for each chosen recipe!
Let the machines do the TOIL!
ConfigMap
Now we're into the app-specific YAMLs. First, we create a ConfigMap containing the entire contents of the helm chart's values.yaml. Paste the values into a `values.yaml` key as illustrated below, indented 4 spaces (since they're "encapsulated" within the ConfigMap YAML). I create this example YAML in my flux repo:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-cluster-helm-chart-value-overrides
  namespace: rook-ceph
data:
  values.yaml: |- # (1)!
    # <upstream values go here>
```
- Paste in the contents of the upstream `values.yaml` here, indented 4 spaces, and then change the values you need as illustrated below.
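If you'd rather not copy/paste the upstream values from GitHub, one way to grab them (assuming you have the `helm` CLI handy, and using the same chart version as the HelmRelease below) is:

```bash
# Add the Rook chart repository locally, then dump the chart's default values to a file
helm repo add rook-release https://charts.rook.io/release
helm show values rook-release/rook-ceph-cluster --version 1.9.9 > values.yaml
```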
Here are some suggested changes to the defaults which you should consider:
```yaml
toolbox:
  enabled: true # (1)!
monitoring:
  # enabling will also create RBAC rules to allow Operator to create ServiceMonitors
  enabled: true # (2)!
  # whether to create the prometheus rules
  createPrometheusRules: true # (3)!
pspEnable: false # (4)!
ingress:
  dashboard: {} # (5)!
```
- It's useful to have a "toolbox" pod to shell into to run ceph CLI commands
- Consider enabling if you already have Prometheus installed, so that the operator can create ServiceMonitors for the cluster
- Consider enabling if you already have Prometheus installed, to get useful alerting rules for Ceph
- PSPs are deprecated, and are removed entirely in Kubernetes 1.25, at which point leaving this enabled will cause breakage
- Customize the ingress configuration for your dashboard (see the sketch below)
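As a sketch of that last point (the hostname, ingress class, and TLS secret here are placeholders - substitute values appropriate to your own ingress setup), a customized dashboard ingress might look like:

```yaml
ingress:
  dashboard:
    ingressClassName: nginx # match your ingress controller
    host:
      name: rook-ceph.example.com
      path: "/"
    tls:
      - hosts:
          - rook-ceph.example.com
        secretName: rook-ceph-dashboard-tls # an existing TLS secret, or one created by cert-manager
```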
Further to the above, decide which disks you want to dedicate to Ceph, and add them to the `cephClusterSpec` section.
The default configuration (below) will cause the operator to use any un-formatted disks found on any of your nodes. If this is what you want to happen, then you don't need to change anything.
```yaml
cephClusterSpec:
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: true
```
If you'd rather be a little more selective / declarative about which disks are used in a homogeneous cluster, you could consider using `deviceFilter`, like this:
```yaml
cephClusterSpec:
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: false
    deviceFilter: sdc #(1)!
```
- A regex to use to filter target devices found on each node
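`deviceFilter` is a regular expression matched against device names (not full paths). Some illustrative patterns, along the lines of the upstream CephCluster examples (sketches only - verify against the device names on your own nodes):

```yaml
# deviceFilter: "sdb"      # only use a device named sdb
# deviceFilter: "^sd."     # use all devices whose names begin with "sd"
# deviceFilter: "^sd[a-d]" # use sda, sdb, sdc and sdd, if present
# deviceFilter: "^nvme"    # use all NVMe devices
```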
If your cluster nodes are a little more snowflakey, here's a more complex example:
```yaml
cephClusterSpec:
  storage: # cluster level storage configuration and selection
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: "teeny-tiny-node"
        deviceFilter: "." #(1)!
      - name: "bigass-node"
        devices:
          - name: "/dev/disk/by-path/pci-0000:01:00.0-sas-exp0x500404201f43b83f-phy11-lun-0" #(2)!
            config:
              metadataDevice: "/dev/osd-metadata/11"
          - name: "nvme0n1" #(3)!
          - name: "nvme1n1"
```
- Match any devices found on this node
- Match a very-specific device path, and pair this device with a faster device for OSD metadata
- Match devices with simple regex string matches
HelmRelease
Finally, having set the scene above, we define the HelmRelease which will actually deploy the rook-ceph-cluster chart (and with it, our Ceph cluster). I save this in my flux repo:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: rook-ceph-cluster
  namespace: rook-ceph
spec:
  chart:
    spec:
      chart: rook-ceph-cluster
      version: 1.9.x
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  interval: 30m
  timeout: 10m
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: -1 # keep trying to remediate
    crds: CreateReplace # Upgrade CRDs on package update
  releaseName: rook-ceph-cluster
  valuesFrom:
    - kind: ConfigMap
      name: rook-ceph-cluster-helm-chart-value-overrides
      valuesKey: values.yaml # (1)!
```
- This is the default, but best to be explicit for clarity
Install Rook Ceph Cluster!
Commit the changes to your flux repository, and either wait for the reconciliation interval, or force a reconciliation using `flux reconcile source git flux-system`. You should see the kustomization appear...
~ ❯ flux get kustomizations rook-ceph-cluster
NAME READY MESSAGE REVISION SUSPENDED
rook-ceph-cluster True Applied revision: main/345ee5e main/345ee5e False
~ ❯
The helmrelease should be reconciled...
~ ❯ flux get helmreleases -n rook-ceph rook-ceph-cluster
NAME READY MESSAGE REVISION SUSPENDED
rook-ceph-cluster True Release reconciliation succeeded v1.9.9 False
~ ❯
And you should have happy rook-ceph operator pods:
~ ❯ k get pods -n rook-ceph -l app=rook-ceph-operator
NAME READY STATUS RESTARTS AGE
rook-ceph-operator-7c94b7446d-nwsss 1/1 Running 0 5m14s
~ ❯
To watch the operator do its magic, you can tail its logs, using:
k logs -n rook-ceph -f -l app=rook-ceph-operator
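As the operator reconciles the CephCluster, you should see mon, mgr, and OSD pods appearing alongside it in the namespace; a plain kubectl watch is enough to follow along:

```bash
# Watch the cluster components (mons, mgr, OSDs, CSI drivers) come up
kubectl get pods -n rook-ceph -w
```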
You can get or describe the status of your cephcluster:
~ ❯ k get cephclusters.ceph.rook.io -n rook-ceph
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL
rook-ceph /var/lib/rook 3 6d22h Ready Cluster created successfully HEALTH_OK
~ ❯
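Since we enabled the toolbox in the helm values earlier, you can also query Ceph directly from inside the toolbox pod; assuming the chart's default deployment name of `rook-ceph-tools`, something like this should work:

```bash
# Check overall cluster health from the toolbox pod
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
```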
How do I know it's working?
So we have a ceph cluster now, but how do we know we can actually provision volumes?
Create PVCs
Create two ceph-block PVCs (persistent volume claims), by running:
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc-1
  labels:
    test: ceph
    funkypenguin-is: a-smartass
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block
  resources:
    requests:
      storage: 128Mi
EOF
```
And:
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-block-pvc-2
  labels:
    test: ceph
    funkypenguin-is: a-smartass
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-block
  resources:
    requests:
      storage: 128Mi
EOF
```
Now create a ceph-filesystem (RWX) PVC, by running:
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-filesystem-pvc
  labels:
    test: ceph
    funkypenguin-is: a-smartass
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ceph-filesystem
  resources:
    requests:
      storage: 128Mi
EOF
```
Examine the PVCs by running:
kubectl get pvc -l test=ceph
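Depending on the storageClass's volume binding mode, the PVCs will either bind immediately or remain Pending until a pod consumes them. If a PVC stays Pending even after the pods below are created, describing it will usually surface the provisioning error, e.g.:

```bash
# Show status and events for one of the test PVCs
kubectl describe pvc ceph-block-pvc-1
```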
Create Pod
Now create pods to consume the PVCs, by running:
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ceph-test-1
  labels:
    test: ceph
    funkypenguin-is: a-smartass
spec:
  containers:
    - name: volume-test
      image: nginx:stable-alpine
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - name: ceph-block-is-rwo
          mountPath: /rwo
        - name: ceph-filesystem-is-rwx
          mountPath: /rwx
      ports:
        - containerPort: 80
  volumes:
    - name: ceph-block-is-rwo
      persistentVolumeClaim:
        claimName: ceph-block-pvc-1
    - name: ceph-filesystem-is-rwx
      persistentVolumeClaim:
        claimName: ceph-filesystem-pvc
EOF
```
And:
```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ceph-test-2
  labels:
    test: ceph
    funkypenguin-is: a-smartass
spec:
  containers:
    - name: volume-test
      image: nginx:stable-alpine
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - name: ceph-block-is-rwo
          mountPath: /rwo
        - name: ceph-filesystem-is-rwx
          mountPath: /rwx
      ports:
        - containerPort: 80
  volumes:
    - name: ceph-block-is-rwo
      persistentVolumeClaim:
        claimName: ceph-block-pvc-2
    - name: ceph-filesystem-is-rwx
      persistentVolumeClaim:
        claimName: ceph-filesystem-pvc
EOF
```
Ensure the pods have started successfully (this indicates the PVCs were correctly attached) by running:
kubectl get pod -l test=ceph
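Because both pods mount the same `ceph-filesystem-pvc` at /rwx, an optional extra check is to prove the ReadWriteMany behaviour by writing a file from one pod and reading it from the other (the filename here is arbitrary):

```bash
# Write a file to the shared CephFS volume from the first pod...
kubectl exec ceph-test-1 -- sh -c 'echo "hello from ceph-test-1" > /rwx/hello.txt'
# ...and read it back from the second pod
kubectl exec ceph-test-2 -- cat /rwx/hello.txt
```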
Clean up
Assuming that the pods are in a Running state, then Rook Ceph is working!
Clean up your mess, little bare-metal-cave-monkey, by running:
kubectl delete pod -l funkypenguin-is=a-smartass
kubectl delete pvc -l funkypenguin-is=a-smartass #(1)!
- Label selectors are powerful!
View Ceph Dashboard
Assuming you have an Ingress Controller set up, and you've either picked a default IngressClass or defined the dashboard ingress appropriately, you should be able to access your Ceph Dashboard at the URL identified by the ingress (this is a good opportunity to check that the ingress deployed correctly):
~ ❯ k get ingress -n rook-ceph
NAME CLASS HOSTS ADDRESS PORTS AGE
rook-ceph-mgr-dashboard nginx rook-ceph.batcave.awesome.me 172.16.237.1 80, 443 177d
~ ❯
The dashboard credentials are automatically generated for you by the operator, and stored in a Kubernetes secret. The username is admin; to retrieve the password, run:
kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o \
jsonpath="{['data']['password']}" | base64 --decode && echo
Summary
What have we achieved? We now have a Ceph cluster, deployed and managed by the operator from the previous recipe, along with storageClasses which let our pods consume block (Ceph RBD) and shared-filesystem (CephFS) storage!
Summary
Created:
- Ceph cluster has been deployed
- StorageClasses are available so that the cluster storage can be consumed by your pods
- Pretty graphs are viewable in the Ceph Dashboard
Chef's notes 📓
[^1]: Unless you wanted to deploy your cluster components in a separate namespace to the operator, of course!
Tip your waiter (sponsor) 👏
Did you receive excellent service? Want to compliment the chef? (..and support development of current and future recipes!) Sponsor me on Github / Ko-Fi / Patreon, or see the contribute page for more (free or paid) ways to say thank you! 👏
Employ your chef (engage) 🤝
Is this too much of a geeky PITA? Do you just want results, stat? I do this for a living - I'm a full-time Kubernetes contractor, providing consulting and engineering expertise to businesses needing short-term, short-notice support in the cloud-native space, including AWS/Azure/GKE, Kubernetes, CI/CD and automation.
Learn more about working with me here.
Flirt with waiter (subscribe) 💌
Want to know now when this recipe gets updated, or when future recipes are added? Subscribe to the RSS feed, or leave your email address below, and we'll keep you updated.