Alpha Feature in Production… Good Idea?

The most promising approach, journal-based mirroring, offers near real-time replication and faster failover. However, it’s currently an alpha feature in the Ceph CSI driver and relies on rbd-nbd, which introduces significant risks:

  • If the CSI plugin pods restart, the effect is the same as a node reboot: all RBD mount points vanish, disrupting workloads.
  • rbd-nbd is also more resource-intensive than the recommended krbd mounter, consuming additional CPU and memory.
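For context, here is a minimal sketch of how journal-based mirroring is typically enabled on an RBD image using the rbd CLI, wrapped in Python. The pool and image names are illustrative, and in a Ceph CSI / Rook deployment this is normally driven by the operator and its CRDs rather than run by hand:

    import subprocess

    POOL = "replicapool"        # illustrative pool name
    IMAGE = "csi-vol-example"   # illustrative RBD image backing a PV

    def rbd(*args: str) -> None:
        """Run an rbd CLI command and stop on the first failure."""
        subprocess.run(["rbd", *args], check=True)

    # Journal-based mirroring needs the journaling image feature
    # (exclusive-lock, its prerequisite, is assumed to be enabled already).
    rbd("feature", "enable", f"{POOL}/{IMAGE}", "journaling")
    # Enable per-image mirroring on the pool, then mirror this image in journal mode.
    rbd("mirror", "pool", "enable", POOL, "image")
    rbd("mirror", "image", "enable", f"{POOL}/{IMAGE}", "journal")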

For documentation on Ceph RBD mirroring, see the IBM documentation.

For details, see the rbd-nbd design proposal.

Why Snapshot-Based Mirroring Is the Safer Option

Until journal-based mirroring matures, snapshot-based mirroring is the only viable option for production. It works by taking periodic snapshots and replicating them to a standby Ceph cluster. However, it has limitations:

  • Data loss risk: Changes between snapshots are not captured.
  • Manual failover: Requires reinstalling the Ceph CSI driver, recreating PVs, and promoting standby volumes to primary (as sketched below).

While stable, it’s far from seamless. Learn more in the RBD mirroring design proposal.
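To make the promotion step concrete, below is a minimal sketch of forcing the standby images to primary with the rbd CLI, wrapped in Python. The pool name is illustrative, promotion is forced because the original primary site is assumed unreachable, and the driver and PV steps listed above still have to be handled separately:

    import subprocess

    POOL = "replicapool"  # illustrative pool name on the standby cluster

    def rbd(*args: str) -> str:
        """Run an rbd CLI command against the standby cluster and return its stdout."""
        result = subprocess.run(["rbd", *args], check=True, capture_output=True, text=True)
        return result.stdout

    # The original primary site is assumed down, so promotion has to be forced.
    # Every mirrored image backing a PV must be promoted before PVs are
    # recreated against the standby cluster.
    for image in rbd("ls", POOL).split():
        rbd("mirror", "image", "promote", f"{POOL}/{image}", "--force")
        print(f"promoted {POOL}/{image}")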

Do You Really Need RBD Mirroring?

With RBD mirroring, the raw storage overhead is 6x for a two-zone design: each zone runs its own cluster with replication=3 and min_size=2.

Another compelling alternative under a two-zone constraint is a "2.5 zones" layout: a single Ceph cluster whose control plane spans three zones, with a limited resource footprint and no sensitive user data in the third zone, combined with a storage plane spread across two zones. The advantage of this design is that it can sustain the loss of one zone with replication=4 and min_size=2, since each storage zone holds two copies of the data.
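A back-of-the-envelope comparison of the raw capacity needed by the two designs, assuming an illustrative 100 TiB of usable data:

    usable_tib = 100  # illustrative usable capacity target

    # Design 1: RBD mirroring between two independent clusters (one per zone),
    # each pool with replication=3, min_size=2.
    raw_mirroring = usable_tib * 3 * 2   # 3 copies per zone x 2 zones = 6x -> 600 TiB

    # Design 2: "2.5 zones" - a single stretched cluster with replication=4,
    # min_size=2, two copies pinned to each storage zone (the third zone only
    # hosts the control plane).
    raw_stretched = usable_tib * 4       # 4x -> 400 TiB

    print(f"mirroring: {raw_mirroring} TiB raw vs 2.5 zones: {raw_stretched} TiB raw")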

What’s Next?

The Ceph community is working on solutions like the RBD Volume Healer, which could simplify failover and recovery.

What’s Your Experience?

Are you using Ceph RBD mirroring in production? How are you handling its current limitations? Let’s discuss.
