If you are running Ceph Squid with a mix of managed and unmanaged OSDs, there is one thing worth stating clearly upfront:

There is no in-place adoption mechanism in Squid.

You cannot “import” an existing OSD into the orchestrator. If an OSD was deployed manually or predates your orchestrator setup, the only supported path is remove and redeploy.

This is not obvious from the documentation, and it regularly surprises operators during upgrades or cleanup phases. The good news is that the migration is straightforward if you are methodical and patient.

This post documents the exact process I usually follow in production.


Why this matters

Unmanaged OSDs create long-term operational friction:

  • inconsistent lifecycle management
  • drift between hosts
  • harder upgrades
  • partial visibility in ceph orch

If you want reproducibility and predictable operations, all OSDs must be orchestrator-managed. There is no shortcut.


High-level approach

The migration consists of:

  1. Identifying unmanaged OSDs and their backing devices
  2. Gracefully evacuating data
  3. Removing OSDs cleanly from the cluster
  4. Fully wiping the underlying devices
  5. Redeploying OSDs using a DriveGroup
  6. Verifying cluster health and placement

This is destructive by design. Data safety depends entirely on waiting for rebalancing to complete before moving on.


1. Identify unmanaged OSDs

Start by listing OSD services and daemons:

ceph orch ls --service-type osd
ceph orch ps --daemon-type osd

Cross-check with the CRUSH tree:

ceph osd tree

To map OSD IDs to devices and hosts:

ceph device ls
ceph osd metadata <osd-id> | grep -e '"hostname"' -e '"devices"'

On a given host:

h=<hostname>   # short hostname of the target host, as it appears in `ceph device ls`
listosd=$(ceph device ls | grep "${h}" | grep osd | awk -F '.' '{print $NF}' | awk '{print $1}' | sort -n)
osd_devices=$(ceph device ls | grep "${h}" | grep osd | awk -F "${h}:" '{print $NF}' | awk '{print $1}' | sort -n)

for osdid in ${listosd}; do
  ceph osd metadata ${osdid} | grep -e '"hostname"' -e '"devices"'
done

Take your time here. Any mistake propagates later.
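To sanity-check the extraction pipeline before acting on its output, you can feed it a captured sample of `ceph device ls` output instead of the live cluster. The device names and OSD ids below are made up for illustration:

```shell
#!/bin/sh
# Hypothetical snippet of `ceph device ls` output: DEVICE  HOST:DEV  DAEMONS
sample='SEAGATE_ST4000_AAA  node1:sdb  osd.12
SEAGATE_ST4000_BBB  node1:sdc  osd.13
SEAGATE_ST4000_CCC  node2:sdb  osd.14'

h=node1

# Same pipelines as above, reading from the sample instead of the cluster
listosd=$(echo "${sample}" | grep "${h}" | grep osd | awk -F '.' '{print $NF}' | awk '{print $1}' | sort -n)
osd_devices=$(echo "${sample}" | grep "${h}" | grep osd | awk -F "${h}:" '{print $NF}' | awk '{print $1}' | sort -n)

echo "osd ids for ${h}: $(echo ${listosd})"
echo "devices for ${h}: $(echo ${osd_devices})"
```

For the sample above, this should report osd ids 12 and 13 on devices sdb and sdc, and nothing from node2.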


2. Mark OSDs out and wait

Mark OSDs out and wait until the cluster is fully healthy again:

for osdid in ${listosd}; do
  ceph osd out ${osdid}
done

Do not proceed until:

  • no degraded objects
  • no reduced data availability

Patience here saves incidents later.
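One way to enforce that wait is to poll `ceph health` until the cluster reports `HEALTH_OK` again. A minimal sketch (the 30-second interval is an arbitrary choice, and in production you may want a timeout and to tolerate unrelated `HEALTH_WARN` causes, e.g. flags you set yourself):

```shell
# Block until `ceph health` reports HEALTH_OK, polling every 30 seconds.
wait_for_health_ok() {
  while [ "$(ceph health 2>/dev/null | awk '{print $1}')" != "HEALTH_OK" ]; do
    echo "cluster not healthy yet, waiting..."
    sleep 30
  done
  echo "cluster is HEALTH_OK"
}
```

Call `wait_for_health_ok` between each destructive step rather than eyeballing `ceph -s`.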


3. Mark OSDs down

Once data has migrated:

for osdid in ${listosd}; do
  ceph osd down ${osdid}
done

4. Stop OSD daemons on the host

On each affected host:

systemctl list-units --no-legend --no-pager "ceph-*@osd.*"

clusterid=$(ceph fsid)   # cephadm names the units ceph-<fsid>@osd.<id>

for osdid in ${listosd}; do
  systemctl stop ceph-${clusterid}@osd.${osdid}
done

5. Remove OSDs from the cluster

Remove the OSDs definitively:

for osdid in ${listosd}; do
  ceph osd purge ${osdid} --yes-i-really-mean-it
done

for osdid in ${listosd}; do
  ceph orch daemon rm osd.${osdid} --force
done

At this point, the cluster no longer knows about these OSDs.


6. Clean the devices (properly)

This step is where most redeployments fail.

Basic zap:

ceph-volume lvm zap /dev/sdX --destroy
# or
sgdisk --zap-all /dev/sdX

For multiple devices:

for d in ${osd_devices}; do
  sgdisk --zap-all /dev/${d}
done

If the zap is not sufficient, nuke the drives:

for dm in $(dmsetup ls | grep '^ceph--' | awk '{print $1}'); do
  dmsetup remove ${dm}
done

for vg in $(vgdisplay -s | grep '"ceph-' | awk '{print $1}' | tr -d '"'); do
  vgremove --force $vg
done

for d in ${osd_devices}; do
  pvremove /dev/${d}
  wipefs -af /dev/${d}
  sfdisk --delete /dev/${d}
  partprobe /dev/${d}
done

If Ceph sees any leftover metadata, the redeploy may silently fail.
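A quick way to catch such leftovers before redeploying is to count the ceph-tagged device-mapper entries that survived the wipe. A sketch (run as root on the OSD host; on a genuinely clean minimal system `dmsetup` may report nothing at all, which is fine):

```shell
# Count remaining ceph-* device-mapper entries; anything > 0 means the
# wipe was incomplete and the redeploy is likely to skip the device.
leftover=$(dmsetup ls 2>/dev/null | grep -c '^ceph--' || true)

if [ "${leftover:-0}" -eq 0 ]; then
  echo "clean: no ceph device-mapper entries remain"
else
  echo "WARNING: ${leftover} ceph device-mapper entries still present"
fi
```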


7. Redeploy using a DriveGroup

Create a DriveGroup YAML per host:

cat << EOF > ceph-osd-${h}.yaml
service_type: osd
service_id: osd-${h}
placement:
  hosts:
    - ${h}
data_devices:
  paths:
EOF

for dev in ${osd_devices}; do
  echo "    - /dev/${dev}" >> ceph-osd-${h}.yaml
done
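For a hypothetical host `node1` with data devices `sdb` and `sdc`, the generated spec would look like this (depending on the cephadm version, `data_devices` may need to sit under a `spec:` key; check the `ceph orch apply` output if the devices are not picked up):

```yaml
service_type: osd
service_id: osd-node1
placement:
  hosts:
    - node1
data_devices:
  paths:
    - /dev/sdb
    - /dev/sdc
```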

Apply it:

ceph orch apply -i ceph-osd-${h}.yaml

Optional: set a device class if needed:

mydevclass=nvme-all
listosd=$(ceph device ls | grep "${h}" | grep osd | awk -F '.' '{print $NF}' | awk '{print $1}' | sort -n)

for osdid in ${listosd}; do
  ceph osd crush set-device-class ${mydevclass} ${osdid}
done

8. Verify

Final checks:

ceph orch ps --daemon-type osd
ceph osd tree
ceph -s

You should now have fully managed OSDs with consistent lifecycle control.


Final notes

  • There is no shortcut in Squid today
  • Plan the maintenance window
  • Double-check devices before wiping
  • Wait for full rebalance at every step

This process is boring, repetitive, and safe — exactly what you want when touching storage.
