Ceph RBD — Where does it store (meta)data?
Ceph offers different functionality on top of the RADOS object store. One of them is the rados block device (RBD) layer. It offers virtual block devices (disks) that different hypervisors use to store the disks of virtual machines.
How the RBD layer uses the underlying object storage to store the actual data of the disk images and the necessary metadata is the topic of this blog post.
Disclaimer: This post sums up my current understanding, and I do not claim that it is 100% correct or exhaustive.
The following examples are done with Ceph installed on Proxmox VE in a hyperconverged manner.
We have one pool, called rbd, which stores the disk image of one virtual machine. By naming the pool rbd we can omit the pool name whenever we call the rbd utility, as it is the default pool if no other is provided.
Getting information from RBD itself
The rbd utility is the main CLI tool to interact with the RBD layer itself.
All the data it is showing us must be stored somewhere in the layers below.
Let's take a look.
The first thing that we can do, is to list all the images present:
```
root@cephtest:~# rbd ls
vm-100-disk-0
vm-100-disk-1
```
Right now, this is the main disk of our test VM and a second, currently empty disk image that we will use to manually store some data for demonstration purposes in a bit.
We can get more information on a specific image as well. Let's get them for the second disk image:
```
root@cephtest:~# rbd info vm-100-disk-1
rbd image 'vm-100-disk-1':
	size 2 GiB in 512 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 16353c4794276b
	block_name_prefix: rbd_data.16353c4794276b
	format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
	op_features:
	flags:
	create_timestamp: Fri Feb 10 19:29:09 2023
	access_timestamp: Fri Feb 10 20:02:37 2023
	modify_timestamp: Fri Feb 10 19:29:09 2023
```
We see quite a bit of information presented here:
- the size of the disk image (2 GiB)
- the number of objects in which those 2 GiB are stored (512)
- how much data each object holds (4 MiB)
- the number of snapshots (0)
- an ID: 16353c4794276b
- a prefix which contains the ID: rbd_data.16353c4794276b. This is the rados object prefix, as we will see in a bit.
- enabled features
- no flags seem to be set
- timestamps for creation, access and modification
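The relationship between the size, the object count and the order is plain arithmetic, which we can quickly check:

```python
# "order 22" means each object holds 2**22 bytes.
order = 22
object_size = 2 ** order            # 4 MiB per object
image_size = 2 * 2 ** 30            # 2 GiB image

print(object_size // 2 ** 20)       # 4   (MiB per object)
print(image_size // object_size)    # 512 (objects for the whole image)
```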
But let's take a look at the layer below, using the rados tool, to inspect the object store itself. We use rados -p rbd ls to get a list of all the objects in the pool:
```
root@cephtest:~# rados -p rbd ls
rbd_header.163530b41c75e0
rbd_directory
rbd_id.vm-100-disk-1
rbd_info
rbd_object_map.16353c4794276b
rbd_object_map.163530b41c75e0
rbd_trash
rbd_header.16353c4794276b
rbd_id.vm-100-disk-0
rbd_data.163530b41c75e0.0000000000000000
rbd_data.163530b41c75e0.0000000000000001
rbd_data.163530b41c75e0.0000000000000002
rbd_data.163530b41c75e0.0000000000000009
rbd_data.163530b41c75e0.000000000000000a
rbd_data.163530b41c75e0.000000000000000b
rbd_data.163530b41c75e0.0000000000000020
rbd_data.163530b41c75e0.0000000000000021
[...]
```
Did you notice the -p rbd parameter? We will always have to provide the pool name to the rados utility.
As you can see, there are quite a few objects that look like they might contain metadata. The rbd_info and rbd_header objects are likely candidates.
But let's focus on the data objects for now before we dive into the metadata rabbit hole.
Let's look at the data objects belonging to the second disk image.
It has the ID 16353c4794276b, so the data object names should all start with rbd_data.16353c4794276b:

```
root@cephtest:~# rados -p rbd ls | grep rbd_data.16353c4794276b
root@cephtest:~#
```
Looks like there are none available. This is understandable, as we haven't written anything to that disk yet, not even a partition table.
We can write some test data to the disk directly from within the VM:
```
root@testVM:~# echo FOOBAR > /dev/sdb
root@testVM:~#
```
Then check again for data objects:
```
root@cephtest:~# rados -p rbd ls | grep rbd_data.16353c4794276b
rbd_data.16353c4794276b.0000000000000000
root@cephtest:~#
```
As you can see, we got one data object with the ID 0000000000000000. We can access it directly and either save the contents to a file, or print them directly to the CLI if the second parameter is -:

```
root@cephtest:~# rados -p rbd get rbd_data.16353c4794276b.0000000000000000 -
FOOBAR
root@cephtest:~#
```
As you can see, it contains the string we wrote to the disk from inside the VM. Another thing we can try is to copy the same data to another location on the disk:
```
root@testVM:~# dd if=/dev/sdb of=/dev/sdb bs=1M count=1 seek=512
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.076572 s, 13.7 MB/s
root@testVM:~#
```
With this, we copied the first MiB of the disk to an offset of 512 MiB.
If we list all the rbd_data objects for this disk image (I am piping the output to sort so that the objects are listed in the correct order):

```
root@cephtest:~# rados -p rbd ls | grep rbd_data.16353c4794276b | sort
rbd_data.16353c4794276b.0000000000000000
rbd_data.16353c4794276b.0000000000000080
root@cephtest:~#
```
The second object has the ID 0000000000000080. Since the ID is stored as hexadecimal, we can convert it to decimal, where it is 128. Given that we copied the data to an offset of 512 MiB, and that each RBD object holds 4 MiB, this is exactly the object ID matching that location on the disk: 512 MiB / 4 MiB = 128. Therefore, we can safely deduce that the object ID for a particular disk image corresponds to the location within the disk image.
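This mapping can be sketched in a few lines of Python (the helper name is my own; the 16-digit hexadecimal suffix matches the object names we saw above):

```python
def data_object_name(image_id: str, offset: int, order: int = 22) -> str:
    """Return the name of the rados object that holds the given byte offset.

    Each data object covers 2**order bytes (4 MiB for order 22), and the
    object number is appended as a 16-digit hexadecimal suffix.
    """
    object_number = offset // (2 ** order)
    return f"rbd_data.{image_id}.{object_number:016x}"

print(data_object_name("16353c4794276b", 0))              # ...0000000000000000
print(data_object_name("16353c4794276b", 512 * 2 ** 20))  # ...0000000000000080
```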
If you run the rados -p rbd get command on any of the rbd_data objects of a disk image that contains real-life data, it will show you the raw disk contents of that part of the disk. In most situations, this will be binary data. Printing it directly to the terminal can leave you with a messed-up terminal. You can try the reset command in such a situation.
The other, non-rbd_data objects are the ones we want to take a closer look at now. To filter out all the rbd_data objects, inverting the grep match helps, for example rados -p rbd ls | grep -v rbd_data:
```
root@cephtest:~# rados -p rbd ls | grep -v rbd_data
rbd_header.163530b41c75e0
rbd_directory
rbd_id.vm-100-disk-1
rbd_info
rbd_object_map.16353c4794276b
rbd_object_map.163530b41c75e0
rbd_trash
rbd_header.16353c4794276b
rbd_id.vm-100-disk-0
```
We seem to have some objects that contain the actual image name, and a few more that contain the ID we saw in the rbd info output earlier.
There are also a few objects that seem to be general RBD objects, as they don't have any image name or ID in their name. Those would be, for example, the rbd_directory, rbd_info and rbd_trash objects. We will ignore them for now.
We can take a look at the data stored in an object. What does the rbd_id.vm-100-disk-0 object contain?

```
root@cephtest:~# rados -p rbd get rbd_id.vm-100-disk-0 -
163530b41c75e0root@cephtest:~#
```
Since there is nothing at the end of the object that the terminal would interpret as a newline, the prompt gets printed right after it, making it a bit hard to read.
But we can see that it contains the ID of the first disk image, 163530b41c75e0.
There is an object map present for each ID (the rbd_object_map.<ID> objects). It contains binary data, which will look like a lot of gibberish when interpreted as text. Don't expect any useful information when you print it directly to the CLI.
The object map stores which objects of an image actually exist and where.
This helps clients to determine if they need to read an object at all.
It also speeds up other operations like cloning, resizing, calculating the size of the image and more.
The rbd_header object is the last one we see that is related to the disk image:

```
root@cephtest:~# rados -p rbd get rbd_header.163530b41c75e0 -
root@cephtest:~#
```
This object seems to be empty. Let's verify it:
```
root@cephtest:~# rados -p rbd stat rbd_header.163530b41c75e0
rbd/rbd_header.163530b41c75e0 mtime 2023-02-11T01:01:43.000000+0100, size 0
```
Indeed, the size is reported as 0. There must be a reason why this object exists. This brings us to the next part.
Xattr and omap
Objects in Ceph can have metadata associated with them. There are two types: extended attributes, xattr, and a key-value store, the so-called omap. Do not confuse an object's omap with the RBD-specific object map.
Checking the xattrs for these objects, I only found one, called lock.rbd_lock, present in the object map and header objects of both disk images.
You can list the xattr of an object with:
```
root@cephtest:~# rados -p rbd listxattr rbd_header.163530b41c75e0
lock.rbd_lock
```
To get the contents of an xattr:
```
root@cephtest:~# rados -p rbd getxattr rbd_header.163530b41c75e0 lock.rbd_lock
mB5auto 140538980354032internalroot@cephtest:~#
```
As you see, it contains some data that is probably mainly binary.
To inspect it more closely, pipe the output into a tool that can print it, for example hexdump -C:
```
root@cephtest:~# rados -p rbd getxattr rbd_header.163530b41c75e0 lock.rbd_lock | hexdump -C
00000000  01 01 6d 00 00 00 01 00  00 00 01 01 21 00 00 00  |..m.........!...|
00000010  08 42 35 16 00 00 00 00  00 14 00 00 00 61 75 74  |.B5..........aut|
00000020  6f 20 31 34 30 35 33 38  39 38 30 33 35 34 30 33  |o 14053898035403|
00000030  32 01 01 2f 00 00 00 00  00 00 00 00 00 00 00 01  |2../............|
00000040  01 01 1c 00 00 00 01 00  00 00 08 c0 cf 67 10 00  |.............g..|
00000050  00 00 02 00 00 00 c0 a8  05 13 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 01 08  00 00 00 69 6e 74 65 72  |...........inter|
00000070  6e 61 6c                                          |nal|
00000073
```
What about omap data? Let's iterate over all objects and print any omap keys that are set. The command prepends the object names with >> and prints the keys if present.
```
root@cephtest:~# for l in $(rados -p rbd ls | grep -v rbd_data); do echo ">> $l" && rados -p rbd listomapkeys $l; done
>> rbd_header.163530b41c75e0
access_timestamp
create_timestamp
features
modify_timestamp
object_prefix
order
size
snap_seq
>> rbd_directory
id_163530b41c75e0
id_16353c4794276b
name_vm-100-disk-0
name_vm-100-disk-1
>> rbd_id.vm-100-disk-1
>> rbd_info
>> rbd_object_map.16353c4794276b
>> rbd_object_map.163530b41c75e0
>> rbd_trash
>> rbd_header.16353c4794276b
access_timestamp
create_timestamp
features
modify_timestamp
object_prefix
order
size
snap_seq
>> rbd_id.vm-100-disk-0
```
There is quite a bit to go through here. We can see that only the rbd_header objects and the rbd_directory object have omap keys set.
Let's take a look at the keys of the rbd_header object of the first disk image. Using the listomapvals sub-command, we get a list of keys and their values in a hexadecimal and ASCII representation:
```
root@cephtest:~# rados -p rbd listomapvals rbd_header.163530b41c75e0
access_timestamp
value (8 bytes) :
00000000  91 cc e6 63 60 b9 3a 26                           |...c`.:&|
00000008

create_timestamp
value (8 bytes) :
00000000  e7 8c e6 63 ba 14 4d 1f                           |...c..M.|
00000008

features
value (8 bytes) :
00000000  3d 00 00 00 00 00 00 00                           |=.......|
00000008

modify_timestamp
value (8 bytes) :
00000000  38 dd e6 63 a6 41 b2 2b                           |8..c.A.+|
00000008

object_prefix
value (27 bytes) :
00000000  17 00 00 00 72 62 64 5f  64 61 74 61 2e 31 36 33  |....rbd_data.163|
00000010  35 33 30 62 34 31 63 37  35 65 30                 |530b41c75e0|
0000001b

order
value (1 bytes) :
00000000  16                                                |.|
00000001

size
value (8 bytes) :
00000000  00 00 00 c0 01 00 00 00                           |........|
00000008

snap_seq
value (8 bytes) :
00000000  00 00 00 00 00 00 00 00                           |........|
00000008
```
We see a lot of the metadata that was shown in the output of rbd info vm-100-disk-0, for example the different timestamps or the size. Most of it is stored as binary data, but we do see the object_prefix in plain text. The features are encoded as a binary map; they are decoded by the rbd info command and shown by their human-readable names.
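As a sketch, the binary values above can be decoded with standard tooling. The feature bit values below are taken from my reading of Ceph's librbd headers, so treat them as an assumption:

```python
import struct

# omap values copied from the listomapvals output above (little-endian)
features, = struct.unpack("<Q", bytes.fromhex("3d00000000000000"))
order = bytes.fromhex("16")[0]
size, = struct.unpack("<Q", bytes.fromhex("000000c001000000"))

# feature bits as defined in Ceph's librbd (assumed values)
FEATURE_BITS = {
    1: "layering", 2: "striping", 4: "exclusive-lock",
    8: "object-map", 16: "fast-diff", 32: "deep-flatten", 64: "journaling",
}
enabled = [name for bit, name in FEATURE_BITS.items() if features & bit]

print(enabled)          # the five features also listed by `rbd info`
print(order)            # 22, i.e. 2**22 = 4 MiB objects
print(size // 2 ** 30)  # image size in GiB
```

Note that the size value decodes to 7 GiB: this header belongs to vm-100-disk-0, the VM's main disk, not to the 2 GiB test disk.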
General RBD objects
By now, we have covered pretty much everything about a single disk image.
But what about the other things?
There are two objects which we have ignored so far: rbd_directory and rbd_info. The rbd_directory object does not contain any data or xattr, but it does contain omap values:
```
root@cephtest:~# rados -p rbd listomapvals rbd_directory
id_163530b41c75e0
value (17 bytes) :
00000000  0d 00 00 00 76 6d 2d 31  30 30 2d 64 69 73 6b 2d  |....vm-100-disk-|
00000010  30                                                |0|
00000011

id_16353c4794276b
value (17 bytes) :
00000000  0d 00 00 00 76 6d 2d 31  30 30 2d 64 69 73 6b 2d  |....vm-100-disk-|
00000010  31                                                |1|
00000011

name_vm-100-disk-0
value (18 bytes) :
00000000  0e 00 00 00 31 36 33 35  33 30 62 34 31 63 37 35  |....163530b41c75|
00000010  65 30                                             |e0|
00000012

name_vm-100-disk-1
value (18 bytes) :
00000000  0e 00 00 00 31 36 33 35  33 63 34 37 39 34 32 37  |....16353c479427|
00000010  36 62                                             |6b|
00000012
```
As you can see, we have a two-way mapping of disk image name to ID and vice versa.
If you run rbd ls, it uses this data for the listing of images.
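The values follow the same length-prefixed encoding we already saw for object_prefix: a 32-bit little-endian length followed by the string bytes. A minimal decoder sketch:

```python
import struct

def decode_rbd_string(raw: bytes) -> str:
    """Decode a length-prefixed string: a 4-byte little-endian length,
    followed by that many bytes of payload."""
    (length,) = struct.unpack_from("<I", raw, 0)
    return raw[4:4 + length].decode("ascii")

# value of the id_163530b41c75e0 key, copied from the hexdump above
raw = bytes.fromhex("0d000000766d2d3130302d6469736b2d30")
print(decode_rbd_string(raw))  # vm-100-disk-0
```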
Finally, let's check the rbd_info object:

```
root@cephtest:~# rados -p rbd get rbd_info - | hexdump -C
00000000  6f 76 65 72 77 72 69 74  65 20 76 61 6c 69 64 61  |overwrite valida|
00000010  74 65 64                                          |ted|
00000013
```
rbd_info contains the string overwrite validated. Why it is here and what it does is not immediately obvious. After some searching in the source code, it became clear that RBD checks whether the pool supports overwrites and stores the result here (commit, bug tracker).
There are no xattr or omap values set.
This is it for a basic Ceph RBD pool. Depending on the features used, it is possible that there are more objects present to store additional metadata needed for them. For example, namespaces, mirroring and so forth.
Summary
Ceph, or more specifically rados, stores data in objects. Objects can have additional metadata associated with them: extended attributes (xattr) and omap data, which is a key-value store. The rados block device (RBD) layer makes use of all three possibilities to store data and metadata.
There are other objects that store the metadata of the disk image. The rbd_id.<name> object stores the ID of the disk image. The omap of the rbd_header.<ID> object stores most of the metadata of the image, for example all the timestamps, set features, snapshots and so on. Additionally, the current locking client is stored as an xattr property.
The rbd_object_map.<ID> object should not be confused with the omap data an object can have. It is used to store a binary map of the actually existing objects for the current image, snapshot or clone. In the rbd_data.<ID>.000… objects, we can find the actual data of the disk image.
RBD stores a two-way mapping of disk image name to ID and vice versa in the omap data of the rbd_directory object. The result of the overwrite functionality test is stored in the rbd_info object.