Aaron Lauterer

Ceph RBD — Where does it store (meta)data?


Introduction

Ceph offers several services on top of the RADOS object store. One of them is the RADOS Block Device (RBD) layer. It provides virtual block devices (disks) that are used by different hypervisors to store the disks of virtual machines.

How the RBD layer uses the underlying object store to hold the actual data of the disk images and the necessary metadata is the topic of this blog post.

Disclaimer: This post sums up my current understanding, and I do not claim that it is 100% correct or exhaustive.

The following examples are done with Ceph installed on Proxmox VE in a hyperconverged manner.

We have one pool, called rbd, which stores the disk images of one virtual machine. By naming the pool rbd, we can omit the pool name whenever we call the rbd utility, as rbd is the default pool if no other is provided.
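For example, the following two invocations are equivalent (a quick illustration; any other pool name would require the explicit -p flag):

root@cephtest:~# rbd ls
root@cephtest:~# rbd -p rbd ls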

Getting information from RBD itself

The rbd utility is the main CLI tool to interact with the RBD layer itself. All the data it shows us must be stored somewhere in the layers below. Let's take a look.

The first thing we can do is list all the images present:

root@cephtest:~# rbd ls
vm-100-disk-0
vm-100-disk-1

Right now, this is the main disk of our test VM and a second, currently empty disk image that we will use to manually store some data for demonstration purposes in a bit.

We can get more information on a specific image as well. Let's do that for the second disk image:

root@cephtest:~# rbd info vm-100-disk-1
rbd image 'vm-100-disk-1':
        size 2 GiB in 512 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 16353c4794276b
        block_name_prefix: rbd_data.16353c4794276b
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features:
        flags:
        create_timestamp: Fri Feb 10 19:29:09 2023
        access_timestamp: Fri Feb 10 20:02:37 2023
        modify_timestamp: Fri Feb 10 19:29:09 2023

We see quite some information presented here: the size of the image and the number of objects used to store it, the object size (order 22 means 2^22 bytes, so 4 MiB objects), the internal ID, the name prefix for the data objects, the enabled features, and several timestamps.

RADOS directly

Now let's take a look at the layer below, using the rados tool to inspect the object store itself. Running rados -p rbd ls gives us a list of all the objects in the pool:

root@cephtest:~# rados -p rbd ls
rbd_header.163530b41c75e0
rbd_directory
rbd_id.vm-100-disk-1
rbd_info
rbd_object_map.16353c4794276b
rbd_object_map.163530b41c75e0
rbd_trash
rbd_header.16353c4794276b
rbd_id.vm-100-disk-0
rbd_data.163530b41c75e0.0000000000000000
rbd_data.163530b41c75e0.0000000000000001
rbd_data.163530b41c75e0.0000000000000002
rbd_data.163530b41c75e0.0000000000000009
rbd_data.163530b41c75e0.000000000000000a
rbd_data.163530b41c75e0.000000000000000b
rbd_data.163530b41c75e0.0000000000000020
rbd_data.163530b41c75e0.0000000000000021
[...]

Did you notice the -p rbd parameter? Unlike the rbd utility, the rados command has no default pool, so we always have to provide the pool name.

As you can see, there are quite a few objects that look like they might contain metadata. The rbd_directory, rbd_object_map, rbd_id, rbd_info, and rbd_header objects are likely candidates. But let's focus on the data objects first, before we dive into the metadata rabbit hole.

Data objects

Let's look at the data objects belonging to the second disk image. It has the ID 16353c4794276b, so the data object names should all start with rbd_data.16353c4794276b.

root@cephtest:~# rados -p rbd ls | grep  rbd_data.16353c4794276b
root@cephtest:~#

Looks like there are none available. This is understandable, as we haven't written anything to that disk yet, not even a partition table.

We can write some test data to the disk directly from within the VM:

root@testVM:~# echo FOOBAR > /dev/sdb
root@testVM:~#

Then check again for data objects:

root@cephtest:~# rados -p rbd ls | grep  rbd_data.16353c4794276b
rbd_data.16353c4794276b.0000000000000000
root@cephtest:~#

As you can see, we got one data object with the ID 0. We can access it directly and either save its contents to a file, or print them to standard output by passing - as the second parameter:

root@cephtest:~# rados -p rbd get rbd_data.16353c4794276b.0000000000000000 -
FOOBAR
root@cephtest:~#

As you can see, it contains the string we wrote to the disk from inside the VM. Another thing we can try is to copy the same data to another location on the disk:

root@testVM:~# dd if=/dev/sdb of=/dev/sdb bs=1M count=1 seek=512
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.076572 s, 13.7 MB/s
root@testVM:~#

With this, we copied the first MiB of the disk to an offset of 512 MiB. Now let's list all the rbd_data objects of this disk image again (piping the output to sort so that the objects are listed in the correct order):

root@cephtest:~# rados -p rbd ls | grep  rbd_data.16353c4794276b | sort
rbd_data.16353c4794276b.0000000000000000
rbd_data.16353c4794276b.0000000000000080
root@cephtest:~#

The second object has the ID 80. The ID is hexadecimal; converted to decimal, it is 128. Given that we copied the data to an offset of 512 MiB, and that each RBD object holds 4 MiB, this is exactly the object ID matching that location on the disk: 512 MiB / 4 MiB = 128. Therefore, we can safely deduce that the object IDs of a disk image correspond directly to locations within that image.
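We can reproduce that calculation with plain shell arithmetic (nothing RBD-specific here, just the offset divided by the object size, printed as the 16-digit hexadecimal suffix used in the object names):

root@cephtest:~# echo $(( 512 / 4 ))
128
root@cephtest:~# printf '%016x\n' $(( 512 / 4 ))
0000000000000080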

If you run the rados -p rbd get command on any rbd_data object of a disk image that contains real-life data, it will show you the raw disk contents of that part of the disk. In most situations, this will be binary data, and printing it directly can leave your terminal in a messed up state. You can try the reset command in such a situation.
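A safer approach is to pipe the object through a tool that renders binary data readably, for example hexdump, which we will also use further down (for our test object, the first line should spell out FOOBAR):

root@cephtest:~# rados -p rbd get rbd_data.16353c4794276b.0000000000000000 - | hexdump -C | less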

Metadata objects

The other, non-rbd_data objects are the ones we want to take a closer look at now. To filter out all the rbd_data objects, inverting the grep match helps, for example rados -p rbd ls | grep -v rbd_data.

root@cephtest:~# rados -p rbd ls | grep -v rbd_data
rbd_header.163530b41c75e0
rbd_directory
rbd_id.vm-100-disk-1
rbd_info
rbd_object_map.16353c4794276b
rbd_object_map.163530b41c75e0
rbd_trash
rbd_header.16353c4794276b
rbd_id.vm-100-disk-0

We seem to have some objects that contain the actual image name, and a few more that contain the ID we saw in the rbd info output earlier. There are also a few objects that seem to be general RBD objects, as they have neither an image name nor an ID in their name. Those are, for example, rbd_directory and rbd_info. We will ignore them for now.

We can take a look at the data stored in an object. What does the rbd_id.vm-100-disk-0 object contain?

root@cephtest:~# rados -p rbd get rbd_id.vm-100-disk-0 -
163530b41c75e0root@cephtest:~#

Since there is nothing at the end of the object that the terminal would interpret as a newline, the prompt gets printed right after it, making it a bit hard to read. But we can see that it contains the ID of the first disk image, 163530b41c75e0.
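Piping the object through hexdump makes it easier to read. Judging by how the same strings are encoded in the rbd_directory omap values further down, the ID is most likely stored with a small length prefix in front of it:

root@cephtest:~# rados -p rbd get rbd_id.vm-100-disk-0 - | hexdump -C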

There is an object map present for the ID as well (rbd_object_map.163530b41c75e0). It contains binary data, which will look like a lot of gibberish when interpreted as text, so don't expect any useful information when you print it directly to the CLI. The object map stores which objects of an image actually exist. This helps clients to determine if they need to read an object at all. It also speeds up other operations like cloning, resizing, calculating the size of the image and more.
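If you are curious anyway, a hex view at least shows how compact it is; as far as I understand, the map only holds a few bits of state per data object:

root@cephtest:~# rados -p rbd get rbd_object_map.163530b41c75e0 - | hexdump -C | head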

The rbd_header object is the last one we see that is related to the disk image:

root@cephtest:~# rados -p rbd get rbd_header.163530b41c75e0 -
root@cephtest:~#

This object seems to be empty. Let's verify it:

root@cephtest:~# rados -p rbd stat rbd_header.163530b41c75e0
rbd/rbd_header.163530b41c75e0 mtime 2023-02-11T01:01:43.000000+0100, size 0

Indeed, the size is reported as 0. There must be a reason why this object exists. This brings us to the next part.

Object metadata

Objects in Ceph can have metadata associated with them. There are two types: extended attributes (xattr), and a key-value store, the so-called omap. Do not confuse an object's omap with the RBD-specific object map.
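Both can be written and read with plain rados subcommands. Here is a minimal sketch on a throwaway object, so the metadata of the real images stays untouched (the object name demo is made up for this example, and the object is removed again at the end):

root@cephtest:~# rados -p rbd create demo
root@cephtest:~# rados -p rbd setomapval demo mykey myvalue
root@cephtest:~# rados -p rbd listomapkeys demo
mykey
root@cephtest:~# rados -p rbd setxattr demo myattr hello
root@cephtest:~# rados -p rbd listxattr demo
myattr
root@cephtest:~# rados -p rbd rm demo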

Checking the xattrs for these objects, I only found one, called lock.rbd_lock, present in the object map and header objects of both disk images.

You can list the xattrs of an object with:

root@cephtest:~# rados -p rbd listxattr rbd_header.163530b41c75e0
lock.rbd_lock

To get the contents of an xattr:

root@cephtest:~# rados -p rbd getxattr rbd_header.163530b41c75e0 lock.rbd_lock
mB5auto 140538980354032internalroot@cephtest:~#

As you can see, it contains some data that is mostly binary. To inspect it more closely, pipe the output into a tool that can render it, for example hexdump:

root@cephtest:~# rados -p rbd getxattr rbd_header.163530b41c75e0 lock.rbd_lock | hexdump -C
00000000  01 01 6d 00 00 00 01 00  00 00 01 01 21 00 00 00  |..m.........!...|
00000010  08 42 35 16 00 00 00 00  00 14 00 00 00 61 75 74  |.B5..........aut|
00000020  6f 20 31 34 30 35 33 38  39 38 30 33 35 34 30 33  |o 14053898035403|
00000030  32 01 01 2f 00 00 00 00  00 00 00 00 00 00 00 01  |2../............|
00000040  01 01 1c 00 00 00 01 00  00 00 08 c0 cf 67 10 00  |.............g..|
00000050  00 00 02 00 00 00 c0 a8  05 13 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 01 08  00 00 00 69 6e 74 65 72  |...........inter|
00000070  6e 61 6c                                          |nal|
00000073

What about omap data? Let's iterate over all objects and print any omap keys that are set. The following command prefixes each object name with >> and prints the object's keys, if there are any.

root@cephtest:~# for l in $(rados -p rbd ls | grep -v rbd_data); do echo ">> $l" && rados -p rbd listomapkeys $l; done
>> rbd_header.163530b41c75e0
access_timestamp
create_timestamp
features
modify_timestamp
object_prefix
order
size
snap_seq
>> rbd_directory
id_163530b41c75e0
id_16353c4794276b
name_vm-100-disk-0
name_vm-100-disk-1
>> rbd_id.vm-100-disk-1
>> rbd_info
>> rbd_object_map.16353c4794276b
>> rbd_object_map.163530b41c75e0
>> rbd_trash
>> rbd_header.16353c4794276b
access_timestamp
create_timestamp
features
modify_timestamp
object_prefix
order
size
snap_seq
>> rbd_id.vm-100-disk-0

There is quite a bit to go through here. We can see that the following objects have omap keys: the rbd_header objects of both disk images and the rbd_directory.

Let's take a look at the keys of the rbd_header object of the first disk image. With the listomapvals subcommand, we get a list of keys and their values in a hexadecimal and ASCII representation.

root@cephtest:~# rados -p rbd listomapvals rbd_header.163530b41c75e0
access_timestamp
value (8 bytes) :
00000000 91 cc e6 63 60 b9 3a 26 |...c`.:&|
00000008

create_timestamp
value (8 bytes) :
00000000 e7 8c e6 63 ba 14 4d 1f |...c..M.|
00000008

features
value (8 bytes) :
00000000 3d 00 00 00 00 00 00 00 |=.......|
00000008

modify_timestamp
value (8 bytes) :
00000000 38 dd e6 63 a6 41 b2 2b |8..c.A.+|
00000008

object_prefix
value (27 bytes) :
00000000 17 00 00 00 72 62 64 5f 64 61 74 61 2e 31 36 33 |....rbd_data.163|
00000010 35 33 30 62 34 31 63 37 35 65 30 |530b41c75e0|
0000001b

order
value (1 bytes) :
00000000 16 |.|
00000001

size
value (8 bytes) :
00000000 00 00 00 c0 01 00 00 00 |........|
00000008

snap_seq
value (8 bytes) :
00000000 00 00 00 00 00 00 00 00 |........|
00000008

We see a lot of the metadata that was shown in the output of rbd info vm-100-disk-0, for example the different timestamps or the size. Most of it is stored as binary data, but the object_prefix and a snap_seq are directly readable. The features are encoded as a binary map; the rbd info command decodes them and shows their human-readable names.
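Some of the binary values can be decoded by hand. They are stored little-endian, so the size bytes 00 00 00 c0 01 00 00 00 read as 0x01c0000000, and the feature bits (taken from the Ceph source: layering = 1, exclusive-lock = 4, object-map = 8, fast-diff = 16, deep-flatten = 32) add up to the same feature set that rbd info showed earlier:

root@cephtest:~# printf '%d\n' 0x3d   # features: 61 = 1 + 4 + 8 + 16 + 32
61
root@cephtest:~# printf '%d\n' 0x16   # order: 2^22 bytes = 4 MiB objects
22
root@cephtest:~# echo $(( 0x01c0000000 / 2**30 ))   # size in GiB
7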

General RBD objects

We have pretty much covered a single disk image by now. But what about the other objects? There are two that we have ignored so far, the rbd_directory and the rbd_info.

The rbd_directory object does not contain any data or xattrs. But it does contain omap values:

root@cephtest:~# rados -p rbd listomapvals rbd_directory
id_163530b41c75e0
value (17 bytes) :
00000000 0d 00 00 00 76 6d 2d 31 30 30 2d 64 69 73 6b 2d |....vm-100-disk-|
00000010 30 |0|
00000011

id_16353c4794276b
value (17 bytes) :
00000000 0d 00 00 00 76 6d 2d 31 30 30 2d 64 69 73 6b 2d |....vm-100-disk-|
00000010 31 |1|
00000011

name_vm-100-disk-0
value (18 bytes) :
00000000 0e 00 00 00 31 36 33 35 33 30 62 34 31 63 37 35 |....163530b41c75|
00000010 65 30 |e0|
00000012

name_vm-100-disk-1
value (18 bytes) :
00000000 0e 00 00 00 31 36 33 35 33 63 34 37 39 34 32 37 |....16353c479427|
00000010 36 62 |6b|
00000012

As you can see, we have a two-way mapping of disk image name to ID and vice versa. When you run rbd ls, it uses this data to list the images.
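A single mapping can also be fetched directly with the getomapval subcommand, without dumping the whole directory; the returned value is the same length-prefixed string we just saw above:

root@cephtest:~# rados -p rbd getomapval rbd_directory name_vm-100-disk-0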

Finally, let's check the rbd_info object:

root@cephtest:~# rados -p rbd get rbd_info - | hexdump -C
00000000  6f 76 65 72 77 72 69 74  65 20 76 61 6c 69 64 61  |overwrite valida|
00000010  74 65 64                                          |ted|
00000013

The rbd_info object contains the string overwrite validated. Why it is here and what it does is not immediately obvious. After some searching in the source code, it became clear that RBD checks if the pool supports overwrites and stores the result here (commit, bug tracker). There are no xattrs or omap values set.

This is it for a basic Ceph RBD pool. Depending on the features used, there may be more objects present to store the additional metadata they need, for example for namespaces, mirroring and so forth.

Conclusion, TL;DR

Ceph, or more specifically RADOS, stores data in objects. Objects can have additional metadata associated with them: extended attributes (xattr) and omap data, which is a key-value store. The RADOS block device (RBD) layer makes use of all three possibilities to store data and metadata.

There are several objects that store the metadata of a disk image. The rbd_id.<name> object stores the ID of the disk image. The omap of the rbd_header.<ID> object stores most of the metadata of the image, for example all the timestamps, the set features, snapshots and so on. Additionally, the current locking client is stored as an xattr property. The optional rbd_object_map.<ID> object should not be confused with the omap data an object can have; it stores a binary map of the objects that actually exist for the current image, snapshot or clone. In the rbd_data.<ID>.000… objects, we can find the actual data of the disk image.

RBD stores a two-way mapping of disk image name to ID and vice versa in the omap data of the rbd_directory object. The result of the overwrite functionality test is stored in the rbd_info object.

Got any hints or questions? blog@aaronlauterer.com