Ceph RBD — Where does it store (meta)data?
Introduction
Ceph offers different functionality on top of the RADOS object store. One of these is the RADOS block device (RBD) layer. It offers virtual block devices (disks) that different hypervisors can use to store the disks of virtual machines.
How the RBD layer uses the underlying object storage to store the actual data of the disk images and the necessary metadata is the topic of this blog post.
Disclaimer: This post sums up my current understanding, and I do not claim that it is 100% correct or exhaustive.
The following examples are done with Ceph installed on Proxmox VE in a hyperconverged manner.
We have one pool, called rbd, which stores the disk images of one virtual machine. By naming the pool rbd we can omit the pool name whenever we call the rbd utility, as it is the default pool if no other is provided.
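As a side note: on a plain Ceph installation, such a pool could be created and prepared for RBD roughly as follows (a minimal sketch; on Proxmox VE you would normally use the pveceph tooling or the web UI instead):
root@cephtest:~# ceph osd pool create rbd
root@cephtest:~# rbd pool init rbd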
Getting information from RBD itself🔗
The rbd utility is the main CLI tool to interact with the RBD layer itself. All the data it shows us must be stored somewhere in the layers below.
Let's take a look.
The first thing we can do is list all the images present:
root@cephtest:~# rbd ls
vm-100-disk-0
vm-100-disk-1
Right now, there is the main disk of our test VM and a second, currently empty disk image that we will use to manually store some data for demonstration purposes in a bit.
We can get more information on a specific image as well. Let's do that for the second disk image:
root@cephtest:~# rbd info vm-100-disk-1
rbd image 'vm-100-disk-1':
    size 2 GiB in 512 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 16353c4794276b
    block_name_prefix: rbd_data.16353c4794276b
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features:
    flags:
    create_timestamp: Fri Feb 10 19:29:09 2023
    access_timestamp: Fri Feb 10 20:02:37 2023
    modify_timestamp: Fri Feb 10 19:29:09 2023
We see quite a bit of information presented here:
- the size of the disk image (2 GiB)
- the number of objects in which those 2 GiB are stored (512)
- how much data each object holds (4 MiB)
- the number of snapshots (0)
- an ID: 16353c4794276b
- a prefix which contains the ID: rbd_data.16353c4794276b. This is the rados object prefix, as we will see in a bit.
- the enabled features
- no flags seem to be set
- timestamps for creation, access and modification
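If you want to process this information in a script, rbd can also print it in a machine-readable format. A small sketch, assuming jq is installed (the exact JSON field names can vary between Ceph releases):
root@cephtest:~# rbd info --format json vm-100-disk-1 | jq -r .id
16353c4794276b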
Rados directly
But let's take a look at the layer below, using the rados tool to inspect the object store itself. Let's run rados -p rbd ls to get a list of all the objects in the pool:
root@cephtest:~# rados -p rbd ls
rbd_header.163530b41c75e0
rbd_directory
rbd_id.vm-100-disk-1
rbd_info
rbd_object_map.16353c4794276b
rbd_object_map.163530b41c75e0
rbd_trash
rbd_header.16353c4794276b
rbd_id.vm-100-disk-0
rbd_data.163530b41c75e0.0000000000000000
rbd_data.163530b41c75e0.0000000000000001
rbd_data.163530b41c75e0.0000000000000002
rbd_data.163530b41c75e0.0000000000000009
rbd_data.163530b41c75e0.000000000000000a
rbd_data.163530b41c75e0.000000000000000b
rbd_data.163530b41c75e0.0000000000000020
rbd_data.163530b41c75e0.0000000000000021
[...]
Did you notice the -p rbd parameter? We will always have to provide the pool name to the rados command.
As you can see, there are quite a few objects that look like they might contain metadata. The rbd_directory, rbd_object_map, rbd_id, rbd_info, and rbd_header objects are likely candidates.
But let's focus on the data objects for now before we dive into the metadata rabbit hole.
Data objects
Let's look at the data objects belonging to the second disk image.
It has the ID 16353c4794276b, so the data object names should all start with rbd_data.16353c4794276b.
root@cephtest:~# rados -p rbd ls | grep rbd_data.16353c4794276b
root@cephtest:~#
Looks like there are none available. This is understandable, as we haven't written anything to that disk yet, not even a partition table.
We can write some test data to the disk directly from within the VM:
root@testVM:~# echo FOOBAR > /dev/sdb
root@testVM:~#
Then check again for data objects:
root@cephtest:~# rados -p rbd ls | grep rbd_data.16353c4794276b
rbd_data.16353c4794276b.0000000000000000
root@cephtest:~#
As you can see, we got one data object with the ID 0. We can access it directly and either save the contents to a file, or print them straight to the CLI if the second parameter is -:
root@cephtest:~# rados -p rbd get rbd_data.16353c4794276b.0000000000000000 -
FOOBAR
root@cephtest:~#
As you can see, it contains the string we wrote to the disk from inside the VM. Another thing we can try is to copy the same data to another location on the disk:
root@testVM:~# dd if=/dev/sdb of=/dev/sdb bs=1M count=1 seek=512
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.076572 s, 13.7 MB/s
root@testVM:~#
With this, we copied the first MiB of the disk to an offset of 512 MiB.
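A side note before we list the objects: RADOS objects only occupy as much space as was actually written to them, they are not preallocated to the full 4 MiB. A stat on our first data object should report just 7 bytes, FOOBAR plus the newline that echo appends, with output along these lines:
root@cephtest:~# rados -p rbd stat rbd_data.16353c4794276b.0000000000000000
rbd/rbd_data.16353c4794276b.0000000000000000 mtime <timestamp>, size 7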
If we list all the rbd_data objects for this disk image (I am piping the output to sort so that the objects are listed in the correct order):
root@cephtest:~# rados -p rbd ls | grep rbd_data.16353c4794276b | sort
rbd_data.16353c4794276b.0000000000000000
rbd_data.16353c4794276b.0000000000000080
root@cephtest:~#
The second object has the ID 80. The IDs are hexadecimal; converted to decimal, this is 128.
Given that we copied the data to an offset of 512 MiB, and that each RBD object holds 4 MiB, this is exactly the object ID matching that location on the disk: 512 MiB / 4 MiB = 128.
Therefore, we can safely deduce that the object ID corresponds to the location within the disk image.
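We can let the shell do the math for us; a quick sketch:
root@cephtest:~# printf '%d\n' 0x80    # object ID in decimal
128
root@cephtest:~# echo $(( 512 / 4 ))   # offset in MiB / object size in MiB
128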
If you run the rados -p rbd get command on any of the rbd_data objects of a disk image that contains real-life data, it will show you the raw contents of that part of the disk. In most situations, this will be binary data. Printing it directly to the terminal can leave you with a messed-up terminal; the reset command helps in such a situation.
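A safer way to inspect such objects is to pipe the output through hexdump right away. For our small test object, that looks roughly like this:
root@cephtest:~# rados -p rbd get rbd_data.16353c4794276b.0000000000000000 - | hexdump -C
00000000  46 4f 4f 42 41 52 0a                              |FOOBAR.|
00000007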
Metadata objects
The other, non rbd_data objects are the ones we want to take a closer look at now. To filter out all the rbd_data objects, inverting the grep match helps, for example rados -p rbd ls | grep -v rbd_data.
root@cephtest:~# rados -p rbd ls | grep -v rbd_data
rbd_header.163530b41c75e0
rbd_directory
rbd_id.vm-100-disk-1
rbd_info
rbd_object_map.16353c4794276b
rbd_object_map.163530b41c75e0
rbd_trash
rbd_header.16353c4794276b
rbd_id.vm-100-disk-0
We seem to have some objects that contain the actual image name, and a few more that contain the ID that we saw in the rbd info output earlier.
There are also a few objects that seem to be general RBD objects, as they don't have any image name or ID in their name. Those would be, for example, the rbd_directory and rbd_info.
We will ignore them for now.
We can take a look at the data stored in an object. What does the rbd_id.vm-100-disk-0 object contain?
root@cephtest:~# rados -p rbd get rbd_id.vm-100-disk-0 -
163530b41c75e0root@cephtest:~#
Since there is nothing at the end of the object that the terminal would interpret as a newline, the prompt gets printed right after it, making it a bit hard to read.
But we can see that it contains the ID of the first disk image, 163530b41c75e0.
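A small tip: appending an echo gets the prompt back onto its own line:
root@cephtest:~# rados -p rbd get rbd_id.vm-100-disk-0 -; echo
163530b41c75e0
root@cephtest:~#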
There is an object map present for that ID (rbd_object_map.163530b41c75e0). It contains binary data, which will look like a lot of gibberish when interpreted as text, so don't expect any useful information when you print it directly to the CLI.
The object map stores which objects of an image actually exist. This helps clients determine whether they need to read an object at all. It also speeds up other operations like cloning, resizing, calculating the size of the image and more.
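Since the object map is only an optimization, it can be verified against the actual objects and rebuilt if it ever becomes invalid. The rbd utility has sub-commands for this; a sketch (check rbd help object-map in your release):
root@cephtest:~# rbd object-map check vm-100-disk-1
root@cephtest:~# rbd object-map rebuild vm-100-disk-1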
The rbd_header object is the last one we see that is related to the disk image:
root@cephtest:~# rados -p rbd get rbd_header.163530b41c75e0 -
root@cephtest:~#
This object seems to be empty. Let's verify it:
root@cephtest:~# rados -p rbd stat rbd_header.163530b41c75e0
rbd/rbd_header.163530b41c75e0 mtime 2023-02-11T01:01:43.000000+0100, size 0
Indeed, the size is reported as 0. There must be a reason why this object exists. This brings us to the next part.
Object metadata
Objects in Ceph can have metadata associated with them. There are two types: extended attributes (xattr) and a key-value store, the so-called omap. Do not confuse an object's omap with the RBD-specific object map.
Checking the xattrs for these objects, I only found one, called lock.rbd_lock, present on the object map and header objects of both disk images.
You can list the xattr of an object with:
root@cephtest:~# rados -p rbd listxattr rbd_header.163530b41c75e0
lock.rbd_lock
To get the contents of an xattr:
root@cephtest:~# rados -p rbd getxattr rbd_header.163530b41c75e0 lock.rbd_lock
mB5auto 140538980354032internalroot@cephtest:~#
As you can see, it contains some data that is probably mainly binary. To inspect it more closely, pipe the output into a tool that can print it, for example hexdump:
root@cephtest:~# rados -p rbd getxattr rbd_header.163530b41c75e0 lock.rbd_lock | hexdump -C
00000000 01 01 6d 00 00 00 01 00 00 00 01 01 21 00 00 00 |..m.........!...|
00000010 08 42 35 16 00 00 00 00 00 14 00 00 00 61 75 74 |.B5..........aut|
00000020 6f 20 31 34 30 35 33 38 39 38 30 33 35 34 30 33 |o 14053898035403|
00000030 32 01 01 2f 00 00 00 00 00 00 00 00 00 00 00 01 |2../............|
00000040 01 01 1c 00 00 00 01 00 00 00 08 c0 cf 67 10 00 |.............g..|
00000050 00 00 02 00 00 00 c0 a8 05 13 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 01 08 00 00 00 69 6e 74 65 72 |...........inter|
00000070 6e 61 6c |nal|
00000073
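Instead of decoding this by hand, we can ask the RBD layer itself to present the lock in a readable form with the lock ls sub-command; its listing should contain the lock ID auto 140538980354032 that we can spot in the raw dump above:
root@cephtest:~# rbd lock ls vm-100-disk-0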
What about omap data?
Let's iterate over all objects and print any omap keys that are set.
The command prepends the object names with >> and prints any keys that are present.
root@cephtest:~# for l in $(rados -p rbd ls | grep -v rbd_data); do echo ">> $l" && rados -p rbd listomapkeys $l; done
>> rbd_header.163530b41c75e0
access_timestamp
create_timestamp
features
modify_timestamp
object_prefix
order
size
snap_seq
>> rbd_directory
id_163530b41c75e0
id_16353c4794276b
name_vm-100-disk-0
name_vm-100-disk-1
>> rbd_id.vm-100-disk-1
>> rbd_info
>> rbd_object_map.16353c4794276b
>> rbd_object_map.163530b41c75e0
>> rbd_trash
>> rbd_header.16353c4794276b
access_timestamp
create_timestamp
features
modify_timestamp
object_prefix
order
size
snap_seq
>> rbd_id.vm-100-disk-0
There is quite a bit to go through here. We can see that the following objects have omap keys:
- rbd_directory
- rbd_header.163530b41c75e0
- rbd_header.16353c4794276b
Let's take a look at the keys of the rbd_header object of the first disk image. With the listomapvals sub-command, we get a list of keys and their values in a hexadecimal and ASCII representation.
root@cephtest:~# rados -p rbd listomapvals rbd_header.163530b41c75e0
access_timestamp
value (8 bytes) :
00000000 91 cc e6 63 60 b9 3a 26 |...c`.:&|
00000008

create_timestamp
value (8 bytes) :
00000000 e7 8c e6 63 ba 14 4d 1f |...c..M.|
00000008

features
value (8 bytes) :
00000000 3d 00 00 00 00 00 00 00 |=.......|
00000008

modify_timestamp
value (8 bytes) :
00000000 38 dd e6 63 a6 41 b2 2b |8..c.A.+|
00000008

object_prefix
value (27 bytes) :
00000000 17 00 00 00 72 62 64 5f 64 61 74 61 2e 31 36 33 |....rbd_data.163|
00000010 35 33 30 62 34 31 63 37 35 65 30 |530b41c75e0|
0000001b

order
value (1 bytes) :
00000000 16 |.|
00000001

size
value (8 bytes) :
00000000 00 00 00 c0 01 00 00 00 |........|
00000008

snap_seq
value (8 bytes) :
00000000 00 00 00 00 00 00 00 00 |........|
00000008
We see a lot of the metadata that was shown in the output of rbd info vm-100-disk-0, for example the different timestamps or the size.
Most of it is stored as binary data, but we can recognize the object_prefix and a snap_seq.
The features are encoded as a bitmask; the rbd info command decodes them and shows their human-readable names.
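We can decode some of these values by hand. The bytes are stored in little-endian order, so the features value reads as 0x3d, and each enabled feature corresponds to one bit of that number (the bit values below match the RBD feature flags defined in the Ceph source). The same approach works for the order and the size; a quick sketch:
root@cephtest:~# printf '%d\n' 0x3d        # features
61
root@cephtest:~# # 61 = 1 (layering) + 4 (exclusive-lock) + 8 (object-map)
root@cephtest:~# #      + 16 (fast-diff) + 32 (deep-flatten)
root@cephtest:~# printf '%d\n' 0x16        # order
22
root@cephtest:~# echo $(( 1 << 22 ))       # object size in bytes
4194304
root@cephtest:~# printf '%d\n' 0x1c0000000 # size: 00 00 00 c0 01 ... read backwards
7516192768
4194304 bytes are the 4 MiB object size, and 7516192768 bytes are 7 GiB, the size of the first disk image.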
General RBD objects
We have pretty much covered a single disk image by now. But what about the other objects? There are two that we ignored so far: the rbd_directory and the rbd_info.
The rbd_directory object does not contain any data or xattr. But it does contain omap values:
root@cephtest:~# rados -p rbd listomapvals rbd_directory
id_163530b41c75e0
value (17 bytes) :
00000000 0d 00 00 00 76 6d 2d 31 30 30 2d 64 69 73 6b 2d |....vm-100-disk-|
00000010 30 |0|
00000011

id_16353c4794276b
value (17 bytes) :
00000000 0d 00 00 00 76 6d 2d 31 30 30 2d 64 69 73 6b 2d |....vm-100-disk-|
00000010 31 |1|
00000011

name_vm-100-disk-0
value (18 bytes) :
00000000 0e 00 00 00 31 36 33 35 33 30 62 34 31 63 37 35 |....163530b41c75|
00000010 65 30 |e0|
00000012

name_vm-100-disk-1
value (18 bytes) :
00000000 0e 00 00 00 31 36 33 35 33 63 34 37 39 34 32 37 |....16353c479427|
00000010 36 62 |6b|
00000012
As you can see, we have a two-way mapping of disk image name to ID and vice versa. When you run rbd ls, it uses this data to produce the listing of images.
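We can also resolve a single mapping by hand with the getomapval sub-command, for example to look up the ID of the first disk image. It should dump the same value we saw above; note the four leading bytes, a little-endian length prefix (0x0e = 14 characters), a common Ceph encoding pattern:
root@cephtest:~# rados -p rbd getomapval rbd_directory name_vm-100-disk-0
value (18 bytes) :
00000000 0e 00 00 00 31 36 33 35 33 30 62 34 31 63 37 35 |....163530b41c75|
00000010 65 30 |e0|
00000012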
Finally, let's check the rbd_info object:
root@cephtest:~# rados -p rbd get rbd_info - | hexdump -C
00000000 6f 76 65 72 77 72 69 74 65 20 76 61 6c 69 64 61 |overwrite valida|
00000010 74 65 64 |ted|
00000013
The rbd_info object contains the string overwrite validated.
Why it is here and what it does is not immediately obvious. After some searching in the source code, it became clear that RBD checks whether the pool supports overwrites and stores the result here (commit, bug tracker).
There are no xattr or omap values set.
This is it for a basic Ceph RBD pool. Depending on the features used, it is possible that more objects are present to store the additional metadata they need, for example for namespaces, mirroring and so forth.
Conclusion, TL;DR
Ceph, or more specifically rados, stores data in objects. Objects can have additional metadata associated with them: extended attributes (xattr) and omap data, a key-value store. The RADOS block device (RBD) layer makes use of all three possibilities to store data and metadata.
Besides the data objects, there are several objects that store the metadata of a disk image.
The rbd_id.<name> object stores the ID of the disk image.
The omap of the rbd_header.<ID> object stores most of the metadata of the image, for example all the timestamps, the enabled features, snapshots and so on. Additionally, the current locking client is stored as an xattr property.
The optional rbd_object_map.<ID> object should not be confused with the omap data an object can have. It stores a binary map of the actually existing objects of the current image, snapshot or clone.
In the rbd_data.<ID>.000… objects, we can find the actual data of the disk image.
RBD stores a two-way mapping of disk image name to ID and vice versa in the omap data of the rbd_directory object.
The result of the overwrite functionality test is stored in the rbd_info object.