ZFS Writeback Cache with LVM Caching

Dzeri, 04-04-2021, Storage

Using LVM to add extra caching to ZFS in Linux

Linux, LVM, Storage, ZFS

Use case

I got a somewhat old second-hand PC from a friend, which I decided to turn into a NAS/Home Server that my devices can back up to. I have two 1TB 7200rpm HDDs from WD and one Sabrent 256GB NVMe SSD. I wanted to use btrfs or ZFS over plain RAID1 because they protect against bit rot in addition to drive failure, and have features like snapshotting, deduplication and encryption. I plan on making incremental backups of my laptop and of my Raspberry Pi hosting Nextcloud every week or so, meaning I don't expect big sequential write sessions after the initial full backups have been made.

Now, using just the two HDDs on their own would be fine on my civilian barely-gigabit network, but I want to make use of the insanely fast NVMe drive by turning a part of it into a cache, both for writing and reading. If you are unfamiliar with writeback caches, the idea is that you have a smaller volume that you can read from and write to very fast. This data is eventually transferred to your slow drives, meaning that you get speed and resilience at the same time, provided you don't transfer more than your cache can handle at once. Also, if you use some files frequently, your computer doesn't have to fetch them from the slow drives each time, which is a big deal because random reads are the biggest bottleneck when it comes to traditional spinning-platter HDDs.

Issues and criticism

Chances are you've already seen some internet discussions where people criticize this setup because "the use case is just not there", "it's overkill", "you still have the x bottleneck" and so on. Well, damn it, this is my blog, and I still believe it's worth doing!

Before we begin though, there is one actual drawback to this setup: if your cache dies or gets corrupted before it has written out new data, that data will be lost or corrupted, and ZFS or btrfs can't do anything about it. For me this risk is perfectly acceptable, as after the initial backups are complete, I can only potentially lose one incremental backup per week, and even that is extremely unlikely. If you wish to be extra safe, however, even your cache can be protected by RAID equivalents in ZFS or btrfs. Buying a UPS is of course always a good idea, but where I live we haven't had a single black- or brown-out in the last 6 years, so I'm not too worried in that regard.

One more thing you might be asking is "wait, doesn't ZFS support caching out of the box?". Well, kind of. A read cache (the L2ARC) is supported and it absolutely does work, but there is no real support for a writeback cache. The closest you can get is putting the ZIL on your fast storage as a separate log device (SLOG), and while this does bring a write speed improvement (in my tests it was about 30%), I believe we can do better. When it comes to btrfs, I'm not aware of any caching strategies that can be manually set up.
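
Concretely, that stock ZFS tweak is just a log vdev attached to the pool, something like the following (the partition name is a placeholder, and the same command shows up again in the walkthrough below):

    sudo zpool add tank log /dev/nvme0n1p6    # SLOG: holds the ZIL, only helps synchronous writes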

The set-up

Before we begin, I have to admit I'm an absolute beginner when it comes to ZFS and for all I know, I'm doing something completely wrong. In any case, there is room for improvement and I'll address some things that I'm aware of in the appropriate section.

The structure

Starting from the bottom, we have the 3 actual drives: the two HDDs, and the NVMe SSD with one of its physical partitions acting as an LVM Physical Volume and two other partitions serving as the read cache (L2ARC) and SLOG in a ZFS pool. The zpool itself is built from the two HDDs in a mirrored configuration. At this point you'll notice that I'm using both ZFS's caching capabilities and a separate LVM cache on top of that. Theoretically this should increase cache hits when reading, since I basically have 2-tier caching where both tiers have the same speed. On the writing side, the separate SLOG should slightly increase the speed at which new data is offloaded from the LVM cache, but should not make a difference for small writes. I explore this theory in the benchmarking section. In short, the addition of ZFS caches does seem to make a difference, but the findings are pretty inconsistent.
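
Sketched out from top to bottom, the stack looks roughly like this (the device and volume names are the ones used in the walkthrough below):

    ext4, mounted at /media/fastTank
      LVM cached LV: zfsCacheVG/slowZFS
        cache pool: lvCache + lvCacheMeta    (on the NVMe LVM PV)
        origin:     ZFS volume tank/vdisk    (a zvol, i.e. a plain block device)
          zpool tank: mirror of the two HDDs (/dev/sda + /dev/sdb)
            log (SLOG):     one NVMe partition
            cache (L2ARC):  another NVMe partition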

The storage structure

The LVM cachepool is built from two LVM volumes residing on the NVMe SSD. The first one holds the actual data, and the other holds the metadata. This is just how LVM does caching and I won't go further into detail. Just know that hosting these two volumes on separate physical drives could further increase performance.
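
If you happen to have a second fast device to spare, splitting them would look something like this (the second device name here is made up for illustration):

    sudo vgextend zfsCacheVG /dev/sdc1                            # hypothetical second SSD
    sudo lvcreate -L 80G -n lvCache zfsCacheVG /dev/nvme0n1p8     # cache data on the NVMe
    sudo lvcreate -L 200M -n lvCacheMeta zfsCacheVG /dev/sdc1     # cache metadata on the other device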

Going further up, the ZPool exposes a ZFS volume instead of a file system. This volume looks like a regular block device to the system, but still has all of ZFS's nice features included in the background.

Finally, we bring it all together by creating an LVM cache volume, built from the previously created cachepool and the ZFS volume. This is again abstracted to a block device and we can format it with any file system we want (even btrfs, which would possibly bring peace between the two camps, or more realistically, make no sense).

How to reproduce it

  1. Format your NVMe how you want.

    I used the KDE partition manager to create an LVM PV partition on the disk, since for some reason fdisk didn't want to cooperate. If you want to avoid my mistakes, use an extended partition and subdivide it into the LVM PV and two logical partitions for the ZFS caches.
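
    If you prefer the command line, a parted sketch along these lines should work. The sizes are just examples, and the partition numbers assume an otherwise empty disk (mine already had other partitions, which is why different numbers show up later):

    sudo parted /dev/nvme0n1 -- mklabel msdos
    sudo parted /dev/nvme0n1 -- mkpart extended 1MiB 100%
    sudo parted /dev/nvme0n1 -- mkpart logical 2MiB 16GiB      # becomes p5, for the SLOG
    sudo parted /dev/nvme0n1 -- mkpart logical 16GiB 96GiB     # becomes p6, for the read cache
    sudo parted /dev/nvme0n1 -- mkpart logical 96GiB 100%      # becomes p7, for the LVM PV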

  2. Create the ZFS pool

    sudo zpool create -m /media/tank tank mirror /dev/sda /dev/sdb
    

    This will create a zpool called tank and mount it under /media/tank as a ZFS filesystem. I haven't been able to figure out how to skip the creation of the actual FS from the pool, but it's no big deal, since the filesystem will dynamically shrink once we create the actual ZFS volume.

    Optional: disable mounting of pool with sudo zfs set mountpoint=none tank

  3. Add the caches to your ZFS pool

    sudo zpool add tank log /dev/nvme0n1p6
    sudo zpool add tank cache /dev/nvme0n1p7
    
  4. Create an LVM Volume Group

    sudo vgcreate zfsCacheVG /dev/nvme0n1p8
    
    In my case, /dev/nvme0n1p8 is the LVM physical volume on the SSD.

  5. Create the necessary LVM volumes

    sudo lvcreate -L 80G -n lvCache zfsCacheVG /dev/nvme0n1p8
    sudo lvcreate -L 200M -n lvCacheMeta zfsCacheVG /dev/nvme0n1p8
    
    You can set appropriate sizes. Note that the zfsWriteCache (SLOG) is probably far too big for my needs.

  6. Create the LVM cache pool

    sudo lvconvert --type cache-pool --cachemode writeback --poolmetadata zfsCacheVG/lvCacheMeta zfsCacheVG/lvCache
    
  7. Expose the ZFS volume

    sudo zfs create -V 860G tank/vdisk
    
    Again, adjust the size to your needs.

  8. Create an LVM Physical Volume on top of the ZFS volume

    sudo pvcreate /dev/tank/vdisk
    
  9. Add the new Physical Volume to the LVM Volume Group

    sudo vgextend zfsCacheVG /dev/tank/vdisk
    
  10. Create an LVM Logical Volume from the newly created PV

    sudo lvcreate -L 859G -n slowZFS zfsCacheVG /dev/tank/vdisk
    
    This creates an LV with the name slowZFS.

  11. Create the LVM Cache Volume

    sudo lvconvert --type cache --cachepool zfsCacheVG/lvCache zfsCacheVG/slowZFS
    
  12. Format the new drive

    sudo mkfs.ext4 /dev/zfsCacheVG/slowZFS
    
    I chose ext4 as my FS, but something even more lightweight could be beneficial. Also note the name slowZFS. This is indeed the new "fast" cached volume, but it inherited the name from the original LVM Logical Volume.

  13. Mount it and enjoy!

    sudo mount /dev/zfsCacheVG/slowZFS /media/fastTank
    

And we are done! It would also be a good idea to add the mount entry to /etc/fstab so it always gets mounted on boot. At this point you can start enjoying your Frankenstein storage architecture. And they said it couldn't be done...
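
The fstab entry would look something like the following (nofail is my addition, so that a cache volume that fails to activate doesn't hang the boot):

    /dev/zfsCacheVG/slowZFS  /media/fastTank  ext4  defaults,nofail  0  2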

Benchmarking

Ok, this is where I have to admit I dropped the ball. Benchmarking is a topic that can run extremely deep, and I just wanted to see some quick numbers. I basically tried some random scripts I found on the internet and didn't control for things like the system's RAM cache. To make matters worse, this blog post was written across several months, and at this point I can't remember if I had set everything up like the file names suggest, or if some of these runs were a fluke. For this reason I'm not going to post any concrete benchmarking results. All I can say is that the performance seemed to be better in general.
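
For what it's worth, dropping the page cache between runs would have been the bare minimum to control for the RAM cache, something like:

    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches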

In terms of how I did it, I wanted something similar to CrystalDiskMark on Linux, so I could roughly estimate the setup's performance.

I used the following script:

#!/bin/bash

fio --loops=5 --size=1000m --filename=$1 --stonewall --ioengine=libaio --direct=1 \
  --name=Seqread --bs=1m --rw=read \
  --name=Seqwrite --bs=1m --rw=write \
  --name=512Kread --bs=512k --rw=randread \
  --name=512Kwrite --bs=512k --rw=randwrite \
  --name=4kQD32read --bs=4k --iodepth=32 --rw=randread \
  --name=4kQD32write --bs=4k --iodepth=32 --rw=randwrite
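
The script takes the target file path as its only argument, so a run against the cached volume looks something like this (the script name is arbitrary, and fio will create the ~1GB test file itself):

    ./diskmark.sh /media/fastTank/fio-testfile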

Since I'm already using this server, I'll need to do a proper suite of benchmarks the next time I attempt this crazy idea.