Linux md RAID benchmarks on 4 Kioxia PM5s

I had access to some machines with a load of brand new Kioxia PM5 (read-intensive SAS3) SSDs, so after finishing my production benchmarking I decided to grab some academic numbers to follow up on my prior md RAID layouts writeup.

I’ll be benchmarking just four drives here, as the PCIe 3.0 x8 HBA they’re hooked up to is a limiting factor for anything wider. These drives could really use an NVMe interface (that’s what the CM5 is!).

I’m testing with the default writeback caching enabled, but bypassing the page cache (--direct=1). This is perfectly fine and reflects a typical production workload on enterprise drives: these PM5s have power-loss protection, so using their caches is safe - the drives guarantee they’ll flush the write cache to NAND even if power is lost.

These drives have a 12 Gb/s SAS3 interface, and are connected to a single LSI 3008 via the chassis’ SAS expander (so they’re running in single-port mode rather than dual-port mode with multipath - meaning per-drive throughput is about half of what the spec sheets quote, and half of what the drives could do). The rest of the test machines are:

The testing node was completely empty (no guests or other workloads).

These results clearly illustrate the architecture of different md RAID layouts. Let’s dive in!

fio jobs

See gist. Essentially:

Single drive

We’ll start off with a baseline: a single 1.92 TB Kioxia PM5 (KPM5XRUG1T92).

Here we see strong sequential I/O (about as good as can be expected from a drive on a single 12 Gb/s SAS interface), along with healthy random I/O performance. All in all, a decent result for some cheaper drives.

4-wide mirror

Our first RAID configuration will be a four-way mirror. This lets us parallelize reads between the four drives, but writes will be limited by the performance of a single disk, since every write must land on all four members. Our random reads will also parallelize nicely once we’ve got more than one thread doing the reading.
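A toy model makes the asymmetry obvious. This is an illustration, not md’s actual read-balancing logic (md picks members by position and load; round-robin here is just a stand-in):

```python
N = 4  # mirror members

def mirror_ops(n_reads, n_writes):
    """Count physical I/Os per member for an N-way mirror.
    Reads are balanced across members (round-robin stand-in for md's
    balancing); every write is duplicated to all N members."""
    per_member = [0] * N
    for i in range(n_reads):
        per_member[i % N] += 1        # one member services each read
    for _ in range(n_writes):
        for m in range(N):
            per_member[m] += 1        # writes fan out to every member
    return per_member

# 1000 reads cost each disk only 250 ops; 1000 writes cost each disk
# the full 1000 - aggregate write throughput is capped at one disk's worth.
print(mirror_ops(1000, 1000))  # -> [1250, 1250, 1250, 1250]
```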

Note how our random Q1T1 I/O (e.g., ad-hoc, just testing latency) is significantly degraded the second we put the multiple devices (md) driver between fio and the disks. This will be a consistent factor across all of our md RAID results.

mdadm --create /dev/md2 --level=1 --raid-devices=4 /dev/sdk /dev/sdl /dev/sdm /dev/sdn --bitmap=none

4-wide RAID 10

Next, we’ll switch to striped mirrors. In this configuration, we’ll be able to parallelize reads between all four drives, though RAID 10 incurs slightly more overhead than a dumb mirror for our reads.

And, when writing, we’ll see an improvement - not quite double the performance of a single drive, but we can effectively use our two striped pairs for writing at once.
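A sketch of how the n2 (“near”) layout used below places data - my reading of the layout, not md’s code - each logical chunk gets two copies on an adjacent pair of devices, giving two independent mirrored pairs to stripe over:

```python
DISKS = 4
COPIES = 2  # --layout=n2

def n2_devices(chunk):
    """Devices holding the copies of a logical chunk in the 'near'
    RAID 10 layout (illustrative sketch)."""
    return [(chunk * COPIES + c) % DISKS for c in range(COPIES)]

for c in range(4):
    print(c, n2_devices(c))
# chunk 0 -> [0, 1], chunk 1 -> [2, 3],
# chunk 2 -> [0, 1], chunk 3 -> [2, 3]
```

Writes touch only one pair per chunk (hence roughly 2x a single drive), while reads can be serviced by any of the four members.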

mdadm --create /dev/md3 --level=10 --raid-devices=4 --layout=n2 --chunk=256 /dev/sdo /dev/sdp /dev/sdq /dev/sdr --bitmap none

4-wide RAID 0

Striping is great. We’ll get darn near four times the performance of a single disk in just about everything but Q1T1 random I/O.

Remember, everything that’s big enough is parallelized across as many drives as possible! RAID 0 is dead simple, and stupid fast as a result. If you don’t mind restoring from backup when a drive dies, or just need a cache, this is definitely the way to go.
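The address math behind striping fits in a few lines. This is illustrative (using the chunk=256 setting from the mdadm command below), not md’s implementation:

```python
CHUNK = 256 * 1024  # --chunk=256 (KiB)
DISKS = 4

def raid0_map(offset):
    """Map a byte offset in the array to (member disk, offset on that disk)."""
    chunk = offset // CHUNK
    disk = chunk % DISKS                              # round-robin chunks
    disk_offset = (chunk // DISKS) * CHUNK + offset % CHUNK
    return disk, disk_offset

print(raid0_map(0))              # -> (0, 0)
print(raid0_map(CHUNK))          # -> (1, 0)
print(raid0_map(4 * CHUNK))      # -> (0, 262144): wraps to the next row
```

A 1M sequential I/O spans all four members at once; a 4K Q1T1 random read touches exactly one, which is why that’s the only test that doesn’t scale.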

mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=256 /dev/sdc /dev/sdd /dev/sde /dev/sdf

4-wide RAID 5

Recall that RAID 5 is effectively a stripe with one parity chunk per stripe (erasure coding via XOR, so we can reconstruct what we lost from the remaining data and the parity). So we’re going to be able to highly parallelize our read operations a la RAID 0 (data blocks live on all four drives), but we’re going to be stuck paying for that parity on every write.

And, while we’re here, let’s talk about write amplification. For large writes, RAID 5 writes a full stripe at once - with our four disks, three data chunks and one parity chunk per stripe - and the overhead is modest. The parity chunk contains what’s needed to reconstruct the stripe if a drive holding data chunks is lost, and it’s computed positionally: each 4K of parity is the XOR of the corresponding 4K in every data chunk.

This means that, when we’re doing tiny I/Os (like our random 4K operations), updating a single block on disk requires us to:

  1. Read the old data block
  2. Read the old parity block
  3. Calculate the new parity block (old parity XOR old data XOR new data)
  4. Write the new data block (including our changes)
  5. Write the new parity block

As you might guess, this is slow. Much slower than a mirror, stripe, single drive, or striped mirrors.
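The steps above can be checked in a few lines of Python - the XOR shortcut in step 3 produces exactly the same parity as recomputing it from the full stripe, which is what lets md avoid reading the other data disks:

```python
import os

def xor(a, b):
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

# A toy stripe position: three 4K data blocks plus their parity.
d = [os.urandom(4096) for _ in range(3)]
parity = xor(xor(d[0], d[1]), d[2])

# Read-modify-write of d[1], per steps 1-5 above.
new_d1 = os.urandom(4096)
new_parity = xor(xor(parity, d[1]), new_d1)  # old P ^ old data ^ new data

# The shortcut matches a full recompute from the updated stripe.
assert new_parity == xor(xor(d[0], new_d1), d[2])
```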

In my example, we have chunk=256, so each chunk of the stripe is 256K. Because parity is positional, md doesn’t have to rewrite the whole 256K chunk to update a 4K block - it reads the matching 4K of old data and 4K of old parity, then writes 4K of new data and 4K of new parity. That’s still terrible: every 4K logical write costs four physical I/Os (two reads, then two writes), with the reads serialized ahead of the writes.

mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=256 --bitmap=none /dev/sdc /dev/sdd /dev/sde /dev/sdf

Contrast the random write performance, even with a high queue depth and a few threads, with what a RAID 10 array can put out: RAID 10 delivers a 4.7x (276K vs. 58.3K) improvement in 4K random write IOPS on these four drives!

This is even worse on spinning rust - but that’s a topic for another post. I have two disk shelves full of 2.5" spinners waiting for that :)

4-wide RAID 6

Recall that RAID 6 has the same characteristics as RAID 5, but with an extra parity block (storing the Q syndrome).

This results in higher overhead for writes (though this system is so fast that it isn’t particularly apparent) and even greater write amplification on random I/O (though the extra I/Os land on different disks in parallel, so we don’t see much degradation in these simple tests).

A RAID 6 “small block write” goes something like:

  1. Read old data block
  2. Read the old P parity block
  3. Read the old Q parity block
  4. Calculate new P and Q
  5. Write new data block
  6. Write new P parity block
  7. Write new Q parity block

Again, as the three reads (and three writes) land on three different disks and proceed in parallel, these tests aren’t sufficient to properly demonstrate the impact of RAID 6 on random I/O performance. These PM5s are good enough to make the difference between RAID 5 and RAID 6 into noise in this relatively low-load scenario.
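For the curious, here’s a sketch of the Q-syndrome math in GF(2^8), using the 0x11d polynomial the kernel’s RAID 6 code uses. This is a two-data-disk toy, not md’s implementation - the point is that Q, like P, can be updated from just the delta of the changed block, so a small write never touches the other data disks:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) with the 0x11d reduction polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
    return r

# One byte position across a toy stripe with two data disks d0, d1:
# P = d0 ^ d1, and Q = g^0*d0 ^ g^1*d1 with generator g = 2.
d0, d1 = 0x5A, 0xC3
P = d0 ^ d1
Q = d0 ^ gf_mul(2, d1)

# "Small write" to d1: both syndromes update from the delta alone.
new_d1 = 0x99
delta = d1 ^ new_d1
P_new = P ^ delta
Q_new = Q ^ gf_mul(2, delta)

assert P_new == d0 ^ new_d1               # matches full recompute of P
assert Q_new == d0 ^ gf_mul(2, new_d1)    # matches full recompute of Q
```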

That adds up to six physical I/Os - three 4K reads and three 4K writes - for every 4K logical write. Awful!

mdadm --create /dev/md1 --level=6 --raid-devices=4 --chunk=256 --bitmap=none /dev/sdg /dev/sdh /dev/sdi /dev/sdj

Note that this array and the RAID 5 array were built on two different sets of four PM5s (I built four arrays at once, then let them sync simultaneously). I suspect the drives backing the RAID 6 array simply had particularly good random I/O performance - RAID 6 shouldn’t outperform RAID 5 in random I/O.

Summary table

Overall numbers

| Test | Single drive | 4-drive RAID 0 | 4-drive RAID 1 | 4-drive RAID 10 | 4-drive RAID 5 | 4-drive RAID 6 |
|---|---|---|---|---|---|---|
| Read - SEQ1MQ8T8 (MB/s) | 1112 | 4442 | 4433 | 4136 | 4438 | 4438 |
| Read - SEQ1MQ8T1 (MB/s) | 1110 | 4429 | 4006 | 4112 | 4436 | 4445 |
| Write - SEQ1MQ8T8 (MB/s) | 1123 | 4211 | 910 | 1648 | 985 | 951 |
| Write - SEQ1MQ8T1 (MB/s) | 1123 | 4199 | 1027 | 2098 | 1327 | 947 |
| Read - RND4KQ32T8 IOPS (MB/s) | 180K (739) | 663K (2590) | 713K (2786) | 709K (2902) | 651K (2668) | 644K (2639) |
| Read - RND4KQ32T1 IOPS (MB/s) | 180K (738) | 217K (890) | 202K (828) | 201K (825) | 217K (888) | 214K (875) |
| Read - RND4KQ1T1 IOPS (MB/s) | 17.3K (70.9) | 7.90K (32.4) | 7.60K (31.1) | 7.29K (29.8) | 8.70K (35.7) | 8.91K (36.4) |
| Write - RND4KQ32T8 IOPS (MB/s) | 169K (693) | 613K (2513) | 154K (632) | 276K (1130) | 58.3K (239) | 65.4K (268) |
| Write - RND4KQ32T1 IOPS (MB/s) | 173K (708) | 225K (922) | 91.7K (375) | 143K (586) | 68.5K (280) | 69.7K (285) |
| Write - RND4KQ1T1 IOPS (MB/s) | 29.5K (121) | 29.9K (122) | 24.5K (100) | 28.8K (118) | 5.54K (22.7) | 8.24K (33.7) |

RND4KQ32T8 write

Note the dramatic difference between RAID 5 and RAID 0 - RAID 5 manages a measly 10% of RAID 0’s random write performance! Ouch!

| Array | IOPS (thousands) | vs single disk | vs RAID 0 | vs RAID 10 |
|---|---|---|---|---|
| Single disk | 169 | 100% | 28% | 61% |
| RAID 0 | 613 | 363% | 100% | 222% |
| RAID 1 | 154 | 91% | 25% | 56% |
| RAID 10 | 276 | 163% | 45% | 100% |
| RAID 5 | 58.3 | 34% | 10% | 21% |
| RAID 6 | 65.4 | 39% | 11% | 24% |

SEQ1MQ8T8 write

Note how RAID 1, 5, and 6 are limited to approximately the throughput of a single disk when writing.

| Array | MB/s | vs single disk | vs RAID 0 | vs RAID 10 |
|---|---|---|---|---|
| Single disk | 1123 | 100% | 27% | 68% |
| RAID 0 | 4211 | 375% | 100% | 256% |
| RAID 1 | 910 | 81% | 22% | 55% |
| RAID 10 | 1648 | 147% | 39% | 100% |
| RAID 5 | 985 | 88% | 23% | 60% |
| RAID 6 | 951 | 85% | 23% | 58% |