June 26th, 2009

07:19 pm - More ZFS performance data
Update to the prior ZFS post: I did some more benchmarks. The newer set was performed with one 5MB burst of data written to a random location on the target drive each second, which is probably a better model for most real-world conditions.

Methodology: /data/5M and /data/1M each hold 5GB of data in randomly-ordered chunks of 5MB and 1MB, respectively. /data/zero.bin is a contiguous 8GB file. A process writes a 5MB burst to a random location in /data/zero.bin once per second; other processes read chunks from /data/1M or /data/5M as appropriate (and as fast as possible) until the entire 5GB dataset has been read.
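For concreteness, here's one way the write-burst process could be scripted. The paths and sizes come from the post, but the script itself is my reconstruction, not the original tool - in particular the 1MB-aligned offsets are a simplification of "a random location":

```shell
#!/bin/sh
# One 5MB burst of random data at a random (1MB-aligned) offset in the
# target file. This is a sketch, not the author's actual script.

burst() {
    target=$1; file_mb=$2; burst_mb=$3
    # pick a random offset that leaves room for the whole burst
    offset=$(( $(od -An -N4 -tu4 /dev/urandom) % (file_mb - burst_mb) ))
    # conv=notrunc overwrites in place without truncating the file
    dd if=/dev/urandom of="$target" bs=1M count="$burst_mb" \
       seek="$offset" conv=notrunc 2>/dev/null
}

# One burst per second against the 8GB test file:
# while true; do burst /data/zero.bin 8192 5; sleep 1; done
```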

In between runs, 5GB of data is read from another file in /home/myname (or /export/home/myname, in the case of OpenSolaris) in order to flush as much of the filesystem cache out as possible. It's worth noting that, as far as I can tell, at least some of the OSes tested have some protection against a single file dominating the entire available cache in RAM - so differences in cache algorithm WILL have had some effect on these results! I could not find any programmatic way to dump the filesystem cache, so the big copy operation from another drive was the best I could come up with.
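The flush-by-big-read step might look like this; the filename is a placeholder, not the author's actual file. (For what it's worth, Linux kernels from 2.6.16 on do expose a cache-drop knob, though nothing portable existed across all three OSes tested.)

```shell
#!/bin/sh
# Crude cache flush: stream a big unrelated file through the page cache
# so previously cached benchmark data gets evicted.

flush_cache() {
    dd if="$1" of=/dev/null bs=1M 2>/dev/null
}

# flush_cache /home/myname/bigfile    # ~5GB read between runs
#
# Linux-only alternative (2.6.16+), run as root:
# sync && echo 3 > /proc/sys/vm/drop_caches
```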

The methodology is certainly not perfect, but I feel it probably models a lot of real-world conditions reasonably well.

The test machine is an Athlon64 3500+ with 2GB of DDR2 SDRAM, on a motherboard using the nVidia SATA chipset - which means no NCQ (Native Command Queueing) support for the SATA drives. Operating system is installed on a Western Digital 250GB drive. Data drives are five Seagate ST3750640NS SATA-II 750GB drives. NOTE: these Seagates are SLOW PIGS, which is why these performance numbers are so low - and part of why I included baseline single-drive performance numbers!

Operating systems tested were FreeBSD 7.2-RELEASE amd64, Ubuntu Server 8.04 LTS amd64, and OpenSolaris 2009.06.

There were a couple of odd quirks that you can't see in the graphs - OpenSolaris hangs onto writes for a WHILE before committing them; the OpenSolaris read/write numbers are a little misleading because the array would keep chattering with late write commits for quite some time after each benchmark run was nominally complete. Since ZFS writes atomically, and most real-world loads aren't likely to saturate the array 24/7, I decided to leave the numbers as-is - that's the performance you'd actually experience using the array, after all, since a read of any of those uncommitted blocks would still return the new data from memory rather than the old data from the spindles.

Also, whereas for Linux and FreeBSD I would first write 5GB of data from /dev/urandom to /home/myname, and then parcel that out in chunks to /data/5M and /data/1M, on OpenSolaris I ended up actually doing the writes directly from /dev/urandom instead... because I discovered the hard way that if you read-saturate from /export/home, the entire damn system becomes just about completely unusable until you quit. And it was actually SLOWER reading from /export/home than from /dev/urandom directly!
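The data-generation step described above could be sketched like so; the helper function and chunk naming are my invention, not the author's script:

```shell
#!/bin/sh
# Generate a dataset of fixed-size random chunks directly from
# /dev/urandom (the OpenSolaris approach from the post).

make_chunks() {
    dir=$1; chunk_mb=$2; count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        dd if=/dev/urandom of="$dir/chunk.$i" \
           bs=1M count="$chunk_mb" 2>/dev/null
        i=$((i + 1))
    done
}

# make_chunks /data/5M 5 1000   # 5GB as 1000 x 5MB chunks
# make_chunks /data/1M 1 5000   # 5GB as 5000 x 1MB chunks
```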

I thought it was pretty interesting that FreeBSD's ZFS implementation and OpenSolaris' performed so differently - one better at larger reads, one better at smaller reads. It's probably worth noting that FreeBSD 7.2-RELEASE uses ZFS v6, whereas OpenSolaris 2009.06 uses a much later version - v13 or so, I think, although I couldn't find a definitive answer anywhere. It's probably also significant that OpenSolaris was the only OS running a big heavy desktop - neither FreeBSD nor Ubuntu was running X, but OpenSolaris was burdened down under X, Gnome, and all kinds of desktop-ish crap. That was probably a pretty big limitation on a 2GB machine.

There's not much doubt that, at least on this scale of hardware, mdraid5 is the performance king - but ZFS has a hell of a lot to offer in exchange for what performance you might give up. Anybody who's ever spent 10+ hours doing an offline fsck of a large raid array is probably quite willing to sacrifice a little bit of read performance in exchange for a promise to NEVER have to do another offline fsck again. The availability of live compressed volumes, copy-on-write, and other features - and how brain-dead easy they are to implement from the admin's perspective - is also pretty compelling.




From: discogravy
Date: June 27th, 2009 - 02:47 am
Solaris does the delayed-write thing to speed up write performance and to help avoid disk thrashing (although in most modern implementations I think it's kinda pointless). There's a way to change that via the Solaris equivalent of sysctl, but I have forgotten it.
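On Solaris the knob the commenter is thinking of lives in /etc/system rather than sysctl; if memory serves, the relevant ZFS tunable is the transaction-group commit interval, though the exact variable name varied between builds, so treat this as a sketch to verify against your own release:

```shell
# /etc/system fragment (tunable name from memory -- verify on your build):
# set zfs:zfs_txg_timeout = 5

# Inspect the live value of a kernel variable with mdb:
# echo "zfs_txg_timeout/D" | mdb -k
```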

Presumably ZFS can be tuned like other filesystems for large or small write blocks; I'm pretty sure what you're seeing across the implementations is just this, and that if you created the ZFS filesystems with the same block size you'd get closer parity.
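Assuming the block-size knob in question is ZFS's per-filesystem recordsize property (128K by default), the tuning looks like this; the pool and filesystem names are placeholders:

```shell
# Set at creation time (recordsize only affects files written afterwards):
# zfs create -o recordsize=128K tank/data

# Or change an existing filesystem, then confirm:
# zfs set recordsize=8K tank/data
# zfs get recordsize tank/data
```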

Aside: you can kill the X/desktop in Solaris by going into init 3, just like in old-timey linux.

I would recommend that you put this up in a pretty webpage (and maybe a PDF) and get some links/discussion from some of the nerd news sites -- you can't be the only dork with serious interest in this (and in avoiding offline fsck'ing).
