Cross-posted from my corporate blog.
Like every storage array, Data ONTAP tries to make the latency of the average read and write operation lower than the actual latency of the underlying disk drive.
How Data ONTAP achieves this, though, is quite different from Real FiberChannel systems.
The Latency Challenge
If you’re a storage array, life looks like a stream of random read and write operations that you have to respond to immediately. The temptation is to take each request and go straight to disk to service it as fast as you possibly can.
But that’s kind of stupid.
When you look at read operations you realize that they tend to cluster around the same pieces of data. So although an application may be asking for one 4k block of data right now, the next piece of data it asks for will probably sit sequentially right after that block, or will in fact be the very same 4k block again! Given that, it makes sense to do two things:
- Keep the 4k around in memory
- Get more than 4k when you go to disk so that the next request is serviced from memory and not disk.
In effect, by using memory you reduce what would have been three disk reads (the original request, the re-read of the same block, and the sequential follow-on) to one actual disk IO.
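To make that concrete, here’s a toy sketch in Python of a block cache with read-ahead. None of this is Data ONTAP code; BlockCache, FakeDisk, and the 8-block read-ahead size are all illustrative assumptions of mine.

```python
# A toy block cache with read-ahead, just to make the idea concrete.
# None of this is Data ONTAP code; the names and the read-ahead size
# are illustrative assumptions.

BLOCK_SIZE = 4096        # the 4k blocks from the example above
READ_AHEAD_BLOCKS = 8    # blocks fetched per disk IO (an assumption)

class FakeDisk:
    def read_blocks(self, start, count):
        return bytes(BLOCK_SIZE * count)   # pretend platter

class BlockCache:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}       # block number -> data
        self.disk_ios = 0     # actual trips to disk

    def read(self, block_no):
        if block_no not in self.cache:
            # One disk IO fetches the requested block plus its
            # sequential neighbors, which will probably be asked
            # for next.
            self.disk_ios += 1
            data = self.disk.read_blocks(block_no, READ_AHEAD_BLOCKS)
            for i in range(READ_AHEAD_BLOCKS):
                chunk = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
                self.cache[block_no + i] = chunk
        return self.cache[block_no]

cache = BlockCache(FakeDisk())
cache.read(100); cache.read(100); cache.read(101)   # three requests...
print(cache.disk_ios)                               # ...one actual disk IO
```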
Write operations are a little more complicated. Obviously, when you acknowledge a write you’re making a promise to the application that the data is committed to stable storage. Superficially, this would imply that every write operation has to go to disk, since that’s the only non-volatile storage available.
In this picture each box represents a distinct write operation. The letters designate the order in which the writes were received. If the storage array wrote the data to disk as it arrived, it would perform 8 IOs, and the latency of each IO would be the latency of the disk drive itself.
Enter battery-backed memory…
Except it turns out that we can turn DRAM into non-volatile memory if we’re willing to keep it powered!
What that means is that once the data is in memory, and the storage array is sure that the data will eventually make it to disk, the write can be acknowledged. And once the write operations are sitting in memory, it’s possible to schedule them so that they go to disk sequentially.
In this picture the order in which the write operations are committed to disk is a->b->h->i->e->f->c->d->g, even though the order in which the application thinks they were committed is a->b->c->d->e->f->g->h->i. Because the array re-orders the write operations, they can be written in a single IO, effectively one sequential write operation.
The challenge is that you need to buffer enough write operations to make this efficient.
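Here’s a rough sketch of that idea: a buffer standing in for the battery-backed memory, which acknowledges writes immediately and, on flush, sorts them by their fixed on-disk addresses so the head makes one sweep. The class names and the flush policy are my own simplifications, not any array’s firmware.

```python
# A toy stand-in for battery-backed NVRAM. Writes are acknowledged the
# moment they land in the buffer; a flush sorts them by their fixed
# on-disk addresses so the head sweeps once instead of seeking in
# arrival order.

class SeekyDisk:
    """Records the order in which writes reach the platter."""
    def __init__(self):
        self.write_order = []

    def write_block(self, address, data):
        self.write_order.append(address)

class WriteBuffer:
    def __init__(self, disk):
        self.disk = disk
        self.pending = {}   # target disk address -> data

    def write(self, address, data):
        # Data in (battery-backed) memory counts as stable storage,
        # so we can acknowledge without waiting on the disk.
        self.pending[address] = data
        return "ack"

    def flush(self):
        # Sort by fixed target address: one sequential pass.
        for address in sorted(self.pending):
            self.disk.write_block(address, self.pending[address])
        self.pending.clear()

disk = SeekyDisk()
buf = WriteBuffer(disk)
for address, block in [(0, "a"), (1, "b"), (7, "c"), (8, "d"), (4, "e")]:
    buf.write(address, block)      # each one acknowledged immediately
buf.flush()
print(disk.write_order)            # [0, 1, 4, 7, 8] -- one sweep, not five seeks
```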
But WAFL, of course, is different
What I just described is how Real FiberChannel works. Effectively, the locations of the blocks (a, b, c, etc.) are fixed on disk; all the storage array can do is re-order when it writes them to those fixed locations.
What WAFL does instead is decide where the data will go at the moment the data is ready to be written. So, using the same picture:
Rather than trying to put each block (a, b, c, d, etc.) in some pre-arranged location, WAFL finds the first bit of available space and writes the data sequentially. Like Real FiberChannel, Better-than-Real-FiberChannel transforms the random write operations into sequential operations, reducing the total number of IOPS required.
Now what’s interesting is that WAFL doesn’t require much memory to buffer the writes. And what’s even more interesting is that, because the fixed overhead of flushing data to disk is negligible, there is no real value in holding onto the data in memory for very long.
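A toy version of the write-anywhere idea might look like this: instead of honoring pre-assigned disk locations, every buffered block lands at the current free-space frontier, and a block map remembers where each logical block went. Again, the names and structure here are mine, not WAFL’s actual allocator.

```python
# A toy "write anywhere" layout (not WAFL's actual allocator).
# Every buffered block lands at the free-space frontier, and a block
# map remembers where each logical block now lives.

class WriteAnywhereLayout:
    def __init__(self):
        self.frontier = 0       # next free physical block address
        self.block_map = {}     # logical block -> physical address
        self.disk_log = []      # stands in for the physical medium

    def flush(self, pending):
        # pending maps logical blocks to data, in any arrival order.
        # The whole batch goes out as one contiguous, sequential run
        # starting at the frontier; only the map changes.
        for logical, data in pending.items():
            self.block_map[logical] = self.frontier
            self.disk_log.append(data)
            self.frontier += 1

layout = WriteAnywhereLayout()
layout.flush({"c": b"..", "a": b"..", "h": b".."})   # arrival order irrelevant
print(layout.block_map)   # {'c': 0, 'a': 1, 'h': 2} -- sequential homes
```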
Read and Write and IOPS
If you’ve been following this story, you’ll have figured out something very interesting: a write operation can be deferred, but a read operation must go to disk. In other words, the array can choose when it goes to disk to commit a write; but if the data is not in memory, the array must go right then and there to get it from disk!
So what makes a disk behave in a random fashion? Why, the read operations, because the writes are always done sequentially!
So why do you need a lot of IOPS? Not to service the writes, because those are being written sequentially, but to service the read operations. The more read operations, the more IOPS. In fact, if you are doing truly sequential write operations, then you don’t need that many IOPS…
But it’s read operations to DISK, not read operations in general!
Aha! Now I get it…
The reason the PAM card is only a read cache is that adding more memory for writes doesn’t improve WAFL write performance… we’re already writing data to disk sequentially, as fast as it can be written.
But adding the PAM card absorbs more read operations, which reduces the number of IOPS the storage system requires. And the best way to cash in on needing fewer IOPS is to buy both fewer and slower disks!
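If you want to see the arithmetic behind that claim, here’s a back-of-envelope sketch. The request rate and hit rates below are entirely made-up numbers, not NetApp benchmarks; the point is just the shape of the math.

```python
# Back-of-envelope arithmetic with made-up numbers. Since the writes
# are already sequential, the spindles mostly have to absorb read
# misses; a bigger read cache raises the hit rate and cuts the disk
# IOPS you have to buy.

read_ops_per_sec = 10_000

for label, hit_rate in [("without PAM", 0.70), ("with PAM", 0.90)]:
    disk_iops = read_ops_per_sec * (1 - hit_rate)
    print(f"{label}: {disk_iops:.0f} read IOPS actually reach the disks")

# without PAM: 3000 read IOPS actually reach the disks
# with PAM: 1000 read IOPS actually reach the disks
# Fewer required IOPS is what lets you buy fewer, slower disks.
```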