Category Archives: netapp

Time to invent a file system

I was at NetApp in 2006, and I remember the launch of Isilon with horror.

Isilon’s product was everything Clustered ONTAP – aka GX – wanted to be.

One of the most intriguing aspects of the product was the use of Reed-Solomon codes to cut the amount of storage required. The downside, of course, was that rebuild was a bitch. The rebuild was so painful that, although the tech was interesting, our most senior architects were dismissive of its value.

They believed that the clustered storage solution and a clustered file system would deliver superior availability with better cost and faster rebuilds. Or something like that; I must admit that I have forgotten the details of the debates and don’t feel like dredging them all up.

The market failure of Reed-Solomon codes more or less convinced me that the right answer for the foreseeable future was paying 2x the storage cost.

And then I read this:

http://storagemojo.com/2013/06/21/facebooks-advanced-erasure-codes/

That is a nice summary of this paper: http://anrg.usc.edu/~maheswaran/Xorbas.pdf

This is a huge result. What it suggests is that storage availability no longer has to be tied to 2x the storage infrastructure, and that you can get there without taking an unacceptable hit on recovery.
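
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. The parameters are my own reading of the Xorbas paper (a (10,4) Reed-Solomon code plus local parities, giving roughly 1.6x overhead and a much smaller repair fan-in); treat the numbers as illustrative assumptions, not a specification of any product.

```python
# Rough comparison of storage overhead vs. single-block repair cost for
# three schemes. Parameters are illustrative, loosely based on the Xorbas
# paper; treat them as assumptions, not a specification of any product.

def overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data."""
    return (data_blocks + parity_blocks) / data_blocks

schemes = {
    # 3-way replication: every block stored three times; repair reads 1 block.
    "3x replication":      {"overhead": 3.0, "repair_reads": 1},
    # (10,4) Reed-Solomon: 10 data + 4 parity; rebuilding one lost block
    # requires reading 10 surviving blocks.
    "RS(10,4)":            {"overhead": overhead(10, 4), "repair_reads": 10},
    # Locally repairable code: the same RS code plus 2 local parities; a
    # single lost data block can be rebuilt from ~5 blocks in its local group.
    "LRC(10,6,5), Xorbas": {"overhead": overhead(10, 6), "repair_reads": 5},
}

block_mb = 256  # assumed block size

for name, s in schemes.items():
    print(f"{name:22s} overhead {s['overhead']:.2f}x, "
          f"single-block repair reads ~{s['repair_reads'] * block_mb} MB")
```

The point is the pairing: Reed-Solomon wins on raw overhead but a single-block repair has to touch ten surviving blocks, while the locally repairable code buys back most of that repair cost for a modest amount of extra parity.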

A new file system that embraces this kind of encoding could be a good solution for a large class of applications that don’t need the RTO of 2x the storage. Making storage cheaper has always been a winning strategy for growing market share.

A new clustered file system built around this kind of erasure code or even a variety of erasure codes could be a significant new addition to the tech eco-system.

I wonder if something built ground up would look very different from adapting an existing system.

 

Software Defined Storage – laying claim to being a visionary ;-)

After the recent valuations associated with Software Defined Networking startups, storage companies have decided to get on the bandwagon.

Proving that there is nothing new under the sun, I wanted to lay claim to having been a visionary in the space 😉

And for the record, much of this work would not have been possible without an extremely talented set of senior architects, in particular Jim Voll.

In 2003 Steve Kleiman, then CTO of NetApp, brilliantly noted that storage was a physical system that was going to turn into a software system. And that managing the software was going to be the problem in the virtualized data center.

He was right. And I spent 4 years trying to understand his insight. In 2008, shortly before I left NetApp, I got it…

But then I decided to go work on games.

Because a claim without proof is just a claim, let me refer to two papers I wrote.

The first describes a way to do software defined storage for the problem of managing storage and data replication:

https://www.usenix.org/conference/lisa-07/policy-driven-management-data-sets

The second describes the problem that increasing software virtualization of storage was going to create and the need for a new paradigm for management.

http://delivery.acm.org/10.1145/1320000/1317404/p38-roussos.pdf

NetApp then delivered a product, Provisioning Manager, which implemented a lot of these ideas.

In both of these articles, I called for re-thinking storage management from a hardware system to a software system and proposed an approach that would allow humans to manage the complexity.

Nice to see the world catching up 🙂

WAFL Performance: Making writes go faster with fewer IOPS

Cross-posted from my corporate blog.

Like every storage array, Data ONTAP attempts to make the latency of the average read and write operation lower than the actual latency of the underlying disk drives.

Unlike Real FiberChannel systems, how Data ONTAP achieves this is quite unique.

The Latency Challenge

If you’re a storage array, life looks like a stream of random read and write operations that you have to respond to immediately. The temptation is to take each request and immediately go to disk to service it as fast as you possibly can.

But that’s kind-a-stupid.

When you look at read operations you realize that they tend to be clustered around the same piece of data. So although an application may be asking for one 4k block of data, the reality is that the next piece of data it will ask for is probably sequentially next to the first 4k block of data, or is in fact the same 4k block of data!  Given that, it makes sense to do two things:

  1. Keep the 4k around in memory
  2. Get more than 4k when you go to disk so that the next request is serviced from memory and not disk.

In effect, by using memory you can reduce what would have been three disk IOs into one actual disk IO.
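
To make that concrete, here is a toy sketch of the idea; it is my own illustration of read caching plus read-ahead, not how Data ONTAP’s buffer cache is actually implemented, and the block size and read-ahead window are assumptions.

```python
# Toy block cache with read-ahead: on a miss we fetch a run of adjacent 4k
# blocks in one disk IO, so a repeated or sequential request hits memory.
# Illustration of the idea only, not how Data ONTAP is implemented.

BLOCK_SIZE = 4096
READ_AHEAD = 8          # blocks fetched per disk IO (an assumed window)

class ReadCache:
    def __init__(self, read_from_disk):
        self.cache = {}                      # block number -> bytes
        self.read_from_disk = read_from_disk
        self.disk_ios = 0

    def read(self, block_no):
        if block_no not in self.cache:
            # One disk IO brings in READ_AHEAD blocks, not just the one asked for.
            self.disk_ios += 1
            data = self.read_from_disk(block_no, READ_AHEAD)
            for i in range(READ_AHEAD):
                self.cache[block_no + i] = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        return self.cache[block_no]

# Three application reads (a block, the same block again, then the next one)
# turn into a single disk IO.
fake_disk = lambda start, count: bytes(BLOCK_SIZE * count)
cache = ReadCache(fake_disk)
for block in (100, 100, 101):
    cache.read(block)
print("application reads: 3, disk IOs:", cache.disk_ios)   # -> 1
```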

Write operations are a little bit more complicated. Obviously when you acknowledge a write you’re making a promise to the application that the data is committed to stable storage. Superficially this would imply that every write operation has to go to disk since that’s the only available non-volatile storage.

[Figure: write operations arriving at the array, each box labeled with a letter indicating arrival order]

In this picture each box represents a distinct write operation. The letters designate the order in which the operations were received. If the storage array were to write the data to disk as it arrived, it would perform a separate IO for every write. Furthermore, the latency of each IO would be the latency of the disk drive itself.

Enter battery backed memory…

Except it turns out that we can turn DRAM into non-volatile memory if we’re willing to keep it powered!

What that means is that once the data is in memory, and the storage array is sure that the data will eventually make it to disk, the write can be acknowledged. Now that the write operations are in memory, it’s possible to schedule them so that they are written sequentially.

[Figure: the same write operations buffered in memory and reordered for a sequential commit to disk]

In this picture the order in which the write operations are committed to disk is a->b->h->i->e->f->c->d->g even though the order that the application thinks they were committed to disk is a->b->c->d->e->f->g->h->i. Because the array re-orders the write operations, they can be written in one single IO, effectively a single sequential write operation.

The challenge is that you need to buffer enough write operations to make this efficient.
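
Here is a toy sketch of that write path. It is my own illustration rather than ONTAP internals: writes are acknowledged as soon as they land in the battery-backed buffer, and a later flush commits them in address order as one sequential sweep. The block addresses are made up, chosen so that sorting them happens to reproduce the commit order from the picture above.

```python
# Toy version of the write path described above. Writes are acknowledged once
# they are in (battery-backed) memory; a later flush commits them in address
# order, turning a random stream into one sequential sweep. Illustration only,
# not ONTAP internals.

class FakeDisk:
    def __init__(self):
        self.blocks = {}
        self.sweeps = 0

    def write_sequential(self, ordered_writes):
        # One large sequential IO covering all buffered blocks.
        for addr, data in ordered_writes:
            self.blocks[addr] = data
        self.sweeps += 1

class WriteBuffer:
    def __init__(self, disk):
        self.disk = disk
        self.pending = {}          # target block address -> data

    def write(self, addr, data):
        # Data is now in "non-volatile" memory, so the write can be
        # acknowledged to the application before touching a disk platter.
        self.pending[addr] = data

    def flush(self):
        # Commit the buffered writes in address order: one sequential pass
        # instead of one seek per write.
        self.disk.write_sequential(sorted(self.pending.items()))
        self.pending.clear()

disk = FakeDisk()
buf = WriteBuffer(disk)
# Writes a..i arrive in order but target scattered addresses. The addresses
# are invented so that sorting them reproduces the commit order from the
# picture: a->b->h->i->e->f->c->d->g.
for label, addr in zip("abcdefghi", (40, 41, 90, 91, 70, 71, 95, 42, 43)):
    buf.write(addr, label.encode())
buf.flush()
print("writes acknowledged: 9, sequential disk sweeps:", disk.sweeps)  # -> 1
```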

But WAFL, of course, is different

What I just described is how Real FiberChannel works. Effectively the locations of the blocks (a, b, c, etc.) are fixed on disk. All the storage array can do is re-order when it writes them to their fixed locations.

What WAFL does is determine where the data has to go when the data is ready to be written. So using the same picture:

[Figure: the same write operations written sequentially into the first available free space]

Rather than trying to put each block (a, b, c, d, etc.) in some pre-arranged location, WAFL finds the first bit of available space and writes the data sequentially. Like Real FiberChannel, Better than Real FiberChannel will transform the random write operations into sequential operations, reducing the total number of IOPS required.

Now what’s interesting is that WAFL doesn’t require that much memory to buffer the writes. And what’s even more interesting is that, because the fixed overhead of flushing data to disk is negligible, there is no real value in holding onto the data in memory for very long.
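
A crude sketch of the write-anywhere idea follows; it is my own simplification, not WAFL’s actual allocator. Dirty blocks are not rewritten in place: they are laid down contiguously in the next free region and a block map is updated so reads can still find them.

```python
# Crude "write anywhere" sketch: dirty blocks are not rewritten in place; they
# are laid down contiguously in the next free region and a block map is updated
# to point at the new locations. A simplification of the idea, not WAFL's
# actual allocator.

class WriteAnywhereFS:
    def __init__(self, disk_blocks):
        self.disk = [None] * disk_blocks
        self.block_map = {}       # logical block number -> physical block number
        self.next_free = 0        # head of free space (simplified, log-style)

    def flush(self, dirty_blocks):
        """Write all dirty logical blocks contiguously; return (start, length)."""
        start = self.next_free
        for logical, data in dirty_blocks.items():
            physical = self.next_free
            self.disk[physical] = data
            self.block_map[logical] = physical    # the old copy is now garbage
            self.next_free += 1
        return start, len(dirty_blocks)

    def read(self, logical):
        return self.disk[self.block_map[logical]]

fs = WriteAnywhereFS(disk_blocks=64)
start, count = fs.flush({10: b"a", 3: b"b", 57: b"c", 20: b"d"})
# Four logically scattered writes became one contiguous run on disk.
print(f"wrote {count} blocks contiguously starting at physical block {start}")
```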

Read and Write and IOPS

If you’ve been following this story, you’ll have figured out something very interesting: a write operation can be deferred but a read operation must go to disk. In other words, the array can choose when it goes to disk to commit a write, but if the data is not in memory, the array must go right then and there to get the data from disk!

So what makes a disk behave in a random fashion? Why, the read operations, because the writes are always done sequentially!

So why do you need a lot of IOPS? Not to service the writes, because those are being written sequentially, but to service the read operations. The more read operations the more IOPS. In fact if you are doing truly sequential write operations, then you don’t need that many IOPS ….

But it’s read operations to DISK not read operations in general!

Aha! Now I get it…

The reason the PAM card is only a read cache is that adding more memory for writes doesn’t improve WAFL write performance… We’re writing as fast as we can write data sequentially to disk.

But adding the PAM card absorbs more read operations, which reduces the number of IOPS that the storage system requires. And the best way to take advantage of a reduced number of IOPS is to require both fewer and slower disks!
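
The sizing arithmetic this implies looks roughly like the sketch below. Every number in it is made up purely to illustrate the argument: the disks only have to absorb the read misses plus a comparatively cheap, coalesced sequential write load, so a bigger read cache directly shrinks the spindle count.

```python
# Back-of-the-envelope spindle sizing: the disks must absorb the read *misses*
# plus a (cheap, coalesced, sequential) write load. Every number here is made
# up purely to illustrate the argument.

def spindles_needed(read_iops, write_iops, cache_hit_rate,
                    random_iops_per_disk=150, write_coalesce_factor=10):
    read_misses = read_iops * (1 - cache_hit_rate)       # these must go to disk
    seq_writes = write_iops / write_coalesce_factor      # sequential, coalesced
    return (read_misses + seq_writes) / random_iops_per_disk

for hit_rate in (0.50, 0.80, 0.95):
    disks = spindles_needed(read_iops=20_000, write_iops=10_000,
                            cache_hit_rate=hit_rate)
    print(f"read cache hit rate {hit_rate:.0%}: ~{disks:.0f} disks")
```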


Where do NetApp’s hard technical problems come from?

In an earlier post I talked about the nature of NetApp’s hard problems, and I claimed that there were three factors:

  1. A basic technology that is incomplete
  2. A customer base willing to trade off features
  3. A customer base willing to pay for those features

In this post I’ll try to give some detail about items 1 and 2.

For NetApp, the basic technologies that have been driving our innovation (which is the fancy word for the set of hard problems we’ve solved) have been and continue to be networks, storage media and commodity computing.

Back in the day when NetApp was founded, the traditional computing system consisted of a CPU, RAM, some input and output devices and some form of stable storage. This form of computing is still how desktop PCs and laptops are built. However, in the data center traditional computing systems have changed dramatically.

As an aside, data center is a term used to describe the set of computers that are not used for personal computing but are a shared computing resource across a company or institution. Normally we associate the term data center with the enterprise, but really any company that has a shared computing resource (such as email or file serving or print serving) has a data center, and this discussion applies to them as well as to the Fortune 500.

What caused that change was networking speeds, and commodity computing.

The traditional computer system made a lot of sense because of the ratios of performance between the components. Every normal application assumes that RAM has fast, uniform access and that storage has predictably slow access. The performance of the application is a function of the speed of the CPU and the speed with which you can get data to and from RAM and to and from stable storage. Now it turns out that RAM and stable storage are much slower than CPUs. Caching and clever algorithms are used to improve the performance of applications by trying to hide the latency of both RAM and stable storage. For storage, I’ll just state that those algorithms lived in the VM, file system, volume manager and RAID subsystem.

Now it turns out that the algorithms that were used to improve disk performance were executing on the same CPU that the application itself was running on. Worse, the storage sub-system was competing with the application for the far more scarce resource of memory bandwidth. As the application demands for more CPU and memory bandwidth increased, the CPU cycles that were being consumed by the storage system were critically looked at, and reasonable people started to ask whether the storage system really did require so many CPU cycles. In fact, some folks actually believed that the existence of general purpose storage sub-systems was the source of the performance problem. They therefore argued for eliminating all of those clever file systems and replacing them with custom per-application solutions. The problem with that approach was that no one wanted to write their own file system, volume manager and RAID subsystem.

In software computer science, every problem can be solved with a layer of indirection. In hardware computer science, every performance problem can be solved with a dedicated computing element.

The computer industry (and the founders of NetApp in particular) observed that there was a layer of indirection in UNIX between the storage sub-system and the rest of the computing system, and that was the VFS layer and the NFS client. They also observed that because Ethernet network speeds were increasing, the storage subsystem could be moved onto its own dedicated computing element. In effect, the speed of the network was not an issue when it came to the predictability or slowness of the storage. Or more precisely, by moving the storage sub-system out of the main computer they could use more computing and memory resources to compensate for any increased latency caused by the stable storage no longer being directly attached to the local shared bus. In fact, in the 1990s NetApp used to remark that our storage systems were faster than local disks. They further observed that the trends in commodity CPUs allowed them to build their dedicated computing element out of commodity parts, which made it cost effective to build such a computing element. Designing your own ASIC is absurdly expensive.

Now it also turned out that putting the data on the network had some additional benefits beyond just performance. But it was those networking and CPU and disk drive technology trends that enabled the independence of storage subsystems.

It’s almost too obvious to point out, but you cannot just attach an Ethernet cable to a CPU and a disk drive and have networked storage. In fact, the challenge we have at NetApp is how to combine those components into a product that adds value. In effect, the source of all of our hard problems is how to write clever algorithms that exploit the attributes and hide the limitations of disks to add value to applications that require stable storage, within a particular cost structure (CPU, RAM and disk).

Which gets me to item 2 of my list, trade-offs. If you had an infinite budget, you could construct a stable storage system that had enough memory to cache your entire dataset in battery backed RAM. You could imagine that periodically some of the data would be flushed to disk. Such a storage subsystem would be fairly simple to construct but would be ridiculously expensive. In effect, too few customers would pay for it.

In effect, customers want a certain amount of performance that fits into their budget. The trick is how to deliver that performance. And the performance, it turns out, is not just about how fast you perform read and write operations, but in fact encompasses all of the tasks you need to perform with stable storage. And this is where things get messy.

Performance for a storage sub-system is of course about how fast you can get at the data, but also how fast you can back up the data and how fast you can restore your data and how fast you can replicate the data to a remote site in case of a disaster. And it turns out that for many customers those other factors are important enough that they are willing to trade off some read and write performance if they can get faster backups, restores and replication. And it further turns out that for many customers the performance of an operation is also a function of the ease of performing said operation. For example, if a restore takes 3 minutes to perform but requires 8 hours of setup before you can hit the restore command, customers understand that the performance is really 8 hours and 3 minutes.

So really performance is a function of raw read and write, speed of backup, restore and replication and ease of use.

It turns out that if you optimize for any one of those vectors exclusively you will fail in the marketplace. To succeed you have to trade off time and energy spent on one in favor of another.

So where do the hard problems come from at NetApp?

  1. Building a high-performance storage subsystem that is reliable. This, in many ways, is a canonical file system, volume manager and RAID-level problem; however, because we are a dedicated storage sub-system we have other specific challenges.
  2. Building efficient mechanisms for replication, backup and restore that, unless you are careful, can affect 1. This is a unique area and is relatively new in the storage industry. Although replication has existed for a while, how backup and DR should be optimally done is not yet fully understood.
  3. Building a simple storage system. For NetApp a key value proposition is that the total cost of ownership of our devices is lower than our competitors’. It turns out that simplicity is a challenge not only for one storage subsystem but also when you have several hundred, but I’ll talk about that in a later post.

So now I’ve hopefully explained where our hard problems come from. In my next posts I’ll discuss each of these sub-bullets in more detail.

On the nature of our hard problems …

In my post about why you should work at NetApp, I described the four fundamental reasons as

  1. Work on something important
  2. Work on hard problems
  3. Work with intelligent people
  4. Have your contribution matter

I explained in an earlier post why what I do is important to our customers.

So now let me tackle the question of hard problems. In this post, I’ll limit myself to defining the general nature of what a hard problem for a company like NetApp is. In later posts, I’ll get into more specifics about the kinds of hard problems we work on.

The first thing to do is define a hard problem. A traditional definition of a hard problem is:

A problem is fundamentally hard if no solution at any cost is known to exist, and previous attempts at solving the problem have resulted in failure. A problem may be impossible if no solution exists, but for the purpose of this post we will assume that a problem is hard if and only if a solution exists. This class of problems is typically the domain of basic research.

At a company like Network Appliance, we do not typically explore problems in this space, although we have in specific areas over the past 15 years. Basic research is just not our focus. If you are interested in working on these kinds of problems, my recommendation is to get a Ph.D. in Computer Science and then find an academic or research lab position.

The hard problems that NetApp engineering works on fit into the following bucket:

There exists some basic technology that offers some compelling features to a user but does not completely satisfy the requirements of the user. The user is willing to pay for the basic technology. The user is willing to trade-off some features for other features.

To understand how this applies to NetApp I need to explain a whole bunch of things. The first is the nature of the basic technologies that we rely on and how they influence us. The second is why the user wants to use that technology. The third is to explain how the basic technology cannot meet the requirements of the user. The fourth is that there is an opportunity to build interesting products that can satisfy the requirements of the consumer.

Once I’ve explained those four things, I can explain in more detail where our hard problems lie.

But I’ll leave that for another post….

Updated with some cleaned up grammar.