
Where do NetApp’s hard technical problems come from?

In an earlier post I talked about the nature of NetApp’s hard problems, and I claimed that there were three factors:

  1. A basic technology that is incomplete
  2. A customer base willing to trade off features
  3. A customer base willing to pay for those features

In this post I’ll try to give some detail about items 1 and 2.

For NetApp, the basic technologies that have been driving our innovation (which is the fancy word for the set of hard problems we’ve solved) have been, and continue to be, networks, storage media and commodity computing.

Back in the day when NetApp was founded, the traditional computing system consisted of a CPU, RAM, some input and output devices, and some form of stable storage. This is still how desktop PCs and laptops are built. However, in the data center, traditional computing systems have changed dramatically.

As an aside, data center is a term used to describe the set of computers that are not used for personal computing but are a shared computing resource across a company or institution. Normally we associate the term data center with the enterprise, but really any company that has a shared computing resource (such as email, file serving or print serving) has a data center, and this discussion applies to them as well as to the Fortune 500.

What caused that change was the combination of networking speeds and commodity computing.

The traditional computer system made a lot of sense because of the performance ratios between the components. Every normal application assumes that RAM has fast, uniform access and that storage has predictably slow access. The performance of the application is a function of the speed of the CPU and the speed with which you can get data to and from RAM and to and from stable storage. Now it turns out that RAM and stable storage are much slower than CPUs. Caching and clever algorithms are used to improve the performance of applications by trying to hide the latency of both RAM and stable storage. For storage, I’ll just state that those algorithms live in the VM, the file system, the volume manager and the RAID subsystem.
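As a rough illustration of what “hiding the latency” means in practice, here is a minimal sketch (hypothetical names, not NetApp code) of a read cache sitting in front of a slow block device: repeated reads of hot blocks are served from RAM instead of going back to disk.

```python
from collections import OrderedDict

class CachedBlockDevice:
    """Minimal sketch of a read cache in front of a slow block store.

    Illustrative only: real file systems and volume managers layer far
    more sophisticated policies (read-ahead, write-back, and so on).
    """

    def __init__(self, backing_store, capacity_blocks=1024):
        self.backing = backing_store          # object with read_block(block_no)
        self.capacity = capacity_blocks
        self.cache = OrderedDict()            # block_no -> bytes, kept in LRU order

    def read_block(self, block_no):
        if block_no in self.cache:
            self.cache.move_to_end(block_no)  # fast path: served from RAM
            return self.cache[block_no]
        data = self.backing.read_block(block_no)  # slow path: go to disk
        self.cache[block_no] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used block
        return data
```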

Now it turns out that the algorithms used to improve disk performance were executing on the same CPU that the application itself was running on. Worse, the storage subsystem was competing with the application for the far scarcer resource of memory bandwidth. As application demands for CPU and memory bandwidth increased, the CPU cycles consumed by the storage system came under scrutiny, and reasonable people started to ask whether the storage system really needed so many cycles. In fact, some folks believed that the existence of general purpose storage subsystems was the source of the performance problem. They therefore argued for eliminating all of those clever file systems and replacing them with custom per-application solutions. The problem with that approach was that no one wanted to write their own file system, volume manager and RAID subsystem.

In software computer science, every problem can be solved with a layer of indirection. In hardware computer science, every performance problem can be solved with a dedicated computing element.

The computer industry (and the founders of NetApp in particular) observed that there was already a layer of indirection in UNIX between the storage subsystem and the rest of the computing system: the VFS layer and the NFS client. They also observed that because Ethernet speeds were increasing, the storage subsystem could be moved onto its own dedicated computing element. In effect, the speed of the network was not an issue given the predictable slowness of the storage. Or more precisely, by moving the storage subsystem out of the main computer they could use more computing and memory resources to compensate for any increased latency caused by the stable storage no longer being directly attached to the local shared bus. In fact, in the 1990s NetApp used to remark that our storage systems were faster than local disks. They further observed that commodity CPU trends allowed them to build that dedicated computing element out of commodity parts, which made it cost effective; designing your own ASIC is absurdly expensive.
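The point about the VFS layer is that applications program against an interface rather than against the local disk, so the implementation behind the interface can live locally or across the network. Here is a toy sketch of that kind of indirection, with made-up names (this is not the actual UNIX VFS or NFS client API):

```python
from abc import ABC, abstractmethod

class FileStore(ABC):
    """Toy stand-in for the VFS-style indirection: callers see one interface."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

class LocalFileStore(FileStore):
    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:           # bytes come off the local disk
            return f.read()

class RemoteFileStore(FileStore):
    """Sketch of a network-backed store, loosely in the spirit of an NFS client."""

    def __init__(self, client):
        self.client = client                  # hypothetical RPC client object

    def read(self, path: str) -> bytes:
        return self.client.fetch(path)        # bytes come over the network

def load_config(store: FileStore) -> bytes:
    # The application neither knows nor cares where the bytes actually live.
    return store.read("/etc/app.conf")
```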

Now it also turned out that putting the data on the network had some additional benefits beyond just performance. But it was those networking and CPU and disk drive technology trends that enabled the independence of storage subsystems.

It’s almost too obvious to point out, but you cannot just attach an Ethernet cable to a CPU and a disk drive and have networked storage. In fact, the challenge we have at NetApp is how to combine those components into a product that adds value. In effect, the source of all of our hard problems is how to write clever algorithms that exploit the attributes and hide the limitations of disks, to add value to applications that require stable storage, within a particular cost structure (CPU, RAM and disk).

Which gets me to item 2 on my list: trade-offs. If you had an infinite budget, you could construct a stable storage system with enough battery-backed RAM to cache your entire dataset, periodically flushing some of the data to disk. Such a storage subsystem would be fairly simple to construct but ridiculously expensive. In effect, too few customers would pay for it.
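To make “ridiculously expensive” concrete, here is a back-of-the-envelope comparison. The prices are deliberately made up; only the ratio matters.

```python
# Illustrative only: both $/GB figures below are assumptions, not real quotes.
dataset_gb = 10_000                # say, 10 TB of data to keep on stable storage

ram_cost_per_gb = 100.0            # assumed $/GB for battery-backed RAM
disk_cost_per_gb = 1.0             # assumed $/GB for disk

all_ram_system = dataset_gb * ram_cost_per_gb
disk_based_system = dataset_gb * disk_cost_per_gb

print(f"all-RAM:    ${all_ram_system:,.0f}")
print(f"disk-based: ${disk_based_system:,.0f}")
# With these assumptions the all-RAM design costs 100x as much,
# which is why too few customers would pay for it.
```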

In effect, customers want a certain amount of performance that fits into their budget. The trick is how to deliver that performance. And performance, it turns out, is not just about how fast you perform read and write operations; it encompasses all of the tasks you need to perform with stable storage. And this is where things get messy.

Performance for a storage subsystem is of course about how fast you can get at the data, but also how fast you can back up the data, how fast you can restore it, and how fast you can replicate it to a remote site in case of a disaster. And it turns out that for many customers those other factors are important enough that they are willing to trade off some read and write performance if they can get faster backups, restores and replication. It further turns out that for many customers the performance of an operation is also a function of how easy it is to perform. For example, if a restore takes 3 minutes to run but requires 8 hours of setup before you can hit the restore command, customers understand that the real performance is 8 hours and 3 minutes.
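Put differently, the number the customer experiences is the end-to-end time, not the time of the final command. A trivial worked version of the restore example above, using the same hypothetical numbers:

```python
# Effective performance of a restore is setup time plus execution time,
# not execution time alone (numbers are from the example above).
setup_hours = 8.0        # staging tapes, finding the right backup, and so on
restore_minutes = 3.0    # time the actual restore command takes

effective_minutes = setup_hours * 60 + restore_minutes
print(f"effective restore time: {effective_minutes:.0f} minutes")  # 483 minutes
```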

So really, performance is a function of raw read and write speed, the speed of backup, restore and replication, and ease of use.

It turns out that if you optimize for any one of those vectors exclusively, you will fail in the marketplace. To succeed you have to trade off time and energy among all of them.

So where do the hard problems come from at NetApp?

  1. Building a high performance storage subsystem that is reliable. In many ways this is the canonical file system, volume manager and RAID problem, but because we are a dedicated storage subsystem we have other specific challenges (a toy illustration of the RAID piece follows this list).
  2. Building efficient mechanisms for replication, backup and restore that, unless you are careful, can hurt item 1. This is a unique area and is relatively new in the storage industry. Although replication has existed for a while, how backup and DR should optimally be done is not yet fully understood.
  3. Building a simple storage system. For NetApp a key value proposition is that the total cost of ownership of our devices is lower than our competitors’. It turns out that simplicity is a challenge not only for one storage subsystem but also when you have several hundred, but I’ll talk about that in a later post.
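As one small illustration of item 1: the reliability half of the problem rests on ideas like RAID parity, where losing any single data block still leaves you able to reconstruct it from the survivors. Here is a toy sketch of XOR parity; a real RAID implementation layers write ordering, performance and failure handling on top of this.

```python
from functools import reduce

def parity(blocks):
    """XOR equal-sized data blocks together to produce a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Rebuild a single lost data block from the survivors plus parity."""
    return parity(surviving_blocks + [parity_block])

# Toy example: three data blocks plus one parity block.
data = [b"aaaa", b"bbbb", b"cccc"]
p = parity(data)
lost = data.pop(1)                        # simulate losing one block
assert reconstruct(data, p) == lost       # XOR parity recovers it
```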

So now I’ve hopefully explained where our hard problems come from. In my next posts I’ll discuss each of these sub-bullets in more detail.

On the nature of our hard problems …

In my post about why you should work at NetApp, I described the four fundamental reasons as

  1. Work on something important
  2. Work on hard problems
  3. Work with intelligent people
  4. Have your contribution matter

I explained in an earlier post why what I do is important to our customers.

So now let me tackle the question of hard problems. In this post, I’ll limit myself to defining the general nature of what a hard problem for a company like NetApp is. In later posts, I’ll get into more specifics about the kinds of hard problems we work on.
The first thing to do is define a hard problem. A traditional definition of a hard problem is:

A problem is fundamentally hard if no solution at any cost is known to exist, and previous attempts at solving the problem have resulted in failure. A problem may be impossible if no solution exists, but for the purpose of this post we will assume that a problem is hard only if a solution exists. This class of problems is typically the domain of basic research.

At a company like Network Appliance we do not typically explore problems in this space, although we have in specific areas over the past 15 years. Basic research is just not our focus. If you are interested in working on these kinds of problems, my recommendation is to get a Ph.D. in Computer Science and then find an academic or research lab position.

The hard problems that NetApp engineering works on fit into the following bucket:

There exists some basic technology that offers compelling features to a user but does not completely satisfy the requirements of the user. The user is willing to pay for the basic technology. The user is willing to trade off some features for other features.

To understand how this applies to NetApp I need to explain a whole bunch of things. The first is the nature of the basic technologies that we rely on and how they influence us. The second is why the user wants to use that technology. The third is how the basic technology cannot meet the requirements of the user. The fourth is that there is an opportunity to build interesting products that can satisfy the requirements of the consumer.

Once I’ve explained those four things, I can explain in more detail where our hard problems lie.

But I’ll leave that for another post….

Updated with some cleaned up grammar.