Where do NetApp’s hard technical problems come from?

In an earlier post I talked about the nature of NetApp’s hard problems, and I claimed that there were three factors:

  1. A basic technology that is incomplete
  2. A customer base willing to trade off features
  3. A customer base willing to pay for those features

In this post I’ll try to give some detail about items 1 and 2.

For NetApp, the basic technologies driving our innovation, which is the fancy word for the set of hard problems we’ve solved, have been and continue to be networks, storage media, and commodity computing.

Back in the day when NetApp was founded, the traditional computing system consisted of a CPU, RAM, some input and output devices, and some form of stable storage. This is still how desktop PCs and laptops are built. However, in the data center, traditional computing systems have changed dramatically.

As an aside, data center is a term used to describe the set of computers that are not used for personal computing but are a shared computing resource across a company or institution. Normally we associate the term data center with the enterprise, but really any company that has a shared computing resource (such as email or file serving or print serving) has a data center, and this discussion applies to them as well as to the Fortune 500.

What caused that change were networking speeds and commodity computing.

The traditional computer system made a lot of sense because of the performance ratios between its components. Every normal application assumes that RAM has fast, uniform access and that storage has predictably slow access. The performance of the application is a function of the speed of the CPU and the speed with which you can get data to and from RAM and to and from stable storage. Now it turns out that RAM and stable storage are much slower than CPUs. Caching and clever algorithms are used to improve the performance of applications by trying to hide the latency of both RAM and stable storage. For storage, I’ll just state that those algorithms lived in the VM, file system, volume manager, and RAID subsystem.
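To make the latency-hiding idea concrete, here is a minimal sketch of a read cache sitting in front of slow stable storage. The CachedDisk class, the block contents, and the latency figure are all hypothetical, chosen purely to illustrate the principle.

```python
import time

class CachedDisk:
    """A toy read cache: repeated reads are served from RAM instead of
    paying the (simulated) disk latency again. Names, block contents, and
    the latency figure are illustrative, not real storage internals."""

    def __init__(self, blocks, disk_latency=0.005):
        self.blocks = blocks            # dict: block number -> bytes, the "disk"
        self.disk_latency = disk_latency
        self.cache = {}                 # block number -> bytes, held in RAM

    def read(self, block_no):
        if block_no in self.cache:      # cache hit: RAM-speed access
            return self.cache[block_no]
        time.sleep(self.disk_latency)   # cache miss: pay the disk's latency
        data = self.blocks[block_no]
        self.cache[block_no] = data     # remember it for next time
        return data

disk = CachedDisk({0: b"superblock", 1: b"inode table"})
disk.read(1)    # slow: has to go to the "disk"
disk.read(1)    # fast: served straight from RAM
```

Once a block has been read once, later reads never pay the disk’s latency again; the clever part is deciding what to keep cached when RAM is scarce.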

Now it turns out that the algorithms used to improve disk performance were executing on the same CPU that the application itself was running on. Worse, the storage sub-system was competing with the application for the far more scarce resource of memory bandwidth. As application demands for CPU and memory bandwidth increased, the CPU cycles being consumed by the storage system came under critical scrutiny, and reasonable people started to ask whether the storage system really did require so many CPU cycles. In fact, some folks actually believed that the existence of general-purpose storage sub-systems was the source of the performance problem. They therefore argued for eliminating all of those clever file systems and replacing them with custom per-application solutions. The problem with that approach was that no one wanted to write their own file system, volume manager, and RAID subsystem.

In software computer science, every problem can be solved with a layer of indirection. In hardware computer science, every performance problem can be solved with a dedicated computing element.

The computer industry (and the founders of NetApp in particular) observed that there was a layer of indirection in UNIX between the storage sub-system and the rest of the computing system: the VFS layer and the NFS client. They also observed that because Ethernet network speed was increasing, the storage subsystem could be moved onto its own dedicated computing element. In effect, the speed of the network was not an issue when it came to the predictability or slowness of the storage. Or, more precisely, by moving the storage sub-system out of the main computer they could use more computing and memory resources to compensate for any increased latency caused by the stable storage no longer being directly attached to the local shared bus. In fact, in the 1990s NetApp used to remark that our storage systems were faster than local disks. They further observed that the trends in commodity CPUs allowed them to build their dedicated computing element out of commodity parts, which made such a computing element cost effective to build. Designing your own ASIC is absurdly expensive.
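The value of that indirection is easiest to see as an interface. Below is a hedged sketch, in Python rather than the C of a real VFS, of how an application-facing file interface can be satisfied either by a local disk or by a remote, dedicated storage machine; the class names and the RPC stub are invented for illustration.

```python
from abc import ABC, abstractmethod

class FileService(ABC):
    """The layer of indirection: applications program against this interface
    and never need to know whether the bytes live on a local disk or on a
    dedicated machine across the network. Names are invented for illustration;
    in UNIX this role is played by the VFS layer and the NFS client."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class LocalFileService(FileService):
    """Storage on a directly attached local disk."""

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


class RemoteFileService(FileService):
    """Storage served by a dedicated machine over the network. The transport
    is a placeholder RPC stub; a real client would speak a protocol like NFS."""

    def __init__(self, rpc):
        self.rpc = rpc                      # hypothetical RPC client object

    def read(self, path):
        return self.rpc.call("read", path)

    def write(self, path, data):
        self.rpc.call("write", path, data)
```

Because the application only ever sees FileService, swapping the local implementation for the remote one is invisible to it, which is exactly what made moving storage onto its own box feasible.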

Now it also turned out that putting the data on the network had some additional benefits beyond just performance. But it was those networking and CPU and disk drive technology trends that enabled the independence of storage subsystems.

It’s almost too obvious to point out, but you cannot just attach an Ethernet cable to a CPU and a disk drive and have networked storage. In fact, the challenge we have at NetApp is how to combine those components into a product that adds value. In effect, the source of all of our hard problems is how to write clever algorithms that exploit the attributes and hide the limitations of disks, to add value to applications that require stable storage, within a particular cost structure (CPU and RAM and disk).
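As one concrete, if simplified, example of exploiting the attributes and hiding the limitations of disks: disks handle a few large sequential writes far better than many small random ones, so a storage system can collect incoming writes in memory and push them out in batches. The sketch below illustrates that general technique; the names and the batch size are assumptions, and this is not NetApp’s actual implementation.

```python
class WriteBatcher:
    """One classic way to hide a disk limitation: disks prefer a few large
    sequential writes over many small random ones, so collect incoming writes
    in memory (NVRAM, in a real appliance) and push them to disk in one batch.
    A sketch of the general technique, not NetApp's code."""

    def __init__(self, device, batch_size=64):
        self.device = device        # anything with a write_batch(list) method
        self.batch_size = batch_size
        self.pending = []           # (block number, data) pairs held in memory

    def write(self, block_no, data):
        self.pending.append((block_no, data))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.device.write_batch(self.pending)   # one large write, few seeks
            self.pending = []
```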

Which gets me to item 2 of my list: trade-offs. If you had an infinite budget, you could construct a stable storage system with enough memory to cache your entire dataset in battery-backed RAM. You could imagine that periodically some of the data would be flushed to disk. Such a storage subsystem would be fairly simple to construct but would be ridiculously expensive. In effect, too few customers would pay for it.
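Just to show how little there is to that design, here is a sketch of the infinite-budget system described above, with a plain dict standing in for battery-backed RAM and another for the disk; all names are hypothetical.

```python
class RamFirstStore:
    """A sketch of the 'infinite budget' design: the whole dataset lives in
    (notionally battery-backed) RAM and is periodically flushed to disk.
    Simple and fast, but the RAM bill makes it impractical."""

    def __init__(self, disk):
        self.ram = {}       # the entire dataset, held in RAM
        self.disk = disk    # a dict standing in for slow stable storage

    def write(self, key, value):
        self.ram[key] = value          # RAM speed; disk is never in the write path

    def read(self, key):
        return self.ram[key]           # everything is already in memory

    def flush(self):
        # Called periodically (e.g. from a timer) to push the data to disk.
        self.disk.update(self.ram)
```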

In effect, customers want a certain amount of performance that fits into their budget. The trick is how to deliver that performance. And performance, it turns out, is not just about how fast you perform read and write operations; it encompasses all of the tasks you need to perform with stable storage. And this is where things get messy.

Performance for a storage sub-system is of course about how fast you can get at the data, but also how fast you can back up the data, how fast you can restore your data, and how fast you can replicate the data to a remote site in case of a disaster. And it turns out that for many customers those other factors are important enough that they are willing to trade off some read and write performance if they can get faster backups, restores, and replication. It further turns out that for many customers the performance of an operation is also a function of how easy it is to perform said operation. For example, if a restore takes 3 minutes to perform but requires 8 hours of setup before you can hit the restore command, customers understand that the performance is really 8 hours and 3 minutes.

So really, performance is a function of raw read and write speed; the speed of backup, restore, and replication; and ease of use.

It turns out that if you optimize for any one of those vectors exclusively, you will fail in the marketplace. To succeed you have to trade off the time and energy spent on one against the others.

So where do the hard problems come from at NetApp?

  1. Building a high-performance storage subsystem that is reliable. This is, in many ways, the canonical file system, volume manager, and RAID-level problem; however, because we are a dedicated storage sub-system, we have other specific challenges.
  2. Building efficient mechanisms for replication, backup, and restore that, unless you are careful, can affect 1. This is a unique area and is relatively new in the storage industry. Although replication has existed for a while, how backup and DR should optimally be done is not yet fully understood.
  3. Building a simple storage system. For NetApp a key value proposition is that the total cost of ownership of our devices is lower than our competitors’. It turns out that simplicity is a challenge not only for one storage subsystem but also when you have several hundred, but I’ll talk about that in a later post.

So now I’ve hopefully explained where our hard problems come from. In my next posts I’ll discuss each of these sub-bullets in more detail.
