Category Archives: technology

PRISM – Where are the servers, revisions

In my analysis I claimed that the NSA had to buy lots and lots of servers.  Something like 1 million.

What friends of mine correctly pointed out is that most servers render data and do not store data.

Which, of course, reveals my bias. At Zynga we don’t render the data for the user; we just process it. At most other web companies, the bulk of the servers render the thing the user sees.

Mea Culpa. 

The reality is that of the 1 million servers, for many web properties, only a fraction stores data. So let’s say 10%, which is probably fair. That reduces the problem to 100k servers.
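
To make that concrete, here’s the back-of-the-envelope arithmetic as a tiny sketch; both inputs are guesses, not data:

```python
# Back-of-the-envelope sketch of the revised estimate; both inputs are guesses.
total_servers = 1_000_000   # rough combined fleet of the big web properties
storage_fraction = 0.10     # assume only ~10% of servers actually store user data

storage_servers = int(total_servers * storage_fraction)
print(f"servers needed just to mirror the stored data: ~{storage_servers:,}")  # ~100,000
```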

Except …

FB, Yahoo and Google are probably not the only interesting places where people store data.

They also store data on Box, DropBox, S3, EBS, tumblr, etc, etc, etc. Any application that stores data for sharing is a target for the NSA.

They also store data in Hotmail (now known as Outlook… Really … )…

The point is that you can easily shrink the problem down, and then I can easily grow it.

And then the interesting problem that the folks at the NSA have to solve isn’t just storing the data but finding connections across the data. Just the size of the data motion and data indexing boggles the mind.

The point is that this is a huge infrastructure.

And that the problems of management, scaling, and operations remain real even before we get to the really interesting question of data analysis.

Now it’s entirely possible that there are researchers in the NSA that have solved all of big-data’s problems that the rest of us are working on. It’s possible.

And unicorns might exist.

Look, if this is real, it means that my understanding of where the state of the art is lags about 10 years behind the curve. And if the US government has sat on this kind of advanced software, then the entire decade we spent figuring this shit out was … wasted.

And if they are that good, it means that entire areas of human endeavor could be accelerated if they gave that software away. Just think about what we could do with that kind of real-time analysis. What would we do if we could sift through all the data about all of humanity in real-time …

At the end of the day, I am having a hard time believing that the rest of the planet is 12 years behind some mysterious dark organization.

Is it possible? Absolutely. Likely, no.

PRISM – Person of Interest meets Reality Show

Everyone gets their 15 minutes of fame. The NSA dude, Edward Snowden, got more than his fair share.

What I found fascinating was his description of how the system creates connections from things you have done and builds an artificial and suspicious narrative.

As I listened to him talk, I remembered where I had seen this theory before — it’s in a TV show called Person of Interest.

The central conceit of the show is that there is a computer that has access to every data source on the planet and is making connections and finding bad guys before anyone else can.

Which made me laugh. Occam’s razor says that the simplest solution is the likeliest. So what’s more likely: that some low-level person invented an impossible-to-disprove conspiracy theory based on a hit TV show, or that the NSA is monitoring computer systems without the smartest minds of my generation figuring it out?

When this is all said and done, we will discover that these are the false claims of a media-hungry person. And we will also rediscover that the press’s technology literacy is abysmal.

PRISM – Where are the frigging servers, part deux…

In my last post, I asked the question “where are the servers”. And, of course, folks sent me links to the Utah data center.

Good response, but I was trying to go somewhere else… Teach me to bury my lead.

Finding a physical place for 1,000,000 servers is easy if you are the US government. The government owns a lot of space it can use.

The problem is more about how in God’s name is the government buying, managing and using 1,000,000 servers.

The scale of the equipment required, and the challenge of managing that scale, is mind-boggling given that it would dwarf the hardest systems ever built commercially.

Buying

Look, the US government relies on outside contractors for its really big super-computers. They don’t have the in-house skill to build one of these things.

And the scale of the equipment would make the US government an insanely huge part of the tech market, which is mind-boggling. Basically, for every server CPU purchased on the open market, the NSA purchases another. Which means the total commercial market is smaller than we think. Which means if you are making business plans based on IDC numbers for the market size, you are, well, wrong.

And this takes into account just the servers. Never mind networking, etc.

Counter 1: But they don’t have to buy them all at once

That’s strictly true, but misses several key insights.

One, Google/FB/et al. are increasing their capacity very quickly. And there are other online services that store data and support collaboration (Box, DropBox, etc.). The number of services and the amount of data are increasing, not shrinking, over time.

To keep up, the NSA has to buy as much total capacity as everyone else is creating. And since everyone else is always buying, the NSA is always buying as well.

The other thing this doesn’t account for is that Google and FB are replacing older servers as they age out and die. And this probably happens on a 4-5 year time scale.

And finally, my 1,000,000 figure was based on data that is 1-3 years out of date; the real numbers are probably bigger.
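
To put a rough shape on the buying problem, here’s a sketch of the steady-state purchase rate. Every input is an assumption I’m making for illustration, not a real number:

```python
# Sketch of the steady-state buying problem; every input here is an assumption.
fleet_size = 1_000_000           # servers needed to mirror the monitored properties
annual_growth_rate = 0.30        # assume the monitored fleets grow ~30% per year
replacement_lifetime_years = 5   # servers age out on a 4-5 year cycle

growth_buys = fleet_size * annual_growth_rate
replacement_buys = fleet_size / replacement_lifetime_years
print(f"servers to buy per year: ~{int(growth_buys + replacement_buys):,}")  # ~500,000
```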

Counter 2: CPUs aren’t getting faster.

Server performance and capacity are increasing. Although CPUs haven’t gotten faster, the number of cores has increased. Which means that the NSA has to buy enough capacity to match the utilization levels of Google/FB etc. Given that Google, FB, and others go to great lengths to improve utilization, this suggests that their server counts are representative of the NSA’s capacity needs.

Counter 3: It’s not that many servers

This is a reasonable argument. This data suggests that 1 million servers is really only 1/350 of the total servers sold globally.

People

If you consider the sheer intellectual horsepower at Google, then you start to scratch your head: where are the people who built this thing?

Seriously.

Because the NSA, thanks to federal law, can’t hire outside of the US.

So maybe the NSA can offer green cards and citizenship super-fast … But then who is doing the hiring? It’s not happening on college campuses and the best and the brightest are not going to IBM and Raytheon.

So where is the army of technically savvy people being hired from?

Managing

Having managed many servers at Zynga, I know that managing infrastructure at even a fraction of this scale is not easy.

Having been part of the team that built what was, at one point, the world’s largest private cloud, I know that the software needed to manage this infrastructure simply does not exist.

To make it even remotely tractable, you need a lot of sophisticated software just to bring machines up and to take machines down.

This kind of software is not simple. It requires big brains to assemble. And it would have to be built from scratch, as nothing on the market can do the job at this scale.
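
To give a flavor of what “bringing machines up and taking machines down” means in software, here is a toy reconciliation loop. The Inventory class and its methods are hypothetical stand-ins, not any real product’s API:

```python
# Toy fleet-reconciliation loop: compare desired state to observed state and act on the
# difference. Inventory and its methods are hypothetical stand-ins; real fleet software
# adds discovery, retries, rate limits, staged rollouts, audit trails, and much more.
class Inventory:
    """Hypothetical stand-in for an asset database plus provisioning hooks."""
    def desired_hosts(self):   return {"db-001", "db-002", "db-003"}
    def healthy_hosts(self):   return {"db-001", "db-003", "db-999"}
    def provision(self, host): print(f"provisioning {host}")
    def decommission(self, host): print(f"decommissioning {host}")

def reconcile_once(inv):
    desired, actual = inv.desired_hosts(), inv.healthy_hosts()
    for host in sorted(desired - actual):
        inv.provision(host)       # missing or dead: bring it up
    for host in sorted(actual - desired):
        inv.decommission(host)    # no longer wanted: take it down

if __name__ == "__main__":
    reconcile_once(Inventory())   # in real life this runs continuously, fleet-wide
```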

Ingest

What nobody talks about is the complexity of managing the ingest of data. Let’s assume you’ve solved the infrastructure, the hiring, and the purchasing. Now we’re talking about magical software that is able to handle data coming from a myriad of different companies.

Each company has its own evolving set of data.

So either you have to deal with the unstructured format (which explodes the computational cost), or you’ve got teams of people at those companies whose job it is to pre-process the data before it leaves their site.
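
Here’s a cartoon of the normalization side of that ingest problem; the source names, record shapes and field names are all made up:

```python
# Cartoon of ingest normalization. The source formats and field names are invented;
# the point is that every provider needs its own, constantly maintained mapping into
# whatever common schema the analysis side expects.
def normalize_source_a(rec):
    return {"user": rec["uid"], "ts": rec["timestamp"], "body": rec["msg"]}

def normalize_source_b(rec):
    return {"user": rec["account"]["id"], "ts": rec["sent_at"], "body": rec["content"]}

NORMALIZERS = {"source_a": normalize_source_a, "source_b": normalize_source_b}

def ingest(source, records):
    normalize = NORMALIZERS[source]   # breaks the day a provider changes its format
    return [normalize(r) for r in records]

if __name__ == "__main__":
    print(ingest("source_a", [{"uid": 1, "timestamp": 1371000000, "msg": "hi"}]))
```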

In short

This smells of a fabrication. To what end, I don’t know.

PRISM – where are the frigging servers?

Over the last two weeks the 1% and its wannabe cohorts have been obsessively worrying about government spying. The rest of the world has tried to keep their jobs and pay their bills.

What’s weird is that the same guys who think the black helicopter conspiracy theorists are “nuts” are suddenly finding common cause with those conspiracy theorists.

What astonished me in this whole discussion was how rarely anyone asked the really basic question: where are all the NSA’s servers? Most reporters focused on the technological feasibility of such a system; I want to ask the mind-numbing question of where the hell the data lives. And where is the infrastructure that computes on it?

Since most of us are software guys, yours truly included, we never ask where the physical systems that run our software actually are. But in this case, I want to.

Let’s speculate that to collect the data in real-time and analyze it in real-time you need an infrastructure as big as the one you are monitoring. What I am saying is that if FB requires 1 CPU cycle and 1 byte to store data as it comes in, the corresponding system that is monitoring the data needs no less than 1 CPU cycle and 1 byte to store the same data. And that assumption is probably too simple: in reality the monitoring system has to spend more CPU cycles analyzing the data than FB does, and it cannot store any less of it. But we’ll stick with that assumption.

The server infrastructure the NSA would have to build is bigger than the combined infrastructure of FB, Yahoo and Google. In plain English, the most advanced technology companies on the planet have built something that, compared to what the NSA has supposedly built, is a toy.

Just to put some numbers on this: FB had about 180,000 servers in 2012, Google was using about 900,000 servers in 2011, and Yahoo, according to this report, had 100,000, though that seems to count only a small piece of Yahoo’s business.

We’re talking about over 1 million servers here (assuming 2012 numbers with no growth). You don’t just have 1 million servers with their switches and racks and disk drives sitting around … This infrastructure would represent a huge portion of corporate America (just think of Cisco for the networking and Intel for the frigging processors). This kind of deployment would literally show up as a significant line item on their balance sheets.
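
The arithmetic behind that “over 1 million” figure, using the numbers cited above:

```python
# Sum of the server counts cited above (FB 2012, Google 2011, Yahoo partial count).
fb_2012, google_2011, yahoo = 180_000, 900_000, 100_000
print(f"combined fleet: ~{fb_2012 + google_2011 + yahoo:,} servers")  # ~1,180,000
```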

Where the f*k do you put 1 million servers? That’s a f*k load of power and networking.

If the NSA really has this kind of infrastructure off the grid, the logistics of purchasing, shipping and secrecy astonish me far more than the already insanely difficult problem of spying on FB in such a way that its top engineers don’t notice.

The fact that no one knows about this much infrastructure should convince us that this is an absurd tale.

But then again we fought a world war and built a bomb and nobody knew about it…

So when someone tells you the government is full of incompetent morons, just tell them: absolutely not, they put together the world’s largest computing infrastructure, it took a low-level systems analyst to spill the beans, and none of the press asked: where the hell are the machines?

The Power Supply Issue with PC hardware

steveballmer

Steve Ballmer must hate his life. His company builds this software, they then hand it to these bozos at Lenovo, and all of a sudden shit happens.

Latest problem.

The Lenovo IdeaPad Y500 has a dual-GPU SLI configuration for its 3D hardware. The problem with an SLI configuration is that it consumes a lot of power. I mean a lot of power.

To actually get the graphics hardware to run, you need a 170W power supply.

Which is fine.

If you don’t need the graphics card, then a 90W power supply will do just fine. And that is great, because a 90W power supply costs $25 these days and you can have several in your house.

And so here’s where the shit hits the fan. Last night I used a 90W power supply because I had never paid attention to the 170W power requirement of my graphics cards.

And I spent two hours trying to figure out why my laptop was suddenly dropping frames, etc.

It was the frigging power supply.

Now I ask, why oh why could Lenovo’s hardware monitors not just tell me that the problem was the power supply? A simple warning? A notification? Something?

But no. Nothing.

Maybe there is a BIOS option for that…

NUMA results from Google and SGI

Saw this on high-scalability. Google performed an analysis of NUMA. In that analysis they discovered many of the same results we uncovered at SGI in the mid-90s. And that is super cool. It’s super cool because it suggests we, at SGI, were on the right track when we worked on the problem.

At the core of the results is the observation that NUMA is NUMA and not UMA: to get performance you need to understand the data layout, and performance depends on the application’s data access patterns.

What I find really cool is this:

Based on our findings, NUMA-aware thread mapping is implemented and in the deployment process in our production WSCs. Considering both contention and NUMA may provide further performance benefit. However the optimal mapping is highly dependent on the applications and their co-runners. This indicates additional benefit for adaptive thread mapping at the cost of added implementation complexity

Back at SGI we spent a lot of time trying to figure out how to get NUMA scheduling to work, and how to spread threads around to get good performance based on application behavior. One of the key technologies we invented was dplace. dplace placed threads on CPUs based on some understanding of the topology of the machine and the way memory would be accessed.
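
For flavor, here is a minimal sketch of explicit placement in the same spirit as dplace; it is not dplace itself, and the node-to-CPU map is a made-up example rather than something discovered from the real topology:

```python
# Minimal sketch of explicit CPU placement in the spirit of dplace (not dplace itself).
# Assumes Linux: os.sched_setaffinity pins a process to a set of logical CPUs.
import os

NODE_CPUS = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}  # hypothetical two-node topology

def place_on_node(node):
    """Pin the calling process (and threads it spawns) to one node's CPUs, so
    first-touch allocation keeps its memory on the same node."""
    os.sched_setaffinity(0, NODE_CPUS[node])  # pid 0 means the current process

if __name__ == "__main__":
    place_on_node(0)
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))
```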

So it’s nice to see someone else arrive at the same conclusion because it probably means we are both right …

Real Time Disruption

One of my favorite things to watch is how industries get disrupted.

The thing that’s amazing is that, in spite of all the information we have both in research and in practice, the same story plays out over and over and over again.

There are three that I am paying very close attention to.

  1. The disruption of college level and high school level education by on-line courses
  2. The disruption of the combustion engine by the electric car.
  3. The disruption of the legal profession through online legal services that address most common legal issues

What’s interesting about 1 and 3 is that the disruption is taking place as a natural response to market opportunities. There isn’t a single force of nature causing the disruption.

What is extraordinary about 2, the electric car, is that this is a case where a visionary leader is actually creating the disruption through sheer force of will.

One of the key misunderstandings of the disruptee and its defenders is the assumption that technological and supply-chain obstacles are insurmountable.

For example, how do you power your car when you go across the country?

What the disruptees don’t realize is that as the demand for charging stations increases (because the number of electric cars increases), the supply of charging stations will increase as well.

This takes time. Except when a visionary leader decides to make things go faster…

Which is what Elon Musk is doing again…

http://www.scientificamerican.com/article.cfm?id=teslas-expanded-supercharger-networ-2013-05

 The stations are only on the East Coast and in California today, but CEO Elon Musk announced this week that Tesla will triple the size of the supercharger network in the next month, according to AllThingsD. The network will span most of the metro areas in the U.S. and Canada by the end of 2013–meaning it will be possible to take a long-distance road trip in a Tesla without worrying about running out of power. Musk has said in the past that the company plans to install over 100 Supercharger stations by 2015.

A better language for Lua?

Given that I work in the gaming industry, I am always fascinated with what people will do with Lua.

The Terra project struck me as an interesting investigation into building a better low-level counterpart to Lua that is not C.

The key claim to fame for Terra is that it offers near-native performance because it is a far less dynamic, lower-level language than Lua, while still being driven from Lua.

The idea behind Terra and Lua is to use Lua as a scripting language for rapid prototyping and Terra for optimized components, without having to deal with the messiness of dropping into C. What makes this particular system intriguing is that Terra functions live in the same lexical environment as Lua functions, which means they interoperate seamlessly while the Terra functions execute outside of the Lua VM… as per their abstract:

High-performance computing applications, such as auto-tuners and domain-specific languages, rely on generative programming techniques to achieve high performance and portability. However, these systems are often implemented in multiple disparate languages and perform code generation in a separate process from program execution, making certain optimizations difficult to engineer. We leverage a popular scripting language, Lua, to stage the execution of a novel low-level language, Terra. Users can implement optimizations in the high-level language, and use built-in constructs to generate and execute high-performance Terra code. To simplify meta-programming, Lua and Terra share the same lexical environment, but, to ensure performance, Terra code can execute independently of Lua’s runtime. We evaluate our design by reimplementing existing multi-language systems entirely in Terra. Our Terra-based auto-tuner for BLAS routines performs within 20% of ATLAS, and our DSL for stencil computations runs 2.3x faster than hand-written C.

I don’t have enough experience with Lua to offer any insight as to whether this is a good idea… but it bounced along the internet superhighway so I’ll try to take a look.

Reading the paper, I realized there is a lot more there in terms of the sophistication and science behind integrating these two different languages. Okay… I’ll have to read and noodle.

Are the 300k servers Microsoft promised game changing?

Recently, as part of the Xbox Live announcement, Microsoft announced a dramatic expansion in the amount of compute it intends to add to its infrastructure. The plan, as announced, was to grow the server count from 15k to 300k, a 20-fold increase.

This is an astonishing number of new servers to add to any service, especially if you are not expecting a huge growth in the number of users.

The marketoon hypothesis

One hypothesis is that some guy in marketing asked some woman in engineering how many servers the data center could hold, she said 300k, and the bozo figured that would make an awesome press release.

If this is true, the groans in Microsoft Engineering would be vast and awesome…

They are trying to do something different

Another, more interesting hypothesis is that they are actually trying to do this:

Booty says cloud assets will be used on “latency-insensitive computation” within games. “There are some things in a video game world that don’t necessarily need to be updated every frame or don’t change that much in reaction to what’s going on,” said Booty. “One example of that might be lighting,” he continued. “Let’s say you’re looking at a forest scene and you need to calculate the light coming through the trees, or you’re going through a battlefield and have very dense volumetric fog that’s hugging the terrain. Those things often involve some complicated up-front calculations when you enter that world, but they don’t necessarily have to be updated every frame. Those are perfect candidates for the console to offload that to the cloud—the cloud can do the heavy lifting, because you’ve got the ability to throw multiple devices at the problem in the cloud.” This has implications for how games for the new platform are designed.

One of the limitations of systems like the Xbox is that the upgrade cycle is 5-7 years. The problem with a 5-7 year upgrade cycle is the difficulty of delivering better and better experiences. Extracting even better performance requires more and more software tuning, until the platform is unable to give any more.

The approach Microsoft is taking is to shift some of the computational effort to the cloud and leverage the faster upgrade cycles it controls to deliver a better experience to its users, without forcing those users to buy more hardware.

Several startups (that ultimately failed) have demonstrated that it is possible to stream a AAA title to a device. So the idea of doing this is not implausible.

With this theoretical approach, the folks at Microsoft are attempting to square the circle: they have a stable but rapidly decaying platform in people’s homes, and they use the hardware in their data centers to deliver increasingly better graphics through a vast amount of pre-computed data.

The problem, of course, is that in practice the amount of data you have to pre-compute and store for immersive 3D worlds is so vast as to be almost impractical. Well, perhaps unless you had 20x more servers per user…

In the 2D space, this was basically the solution Google adopted for Google Maps. Confronted with the problem of dynamically rendering every tile on the client, they pre-rendered the tiles on the server and then had the client stream them.
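
A toy sketch of that precompute-then-stream idea; the functions are made up and just stand in for whatever the real rendering pipeline does:

```python
# Toy sketch of precompute-then-stream: do the expensive rendering once, server-side,
# then serve cached results to clients that only know how to display them.
from functools import lru_cache

def expensive_render(zoom, x, y):
    # Stand-in for the heavy lifting (tile rasterization, baked lighting, etc.).
    return f"tile({zoom},{x},{y}) rendered with lots of CPU"

@lru_cache(maxsize=None)
def get_tile(zoom, x, y):
    """Return a pre-rendered tile; compute it once and keep it cached server-side."""
    return expensive_render(zoom, x, y)

if __name__ == "__main__":
    print(get_tile(3, 5, 7))  # first request pays the rendering cost
    print(get_tile(3, 5, 7))  # later requests just stream the cached result
```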

This is going to be very interesting to see… Although my money is still on the marketoon hypothesis…

Revolution redeems Hackers

Programmers everywhere still squirm when they remember this scene from Hackers

hackers

It was Hollywood’s pathetic attempt to make programming look cool … And it was a groan-inducing scene.

Over the years my wife and I have taken a certain perverse pleasure in inspecting what code is used on-screen. And over time we have found that the code has gone from nonsense, to recognizable constructs, to valid but irrelevant programs.

The Holy Grail was software on-screen that was correct and relevant.

We have found our Grail. In Revolution, a TV show we like, a character approaches a biometric terminal. On the terminal is code.

revolution

And the code is correct and relevant. It’s software using code from the Open Biometrics Initiative … And just in case the apocalypse does happen and you need some code … here’s the GitHub repo.

My only thought was that in the apocalypse, we will have no shortage of C++ programmers, because C++ … C++ is the correct answer.