Saw this on high-scalability. Google performed an analysis of NUMA, and in that analysis they discovered many of the same results we uncovered at SGI in the mid-90’s. And that is super cool, because it suggests we, at SGI, were on the right track when we worked on the problem.
At the core of the results is the observation that NUMA is NUMA and not UMA: to get performance you need to understand the data layout, and performance depends on the application’s data access patterns.
What I find really cool is this:
“Based on our findings, NUMA-aware thread mapping is implemented and in the deployment process in our production WSCs. Considering both contention and NUMA may provide further performance benefit. However, the optimal mapping is highly dependent on the applications and their co-runners. This indicates additional benefit for adaptive thread mapping at the cost of added implementation complexity.”
Back at SGI we spent a lot of time trying to figure out how to get NUMA scheduling to work, and how to spread threads around to get good performance based on application behavior. One of the key technologies we invented was dplace, which placed threads on CPUs based on an understanding of the machine’s topology and the way memory would be accessed.
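To make that concrete, here is a minimal sketch of the kind of placement a tool like dplace automates: pin a thread to the CPUs of one NUMA node and allocate its working memory on that same node. This uses Linux’s libnuma rather than SGI’s dplace, and the hard-coded node number is purely illustrative; a real tool would pick placements from the machine topology and the application’s access patterns.

```c
/* Sketch only: bind the calling thread to NUMA node 0 and keep its
 * memory local to that node, using libnuma.
 * Build with: gcc pin.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = 0; /* illustrative choice; real tools derive this from topology */

    /* Restrict this thread to the CPUs belonging to the chosen node. */
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);
    numa_sched_setaffinity(0, cpus);

    /* Allocate the working set on the same node so accesses stay local. */
    size_t len = 64 * 1024 * 1024;
    void *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... do the work on buf here ... */

    numa_free(buf, len);
    numa_free_cpumask(cpus);
    return 0;
}
```

The whole point, then as now, is that where the thread runs and where its memory lives have to be decided together.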
So it’s nice to see someone else arrive at the same conclusion because it probably means we are both right …