Deepak over at the always enjoyable mndoci.com blog asked a great question. It’s the question about the utility of any technology; specifically, he asked whether APUs, and accelerators in general, would be useful.
Around the same time, the US Government released a BAA (Broad Agency Announcement) asking people to give accelerated computing a serious look for their applications. Though it didn’t name the applications explicitly, one could guess at them.
And recently, a good friend from one of the three-letter-acronym computer companies came over, and we talked for a bit. We got to talking about accelerators, and he realized how a competitor had beaten him on a large HPC bid: they had demonstrated far higher benchmark results with far fewer resources.
Meanwhile, PeakStream and RapidMind have released tools, and there is a paper on Sequoia, which I presume is related to the PeakStream environment.
Today our partner Terra Soft announced a Cell-based supercomputer.
This represents a rapid confluence of many different streams. Deepak’s question on utility is quite apropos. Power without the ability to use it would give the emperor new clothes, but do little else.
Utility is a function of how successfully a technology brings value to the application. That is, if your application takes a very long time to run, and if you can easily make it faster by using an accelerator, **and** speed is important to the function of the application and the use of its results, then acceleration would have high utility. If neither speed of solution nor throughput is an issue for you, then acceleration technology will have marginal or even negative utility value to you.
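How “easily” an accelerator makes an application faster depends heavily on how much of the runtime you can actually move onto it. The post doesn’t name it, but Amdahl’s law is the standard way to put numbers on this; the sketch below is a minimal illustration of that law, and the 90% and 50% fractions and the 20x kernel speedup are hypothetical inputs, not measurements.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only part of it is accelerated.

    accelerated_fraction: share of the original runtime moved to the
        accelerator (0.0 to 1.0). Hypothetical input for illustration.
    kernel_speedup: how much faster that share runs on the accelerator.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    # Unaccelerated part runs at the old speed; accelerated part shrinks.
    new_runtime = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / new_runtime


if __name__ == "__main__":
    # A hot kernel that is 90% of runtime, running 20x faster on the accelerator.
    print(round(overall_speedup(0.9, 20.0), 2))  # prints 6.9
    # The same 20x kernel, but it only covers half of the runtime.
    print(round(overall_speedup(0.5, 20.0), 2))  # prints 1.9
```

The same 20x kernel gives very different utility depending on how much of the application it covers, which is one reason the answer to “is it useful?” is so application-specific.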
I spoke to another friend last week about accelerator technology. He works in a large manufacturing organization. We described what we want to do, and what we have demonstrated. In a flash, he was quite excited by what we were talking about. His usage model is constrained by compute cycles, and his employer purchases cluster nodes by the thousands. Without any prodding from us, he even made the leap to our future plans as a natural extension (and a huge market expansion).
I am of the opinion that, in certain market segments, accelerator technology such as APUs is of very high utility. APUs will not help you run Excel faster. They will help you run the jobs whose results you put into Excel faster.
In other news, apart from a few disbelievers, the world has turned fast and hard to multi-core technologies. How to program them is not really a hard question, but the impact upon shared and scarce resources is.
NUMA (non-uniform memory access) is a way to describe physically disjoint but logically cohesive/contiguous memory. In NUMA machines you have a hierarchy of memory systems, and one memory address may not be equivalent to another in terms of access latency and bandwidth. This usually means you have to consider memory access patterns in optimization and tuning efforts. It is well known that if your thread hops between processors attached to different memories, your cache will be less effective over time. Add differing memory latencies and bandwidths, and benchmarks such as STREAM may show interesting results. These results carry over to applications that heavily use the memory system. On early Opterons, we saw about 30% variability with STREAM, and with benchmarks that have a similar access pattern, when NUMA was not taken into account. Processor affinity helps, as does a scheduler smart enough to co-locate memory and processes on the same socket.
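As a minimal sketch of processor affinity on Linux, assuming Python’s `os.sched_setaffinity`/`os.sched_getaffinity` (Linux-only) are available: the process pins itself to one CPU and only then writes its working set. Under Linux’s default memory policy, a page is placed on the NUMA node of the CPU that first touches it, so pinning before touching keeps the data node-local. The function name `pin_to_cpu`, the 8 MiB buffer, and the 4 KiB page stride are illustrative assumptions, not a tuned recipe.

```python
import os


def pin_to_cpu(cpu=None):
    """Pin the calling process to a single CPU (Linux only).

    If cpu is None, the lowest-numbered CPU we are currently allowed to run
    on is used. Returns the resulting affinity set, or None on platforms
    without os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    if cpu is None:
        cpu = min(allowed)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    result = pin_to_cpu()
    if result is None:
        print("CPU affinity is not supported on this platform")
    else:
        print(f"pinned to CPU(s): {sorted(result)}")
        # Write the working set only *after* pinning, so that first-touch
        # placement puts its pages on this CPU's local memory node.
        buf = bytearray(8 * 1024 * 1024)
        for i in range(0, len(buf), 4096):  # assumes 4 KiB pages
            buf[i] = 1
```

For whole jobs, the same effect is usually achieved from the command line with tools such as `taskset` or `numactl` rather than inside the application.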
Now add multi-cores. Not only do you have to worry about pinning the memory to the socket, you may also need to pin the process to a particular core. The naive default is to do this with strict processor affinity, but where cores share a large cache, process affinity can be a little looser.
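One way to express “a little looser” affinity is to let a process float among the cores of one socket instead of nailing it to one core. The sketch below assumes Linux, reads the socket id from the sysfs file `/sys/devices/system/cpu/cpuN/topology/physical_package_id`, and assumes the large shared cache is shared per socket, which is not true of every processor. The helper names are hypothetical.

```python
import os
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def socket_of(cpu: int):
    """Physical package (socket) id of a CPU from sysfs, or None if unreadable."""
    path = SYSFS_CPU / f"cpu{cpu}" / "topology" / "physical_package_id"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def cpus_sharing_socket(cpu: int, allowed):
    """CPUs in `allowed` on the same socket as `cpu`; {cpu} if topology is unknown."""
    target = socket_of(cpu)
    if target is None:
        return {cpu}
    return {c for c in allowed if socket_of(c) == target}


def pin_loosely_to_socket(cpu=None):
    """Let the calling process float among the cores of one socket (Linux only).

    Assumes cores on a socket share a large cache, so moving between them
    costs little. Returns the new affinity set, or None if unsupported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    if cpu is None:
        cpu = min(allowed)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, cpus_sharing_socket(cpu, allowed))
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    mask = pin_loosely_to_socket()
    print("unsupported platform" if mask is None else f"allowed CPUs: {sorted(mask)}")
```

The trade-off is that the looser mask lets the kernel balance load across the socket’s cores while the data stays in the same shared cache and local memory.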
But the real issue is that in scheduling, you are now going to need to balance resource usage across sockets, so that sockets that are already heavily used (lots of I/O, heavy memory traffic, or lots of interrupt handling) don’t get saddled with another burdensome process.
That is, you have hierarchical resource scheduling, as well as process scheduling and affinity, to worry about. Resource contention, and managing it, will be important in this world. So the focus shifts to making the pipes out of the processors wider, adding more resources on the chip, and getting data into and out of the chip.
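The two-level decision described above (first pick a socket by how loaded its shared resources are, then pick a core on it) can be sketched as a toy placement function. This is a hypothetical illustration, not how any particular OS scheduler works: the per-socket utilization fields and the weights in `WEIGHTS` are made-up assumptions that a real scheduler would have to measure and tune.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Socket:
    socket_id: int
    memory_bw_util: float  # fraction of memory bandwidth in use, 0..1
    io_util: float         # fraction of I/O capacity in use, 0..1
    irq_util: float        # fraction of time spent handling interrupts, 0..1
    core_load: Dict[int, int] = field(default_factory=dict)  # core id -> runnable processes


# Hypothetical weights for how much each shared resource matters.
WEIGHTS = {"memory": 0.5, "io": 0.3, "irq": 0.2}


def socket_pressure(s: Socket) -> float:
    """Weighted load on a socket's shared resources."""
    return (WEIGHTS["memory"] * s.memory_bw_util
            + WEIGHTS["io"] * s.io_util
            + WEIGHTS["irq"] * s.irq_util)


def place(sockets: List[Socket]) -> Tuple[int, int]:
    """Pick the least-pressured socket, then its least-loaded core.

    Ties are broken by the lower id. The chosen core's load is incremented
    so that repeated calls spread processes out.
    """
    best = min(sockets, key=lambda s: (socket_pressure(s), s.socket_id))
    core = min(best.core_load, key=lambda c: (best.core_load[c], c))
    best.core_load[core] += 1
    return best.socket_id, core


if __name__ == "__main__":
    busy = Socket(0, memory_bw_util=0.9, io_util=0.6, irq_util=0.7, core_load={0: 1, 1: 2})
    quiet = Socket(1, memory_bw_util=0.3, io_util=0.1, irq_util=0.1, core_load={2: 2, 3: 1})
    print(place([busy, quiet]))  # prints (1, 3)
```

Socket 0 has the idler core, but its memory, I/O, and interrupt load make it the worse home for another heavy process, which is exactly the balancing problem described above.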
This also means that you might be able to start specializing these chips/cores, just like the Cell, with one PowerPC core and 8 highly specialized cores. Anyone want to argue that the Cell is just a PowerPC with some interesting bits attached? This is an accelerator. An APU. Its utility for games is obvious. Its utility for scientific computing is just now being explored, and initial results are tantalizing.
This doesn’t mean FPGAs or ClearSpeed’s accelerators are not interesting. They all are.
The question at the end of the day is what value you can bring with them, how hard you have to work to bring that value, and how hard you have to work to bring additional value. Cell has not yet demonstrated that it will bring value, though many, including myself, believe it will. FPGAs have demonstrated value, though in the past they have exacted a high price to realize it. They preceded Cell in supercomputing by at least a decade and have numerous successes to point to. Unfortunately for developers, you need to recode in VHDL if you want performance. You can use the C-to-VHDL compilers, but you will give up a lot of performance by doing so. The people doing good things with FPGAs will be the ones who develop better tool sets atop them.
Utility is, in large part, the value of what you can do with a technology. Accelerator technology promises more computing per unit time. This does not necessarily equate to being more productive. However, using our history as a guide, the more power we provide to modelers, and to the people who simulate and analyze, the more results we get per unit time. The Council on Competitiveness measured this in various surveys, which suggests that HPC is essential. If this is the case, and we can significantly increase the repeat rate for results generation with accelerator technology, as has been demonstrated in some cases, then there is a strong case for high utility value.
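The “repeat rate” argument is simple arithmetic, sketched below under hypothetical numbers: a 12-hour job and a 6x overall speedup are illustrative assumptions, not results from any system mentioned here.

```python
def runs_per_day(runtime_hours: float, speedup: float = 1.0, concurrent_jobs: int = 1) -> float:
    """Completed runs per day for a job, given an overall speedup.

    runtime_hours: wall-clock time of one run without acceleration.
    speedup: whole-application speedup (1.0 means no acceleration).
    concurrent_jobs: independent runs executing side by side.
    """
    if runtime_hours <= 0 or speedup <= 0 or concurrent_jobs < 1:
        raise ValueError("runtime and speedup must be positive, concurrent_jobs >= 1")
    return concurrent_jobs * 24.0 / (runtime_hours / speedup)


if __name__ == "__main__":
    print(runs_per_day(12.0))       # prints 2.0  (a 12-hour job, unaccelerated)
    print(runs_per_day(12.0, 6.0))  # prints 12.0 (the same job with a 6x speedup)
```

Going from two results a day to twelve changes how a modeler works: more design iterations, more parameter sweeps, and faster turnaround on questions, which is where the utility comes from.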