Accelerated computing appliances
By joe
- 5 minutes read - 1026 words

Having spoken with quite a number of potential and current customers on this topic, we had been assured that end users were largely uninterested in expensive, single point-function computing devices. They wanted something inexpensive, fast, and reusable for other work. What really stood out in the responses to our inquiries: 10x current platform performance for well under $10k US.
Of course, this was over the last several years, and Moore's law advances give you 10x every 5-6 years, so you can always just wait. Sort of. Not quite: there are other considerations that such a vastly oversimplified and naive economic model of purchasing does not take into account.

We have been showing Progeniq appliances for a few months which provide roughly 10x the performance of current AMD Opteron hardware for informatics, specifically HMMer and ClustalW. Smith-Waterman is claimed to be much faster, though our measurement of its performance against the SSE2-enabled Smith-Waterman from the FASTA package indicates parity between the accelerated hardware and the software. The Progeniq package is under $5k US. Our own Scalable HMMer gets you 2x better performance, and is free (as in Open Source). The mpi-HMMer project gets you to about 50x, and is free (as in Open Source).

Well, the good folks at Mitrionics have worked with the folks at SGI to create a bioinformatics appliance that sells for about $30k, which can run blastn 4-10x faster than current platforms (more on that in a moment), and currently runs only blastn. Blastp is promised soon. Now compare that with mpiBLAST, which can get you 10-100x, again for free (as in Open Source). Our forthcoming xlstrblast will drive mpiBLAST as well as NCBI BLAST on clusters, giving you one interface to run both, seamlessly. It won't cost $30k (but you would be welcome to pay us that if you felt you needed to).

I get the sense that some people don't entirely grasp accelerated computing from a market perspective, from a use perspective, or from a need perspective. Re-inventing wheels doesn't make sense unless you are going to come out much faster, better, and cheaper, and will be able to capture more market share.

Now on to their benchmarks. They stated in the HPCwire article that
This is interesting, in that this might be one of their first public admissions that Itanium2 is not the right direction. However, I question a few things.

I am willing to bet (knowing SGI) that they built the BLAST software with the Intel compiler (why not), using options appropriate for the Xeon and the Itanium2. It is fairly well known that the Intel compiler tests a string in the processor up front to select code paths ... that is, if the CPUID vendor string doesn't come back as "GenuineIntel", the slowest code path is taken. Rather than test for specific functionality (e.g. SSE2/3/4), they test for the presence of a string. Note: the last time I looked inside a binary generated by an Intel compiler v9.1.xx (literally last month), this was still the case. No other processor vendor can put that string into their CPU, or they will violate copyrights. Hence, only Intel chips will be given the faster code paths by Intel compilers. Which means that if you build your code with Intel compilers and then run it on AMD chips, you might be losing lots of performance on AMD. Sad, but this is how marketing at Intel appears to have negatively impacted their otherwise excellent engineering. It also means that, apart from exceptional cases, we wave people off the Intel compilers. (A small sketch contrasting a feature check with a vendor check appears below.) For a vendor wanting to make a dig at a competitor, using the slower code paths allows them at least a glancing blow rather than a direct, full-on frontal assault.

Our measurements show approximate parity (within a few percent either way) for blastn run on Woodcrest 2.66 GHz units and AMD Opteron 275 (2.2 GHz) units with similar memory/IO configurations. This is not using Intel compilers. When using the Intel compilers, performance actually drops relative to our baseline tests on the Woodcrest. It drops significantly using the Intel compilers on the AMD. Of course, one wouldn't expect SGI, an Intel shop, to tell you this. So please take their benchmarking numbers with a kilogram or two of salt. We already know Itanium2 is not appropriate for these tests ... we have an Itanium2 in house that we work with, and while it is quite nice for single-threaded, cache-bound FP code, pretty much anything else gives it conniptions.

But back to the appliance. We have been telling people for a while that you can expect about 10x, possibly up to 100x if you work hard, per single accelerator device. We have also been speaking to customers who have told us that 5x performance is roughly the baseline: if you cannot achieve this, it is generally not worth the money. Similarly, they have told us the maximum they are willing to spend on it. SGI comes in 4x faster than their "fastest" platform (never mind that this is suspect), and at about 3x more than what our customers tell us they would be willing to spend.

Let me ask this. If you simply go out and buy 8 more nodes for that $30k, and get 16-32x performance using mpiBLAST, precisely what is the value of their hardware bioinformatics accelerator? Today we can get 10x from a sub-$5k accelerator for HMMer, ClustalW, and others. With our software we can get 2x from plain nodes. With our software layered atop the accelerator, we can get 20+x out of 2 nodes + 2 accelerators. With our software and a good MPI stack, we can get 50+x out of 64 nodes. Adding in the accelerators, we should be able to hit hundreds of X faster on large enough problems. 4x won't cut it. 10x wouldn't cut it at $30k.
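To put rough numbers on that argument, here is a minimal back-of-envelope sketch in C using the figures quoted above. The assumption that $30k buys roughly 8 commodity nodes is mine for illustration, not a vendor quote, and real speedups depend entirely on the workload.

```c
/* Back-of-envelope cost per unit of speedup, using the numbers quoted
 * in this post.  The "8 nodes for $30k" figure is an assumption for
 * illustration; the speedup ranges are the ones claimed or measured. */
#include <stdio.h>

struct option {
    const char *name;
    double cost_usd;    /* purchase price                        */
    double speedup_lo;  /* low end of claimed/measured speedup   */
    double speedup_hi;  /* high end of claimed/measured speedup  */
};

int main(void)
{
    struct option opts[] = {
        { "SGI/Mitrionics blastn appliance", 30000.0,  4.0, 10.0 },
        { "8 cluster nodes + mpiBLAST",      30000.0, 16.0, 32.0 },
        { "Progeniq accelerator (HMMer)",     5000.0, 10.0, 10.0 },
    };
    size_t n = sizeof(opts) / sizeof(opts[0]);

    printf("%-35s %12s %12s\n", "option", "$/x (best)", "$/x (worst)");
    for (size_t i = 0; i < n; i++) {
        /* best case divides by the highest speedup, worst by the lowest */
        printf("%-35s %12.0f %12.0f\n", opts[i].name,
               opts[i].cost_usd / opts[i].speedup_hi,
               opts[i].cost_usd / opts[i].speedup_lo);
    }
    return 0;
}
```

Under those assumptions the appliance works out to roughly $3,000-$7,500 per unit of speedup, versus roughly $900-$1,900 for plain nodes running mpiBLAST, and about $500 for the sub-$5k accelerator. That is the gap the $30k box has to justify.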
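And here is the sketch promised above on the code-path point. This is not Intel's actual dispatcher, just an illustration of the difference between asking the CPU what it can do and asking who made it. It uses GCC's __builtin_cpu_init/__builtin_cpu_is/__builtin_cpu_supports helpers, which come from much later compilers than the ones discussed here, so treat it purely as a sketch of the idea.

```c
/* A minimal sketch (not any vendor's real dispatcher) contrasting
 * dispatch on CPU *features* with dispatch on CPU *vendor*.
 * Build with a recent GCC on x86.                                  */
#include <stdio.h>

static void fast_sse2_path(void)    { puts("SSE2-optimized kernel"); }
static void slow_generic_path(void) { puts("generic scalar kernel"); }

int main(void)
{
    __builtin_cpu_init();

    /* What you would want a dispatcher to do: test capabilities.  */
    if (__builtin_cpu_supports("sse2"))
        fast_sse2_path();
    else
        slow_generic_path();

    /* What the post describes the Intel-built binaries doing:
     * keying the choice off the vendor, so a fully SSE2-capable
     * AMD part still lands on the slow path.                      */
    if (__builtin_cpu_is("intel") && __builtin_cpu_supports("sse2"))
        fast_sse2_path();
    else
        slow_generic_path();

    return 0;
}
```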
Having done my time at SGI in the past, I can tell you that there are some very bright people working on the technology, people I was honored to have worked with. I had the sense that SGI marketing just didn’t get it. Seeing this announcement confirms that little has changed.