HPC in the first decade of a new millennium: a perspective, part 4
By joe
The impact of markets, and government upon HPC … While the charts from top500.org are nice, they don’t tell everything that happened in this interval. We had 3 recessions, 2 major (2001 and the “Great Recession”) and 1 minor one. We had significant changes in research funding from the US federal government … a refocusing of DARPA on things less HPC specific. These elements all contributed to the trajectory within the decade. Several small players turned into larger players, then flamed out. Linux Networx, arguably one of the better contenders, couldn’t avoid bad business conditions imposed upon it. Some have commented at insidehpc (can’t find the reference to this now) that the HPCC programs are killing off the vendors. There may be some truth to that.
RLX was killed off by other vendors seeing value in its lower power consumption and centralized management message. RLX was early with Infiniband. They had many good ideas. Their execution wasn’t so good. HP acquired their assets. This said, blades are rapidly displacing more traditional pizzabox cluster nodes. The rationale works out in terms of management and power. It doesn’t work out in terms of vendor lock-in. Remember that George Santayana quote? I suspect it will play here again.
Orion Multisystems had its 15 minutes of fame, and it flamed out due to a bad choice of investors. Its technology was an underpowered cluster in a box. What Orion did was to confirm that there is a strong desire for increasing the number of cycles per unit time. What Orion didn’t get, and what RLX before them didn’t get, was that people didn’t want their programs to slow down for the privilege of running lower power. They wanted the same or better speed. Which for RLX and Orion wasn’t possible.
The pace of flameouts did not slacken. Good companies went bang. Bad companies went bang. Merger and Acquisition (M&A) was a little slower during this interval. There was a consolidation, but it usually was in terms of picking over corpses, asset sales, and the like. Compaq had bought DEC, which many argued was dying. HP bought Compaq, and again, people didn’t quite understand that. HP also snagged Convex. SGI bought Cray. That last purchase, coupled with the hubris of SGI and a massive dig-in-your-heels view of Cray, doomed that acquisition.
SGI was (past tense) “The Gee-Whiz Company”. The company that bears the name SGI today has very little to do with that company from 16 years ago. In 1999, SGI/Cray was suffering badly. The brain-drain followed years of ineffective leadership … the tail end of the McCracken era, the Belluzzo era, and now the Bishop era. Its much vaunted purchase of Cray had the net effect of focusing company execs internally, when they needed to focus externally.
The company had lost its way. It wasn’t ready for clusters, but it had to do something. In early 2000 it got into clusters. SGI had a lower end server business started up by a few folks. Joseph Wei, one of my LinkedIn connections, managed a chunk of this. We tried to define the cluster business internally, helping to get an application focus in order to differentiate us then from things HP, IBM, and others could do with cheap boxen. And it worked.
I created SGI GenomeCluster (the software portion called CT Blast), as I had realized that clusters were just expensive door stops without software, and that was a critical differentiator we could create. The team created a nice offering around it, we marketed it, we sold a few, we demoed it at SC00 and LinuxWorld in NYC.
And then in late January/early February 2001, Warren Pratt from SGI, then VP engineering, decided that we weren’t going to do “Pee Cees” any more. Which included clusters. Yes, in early 2001, SGI got out of clusters for the first time. Right before they were starting to take off. I left the company in March of that year. I had had enough of bad decisions by poor leadership. That was the straw that broke this particular camel’s back.
The rest of SGI is, as they say, history. It followed an augured-in trajectory, not understanding that it made a terrible mistake in 2001. It ceded market share and mind share to rivals. It continued to push expensive cycles in a market that demanded lower cost cycles. It pushed Itanium. Cray was spun out to Tera. Which changed its name to Cray. The name has marketing cachet. Use it. And they did so, quite well. When SGI finally started getting onto the x86-64 bandwagon, it was late to the party. Dell, HP, IBM had all beaten it there, and were ramping sales up on clusters, hard and fast. The smaller fry, Linux Networx and others, were ramping sales hard and fast. How was SGI to compete with this?
During my time there, I remember sales/technical/marketing meetings where the point was raised that you can’t paint a box purple and charge 3x the price for it. SGI had been trying to do that. SGI finally declared bankruptcy … for the first time … in 2006. They re-organized, but had the same problems when they emerged. These were hard problems to overcome. How to differentiate yourself in a commodity market.
In 2008, they picked over the bones of Linux Networx (LNXI), grabbing IP, staff, customer lists, and their CEO, Bo Ewald. The man responsible for selling Sun its E10k box for $100M right after the Cray acquisition by SGI, which made Sun something like $1B that first year, after claiming that it wouldn’t be right for SGI to swap our own boxen in for customers instead.
LNXI died for a number of reasons, none of them technical. Most relating to not being able to say no to onerous terms and conditions. And then being forced to accept equally onerous terms and conditions on capital loans. All you needed was one hiccup in a government procurement, one delay in payment, and whammo. Whammo happened in late 2007, early 2008. LNXI had to pledge their assets and IP to cover loans that they needed to operate while waiting for the government to pay them. And the government delayed payment. Game over. Thanks for playing. The CEO of LNXI at the time of this? Bo Ewald. This is the CEO that SGI got.
And what did he deliver? In 2009, SGI hit chapter 11 for a second time, and sold itself for $25M or so. Rackable changed its name to SGI after buying the assets. There is no guarantee that the entity currently named SGI will survive and thrive in this market. They are up against huge entrenched folks like Dell, HP, IBM, and others. The cluster market has seen its margins drop precipitously, so that only a few large players with good economies of scale can survive here. Many other cluster vendors (other than LNXI) went away during these times.
These folks had little differentiation, they were mostly white-box stackers. There was little design effort on their part. One can argue that IBM, Dell, HP, etc. are also white box stackers: their parts are all made by the same companies, and their products are all manufactured on the same/similar lines. The 3 large manufacturing houses provide ODM/OEM capability, for minor tweaks to more or less standard packages. That is, there was a strong drive for consolidation during this time. Smaller value providers either found a larger friend to join up with, tried to go it alone, or failed. I will argue that the same happened with the big vendors.
It is hard to classify Sun as a success story. It would take a redefinition of the word “success” to include … well … failure. Sun, once a darling of Wall Street, never quite recovered from the bubble bursting in 2001. Their failures, while not of the same magnitude as SGI, were epic in their own right. More of a death by a thousand paper cuts. A failure to adapt. Pick any cliché that fits; it may be apropos. SUNW, later JAVA, did not fare well during the decade.
Sun could not figure out if it was a hardware company, a software company, competing with Microsoft, a friend of Microsoft, competing with open source, a friend of open source … Part of this confusion was the result of one of the founders never being able to let go of control of their baby. They had a vision, and an ego to match. The market was changing on them, rapidly. Just as it did with SGI. No longer were the expensive RISC machines needed in HPC, and the Sun machines weren’t particularly known for speed anyway.
I do remember comparing a shiny new Sun workstation with some SPARC processor against a shiny new IBM clone machine running OS/2 in 1991 or so, running a Fortran code (my MD code at the time). What blew me away was that the roughly $20k Sun box was about the same speed as the new 486 based machine I was working on.
Later on, I remember comparing the shiny new SGI 4 processor superworkstation against a new version of the Sun box. Around 1993 or so. I’d never seen anything so fast. We moved all of our runs onto it. Basically, Sun from my personal experience before SGI, and from users’ experiences during and since then, never really had a serious HPC chip they could point to. So they didn’t factor significantly in this market using SPARC/RISC.
But they did get a clue somewhat later in the game using Opteron. They realized that they needed to be in this market. Which likely caused the same sort of internal squabbling that caused SGI to take its eyes off the prize in 1995-1996. Way back then, I remember a sales meeting we went to where T.J. said something to the effect of “wow, those little Challenge S boxes make great web servers.” And then we effectively ignored that market. Ok, we didn’t completely ignore it, we had a web division (and Cosmo Create was IMO the only web creation tool that didn’t completely suck at the time), but in the 1995 time frame, having the hot box for web serving, and not figuring out how to capitalize on it, shows the profound and deep myopia that seems to permeate the cult-of-personality companies like SGI … and Sun.
Yeah, hindsight is 20-20. Certainly. And having a market and outward focused view of what customers want, how the market is changing, how your product offerings need to change to help you adapt and grow … yeah … that’s important too. SGI didn’t have this. I’ll argue that Sun may have fallen into some serendipity with respect to some of these things, but they largely had the same issues.
Now, today, Sun is waiting on EU permission to sell itself for a tiny fraction of its market cap some years ago. It has lost mindshare and marketshare to Dell and HP, among others.
It hasn’t really ever resolved the Sparc vs x86 conflict internally; there are still many ideologues who cannot accept that Sun makes most of its money from the x86 hardware it sells on the server side. Once the purchase is approved (or denied), it is hard to see most of Sun’s hardware continuing on. Yeah, there may be database appliances. There may be backup systems. But HPC stuff? Infinitesimal market to Oracle. You should expect them to basically say “it’s over in these markets” and then list them … I expect HPC to be one of them.
Which leaves in doubt some of Sun’s acquisitions. To wit, they acquired the assets of the Cluster Filesystem company (Lustre developers). This is an HPC file system. A notoriously difficult to install/tune/run file system. It is open source, and DDN may continue it, but I don’t expect much post-sale to continue at Sun. They also acquired VirtualBox (Innotek), SGE (fka Codine), MySQL, and others. Oracle was after Java ownership though. SGE is widely used in HPC, and MySQL is often used in conjunction with HPC database attachments (though I personally prefer PostgreSQL). Oracle is also famous for buying companies, and then changing its mind about offering on-going development. It did this with a Xen-based VM provider (Virtual Iron) recently. Leaves you up a virtual tree if you have a business dependency upon them. Which is also very much on the minds of those folks looking to buy gear. When they look at Sun, they need to be at least aware of the risks.
Unfortunately, with all the delays and competition from HP, IBM, and Dell, there is no way this acquisition will end any other way than badly for HPC at Sun. There is no upside … just a limitation of how bad the downside will be. Expect lots of good Sun people to be kicked to the curb within a short interval of the close.
SiCortex failed. Well, ok, they failed to raise more capital. The company/business was going strong. Just not generating enough to sustain themselves.
They were a victim of skittish VCs. Which leads us to Landman’s first rule of business: “If you have to rely upon VC money to sustain your business, you are doomed … DOOMED I say!!!”. Humor aside, the serious undercurrent is that taking VC money represents an existential risk to an entrepreneur’s effort. Sounds strange put that way; VCs often like to insist they are the ones taking the risk. Which they then proceed to magnify greatly by forcing founders out and their own CEO and management team in. Unfortunately, this is a game you have to play if you need capital to jumpstart efforts. The most important aspect of this is to get to break even as quickly as (in)humanly possible. Do not rely upon VCs. Ever. In fact, use as little as possible of VC money.
SiCortex didn’t get to break even, though they were nearing it. Orion wasn’t anywhere near break even. RLX may have skittered around break even for a while. You have to be past break even, and cash flow positive in HPC. Cray is there. They have been buying back their debt, at a discount. They are a well run, well managed company (I have nothing but respect and admiration for Peter Ungaro). IBM is there, but IBM is a huge company with an uncanny ability to adapt to changing markets. HP is there, same points, and they are winning cluster business. Dell is there, they have a good HPC team (Jeff L, Glen O, Tim, and the team), and are winning cluster business hand over fist.
But even for the big companies, HPC represents a small fraction of their total revenue. For it to remain an important part of the offering, it has to be offered at a profit. And this gets to another observation for the decade. Many deals we saw won were bought. I don’t mean this in an illegal sense of bribery or kickbacks. I mean this from the sense of going negative on margins. Which leads us to Landman’s second rule of business: “If you have to pay a customer to take your product, you won’t be around long enough to enjoy your ‘wins’.”
Basically, a sales rep is often trained to win at all costs, and sales management is trained to beg for margin exceptions at the first sign of serious competition. This is in part because many people have been trained by these actions to equate value with the inverse of price. The higher the value, the lower the price is what is suggested. Value has many components, price being but one.
I remember going to a university bid opening, where the RFP as presented specified a minimum acceptable configuration. Which was promptly thrown out the door when one vendor (who shall remain nameless) submitted a much lower spec and lower priced system. I remember the smug look from the purchasing agent. She didn’t care that the system did not meet specs. She only cared about the lowest price. I do know that this particular system was rebid later on, as the researchers realized that the awarded system would not meet their needs. Could have saved everyone time/effort/money by doing the right thing … but this rarely is done.
The race to the bottom price (not the highest value) has some consequences. Notice the dearth of cluster vendors out there. No, really, there are far fewer. If you can’t make money in this, you are going to do something else, or switch your focus. Ok, a smart company will do that. A dumb company won’t and will go under. This reduction in options is good … how? Ask this of your favorite purchasing rep next time they try to force the low cost option on you. We have seen companies offer clusters at below the COGS for the parts, even bought in high volumes, just to get the ‘business’. Deals like this are not worth it.
I alluded to bad terms and conditions in the past. We have seen some doozies. Free perpetual 24x7 support. Must take back anything we don’t like, and pay us to find a different solution. Net-75 terms, from auto companies. We keep things simple.
We use and price against our T&C, which is designed to be reasonable to all, not grossly or onerously one sided. Other folks happily accept any silly T&C thrown at them. These are the same folks who will willingly give away the store to make a sale. What I found is that even with a sane T&C that a customer accepts, when that customer is a government, they can, and will, construe it to mean whatever they wish. We had this dealing with a customer last year. I am sure this was the major cause of LNXI’s headache. We are adapting our T&C to handle these issues going forward. Which leads us to Landman’s third rule of business: “If you accept any old T&C thrown your way, don’t ever expect to get paid on a reasonable time scale.” You may wind up funding your customers using your gear, interest free for them, for years … as LNXI apparently did.
Which leads us to Landman’s fourth rule of business: “If it is a bad deal, walk away.” We have done this more than once, and will do it again. We try to work with folks to make sure the rules are reasonable and everyone is ok with them. If they aren’t, we aren’t going to risk the company on a bad deal. It is better to walk away from a bad deal than take a bad deal.
Companies came and went in the HPC space during the 2000-2009 decade. Some died of their own failings, some died due to market failings. Some were acquired. Some had their assets acquired, and a few employees strung along to the new entity. Cilk, RapidMind, and PeakStream were acquired. In the two former cases, their products would be integrated into the purchasers’ offerings; in the latter case, the team was the valuable asset. Topspin and other Infiniband companies were acquired. In the case of Topspin, they couldn’t prevent Cisco from deciding to leave IB in favor of 10GbE. MySQL was acquired by Sun, for a staggering sum. Cluster FileSystems, Interactive Supercomputing, … and many others were acquired for assets.
In this time, with the Great Recession still a strong recent event, not fading into memory, there is an impression on the part of acquirers that the target must be in trouble, so they can offer next to nothing to give them an exit. Get the IP and people and brand for effectively free.