A cluster system from Microsoft

By joe

June 9, 2006 - 11 minutes read - 2264 words

I had a conversation recently with two nice people from Microsoft about their (now released) WCC product. One of them, Patrick, wrote a comment here that is worth looking at (for some reason WordPress is editing it, so go to this URL: http://scalability.org/?p=59#comment-40 ).

I have been skeptical of the WCC product in that I didn’t understand what Microsoft’s vision was for this (no guffaws here), and thought that I might be misinterpreting what I didn’t hear. That is, speculating.

First, they have a good, sound basis for understanding the market today. End users have voted with their pocketbooks for lower cost solutions, and Microsoft has noted this. As I have pointed out many times over the last 2/3 of a decade, we are in the era of disposable computing, and the era of personal supercomputing. It is interesting to see Microsoft adopt the latter phrase for its own usage, but they are largely correct in doing so. Supercomputing is moving down market. The market for $1M heroic architectures has been, and will continue to be, on the decline, for the simple reason that you can buy the power more cost effectively elsewhere. COTS platforms provide economies of scale that the small run-rate supercomputers never could. They also see that the growth market for supercomputing is in the sub-$250k region. Microsoft gets this; they understand this. This is good.

Second, they have a belief that the world of supercomputing is defined by the ISVs, and the ISVs are clamoring for windows. Don’t shoot the messenger here, let me explain this in terms of what happened in the unix days.

Way back when I used to talk to customers from within the SGI fold, we would talk about how wonderful IRIX was, how many titles had been ported to it, how few had been ported to Solaris (scientific computing, engineering, etc) and others, that end users should buy SGI machines and IRIX systems. This was a relatively weak argument by itself, but combined with others, we were able to make a case both to ISVs that their market on the platform would be large, and to customers that they would be able to get the applications. This effectively locked out other vendors from critical market segments, bread-and-butter for SGI, for a number of years.

What killed unix (not Linux), and for those who don’t realize it, rigor mortis set in years ago, was that you could not write code to a single platform. You had to port code. Which meant the ISVs needed to pick and choose which platforms to support. Which meant, and this is the crux of the argument, they would pick the highest volume markets.

To quote a joke I heard about two physicists in the woods seeing a bear: one says to the other, “I hope you can run fast.” The other says, “All I have to do is run faster than you.”

Something like that is at work here.

Unix died as it required ISVs to port to many different flavors of Unix to get their product to market. It required them to support many different flavors. It required them to purchase machines with many different flavors. This massively increased their development costs and time. Add in multiple incompatible MPIs and other libraries, and delivering software became quite expensive.

Hopefully I didn’t lose you with this. I am going to come back to it later on in this discussion.

What Microsoft is saying is that this is the current state of affairs in Linux. Again, don’t shoot the messenger. They claim that you have to worry about supporting multiple distributions, multiple MPI layers, etc. They claim that this is expensive (supporting multiple MPIs is expensive, but this is true on every platform including windows), and they claim they have an alternative. What they will provide are the APIs and a job scheduler with an API (a Microsoft API; it appears they will not conform to DRMAA … but will leverage some GGF stuff for grid). They will provide the programming environment, the comfortable user interface, the ability to develop applications on the desktop and move them to the cluster. And do so seamlessly.

ok… got it ? Lots to digest.

Let’s go through this, and no, we are not bashing Microsoft, simply being critical and skeptical.

First, are they right that unix is dead ? Yes. I know there are some strong solaris supporters out there. I have been asked a few times by Sun folks when Solaris will take over for Linux. The answer is never. I am sorry about this, and Sun is a partner of ours, but we simply do not see a future in volume systems for this OS. The vast majority of ISVs, developers, and end users we have spoken to have told us that there are precisely two platforms going forward. Windows and Linux. Keeps their costs down. Adding Solaris increases their costs. Helps no one but Sun. What about AIX, HP/UX, etc ? I argue that they are non-volume platforms.

Second, what about their contention that Linux is fractured and you have too many distros, with too many APIs? I didn’t bring this up in the conversation with them, but this was fundamentally why the ISVs began porting to linux in the first place. They saw the linux cluster platforms on the rise, and they saw a single API to write to. Porting there was relatively simple. Since then we have seen many ISVs make Linux the primary target platform for the computational side, and relegate other unix to lower tiers. This is all about volume. Linux drives much higher volumes for them, which increases their revenues. Moreover, as the purchase price of Linux and linux-based hardware is low, the customer’s overall purchase price is significantly less, and therefore the customer is happy. Customers are never happy when you increase their price. More on this in a minute when I talk about business models.

Moreover, Linux lets you develop on your laptop/desktop and run on the super cluster. We have been doing this for years, as have many others. We run Linux on most of our desktops and laptops. Works great, and the development environment is quite nice. Rather low cost as well for the development tools.

So are they correct in implying that only they offer that capability ? No. Has existed for years.

What about the idea that multiple distros are hard to support ? We develop our applications on SuSE 9.x/10.x, copy them to Redhat and run them. They also work seamlessly on Ubuntu, cAos, Centos, … That is, this argument is IMO pure FUD. But then again, they are looking to find points of differentiation, and their claim is that windows is the same everywhere.

What about their point that MPI isn’t the same everywhere? This is mostly FUD, but there are elements of truth there in that you cannot today take an object file built with LAM and run it with an MPICH loader. That is, it is source code compatible, not object code compatible. But, and this is a very important point, this is completely platform independent as a problem. This is an MPI specification issue, and implementation issue, **NOT** a platform issue.

What about their scheduler based upon their own API, using GGF for grids ? Again, the market is awash in high quality schedulers, this is not a significant differentiator.

The potential danger to cluster users running linux arises from a competitor with nearly infinite resources to push their vision. Their vision describes where we have been, and that is accurate. On where we are going and the problems being faced today, I disagree with them on some specific points, and you can see where the FUD is pretty easily.

The potential benefit to cluster users comes if Microsoft wants to play with, rather than overtake, the linux cluster market; there are lots of opportunities for them to do so.

That said, let’s get to some of the points I wanted addressed which weren’t addressed to my satisfaction. The CTC runs one of the few, and one of the largest, windows clusters. They have to run antivirus and firewalls on every node. As anyone with a laptop knows, this is a performance killer. More importantly, corporations demand that every windows machine run antivirus/firewalls; at least all of our customers do (fortune 500s, small/large/…), and this is regardless of whether they are desktops/servers/whatever. This may have been lost on the Microsoft team. I am not sure they realize how painful this situation is.

Antivirus means that you are scanning everything. All IO activity goes through a scanner. This reduces IO throughput. This requires that each node with a disk also have a copy of Norton/etc. More on that in a moment.

Firewall means that stateful packet inspection is in force, as you need to prevent access to a system with a relatively weak security model. You need to tie it down hard. This means you are not running CIFS. The number of attack vectors on this is huge. I wish I had preserved the old logs where I used to watch probes on those ports, so I could send them to these folks. How then are you going to handle home directories and shared file systems?

These were not answered to my satisfaction. Of course you can scan on the front end. Doesn’t mean it will capture all payloads. We know that all too well.

The competitor they are going after, Linux, does not have these problems. You don’t need to scan for viruses. You are running in least priv mode. You limit open ports to 22 (ssh) and an NFS or similar file system. You can even do an ssh file system if you use FUSE. You have many options, and none of them require purchasing a copy of an antivirus per node. Your development tools are free, your cost per node of each copy of Linux and all tools is 0$.

With that, let’s talk about the financial models.

Best case cost scenario for windows: N copies of Windows + N copies of application + N copies of Norton.

Best case cost scenario for Linux: N copies of application

The point about this is that from an acquisition cost standpoint, Linux is going to always be lower, unless Microsoft is giving away free WCC. I don’t think this is going to happen.
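The two best-case scenarios above can be sketched in a few lines of Python. The function names and every price in the example are hypothetical placeholders, not vendor quotes. The only thing the sketch is meant to show is that the application cost appears on both sides, so the gap is simply the per-node OS plus antivirus cost times N, whatever the application costs.

```python
def windows_acquisition_cost(nodes, os_price, app_price, av_price):
    """Best case for Windows: N x (OS + application + antivirus)."""
    return nodes * (os_price + app_price + av_price)


def linux_acquisition_cost(nodes, app_price):
    """Best case for Linux: N x application (OS and tools assumed free)."""
    return nodes * app_price


def acquisition_gap(nodes, os_price, app_price, av_price):
    """Extra acquisition cost of the Windows cluster over the Linux one."""
    return (windows_acquisition_cost(nodes, os_price, app_price, av_price)
            - linux_acquisition_cost(nodes, app_price))


if __name__ == "__main__":
    # All prices here are made-up placeholders, purely for illustration.
    for app_price in (0, 1000, 5000):
        gap = acquisition_gap(nodes=16, os_price=400,
                              app_price=app_price, av_price=40)
        print(f"app price {app_price:>5}: Windows premium = {gap}")
```

The printed premium is the same on every line, since the application price cancels out of the difference.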

Second, what about support costs? Here Microsoft has indicated that they can run/manage things easily from a central point. Excellent. This is exactly what you want, and it follows the Linux model. Make management easy.

Third, incremental costs. Each additional node will bring hardware, software, support costs. Assuming no changes to network infrastructure, these costs will be larger for the windows platform than for the Linux platform.

Unfair competitive advantage. What is the elevator pitch which describes why WCC will be better than the competitors ? I wasn’t presented with it, I would like to hear it.

I can make a good one as to why Linux will be better for various users. Goes something like this: Building a linux cluster allows you to run your common applications at a low cost per application without paying additional money for necessary antivirus software, windows licenses, and MCSEs. It allows you to get off the patch tuesday rollercoaster, and focus upon getting your results out in less time at a lower cost. Without antivirus software your machines process data faster. With linux you can contain your purchase, incremental, and support costs. With linux, you can get systems which are stable in time periods measured in years, which means that much less administrative time and cost, and much less downtime.

For ISVs the pitch is even simpler. Linux clusters are powering the 60+% CAGR seen in the cluster market, pushing the current HPC market past 9.2B$ this year, a market growing at 20% per year. Linux clusters are the fastest growing and largest segment of the HPC market, and will continue to be so for the foreseeable future. With a linux cluster, your customers’ dollars/euros/yuan go further, as few of them need to be spent on needless antivirus and OS licenses, not to mention support contracts.

And that is the point. Linux has, as far as I can see, a number of unfair competitive advantages over WCC, and my concerns were not addressed in my very short discussion with the good team over at Microsoft. This doesn’t mean they never will be, nor that I will never change my mind. Microsoft has, as indicated, near infinite resources to play in markets.

What I took away from the conversation is that Microsoft has a good grasp of where things were, and where they want to go. What I think they have incorrect is their understanding of critical aspects of the market, and they have put a decidedly Microsoft spin on the market.

ISVs want to contain costs. So do customers. Switching to a more costly platform that offers no critical functionality unavailable elsewhere, and that adds virus/firewall issues, is not what I would consider a strong go-to-market pitch. That said, if our customers demand it, we will supply it. I haven’t seen this though, and few are raising the concept.

These are my thoughts, and they will evolve. I believe that Microsoft’s thoughts and model will also evolve.

Update: HPCWire is reporting a volume price of $469/compute node. This means a small 32 node cluster will be $15,008 USD more expensive based simply upon OS alone. Now add in volume cost for Norton, about $1600/year per cluster.
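The arithmetic in the update is easy to check. In this small sketch, the $469 per node and roughly $1600 per year per cluster figures are the ones quoted above. The function name, the assumption that the OS license is a one-time cost, and the multi-year horizon are my own, for illustration only.

```python
OS_PRICE_PER_NODE = 469      # USD per compute node, volume price as reported
ANTIVIRUS_PER_YEAR = 1600    # USD per cluster per year, approximate figure


def windows_premium(nodes, years):
    """Extra cost over `years`: assumed one-time OS licenses plus yearly antivirus."""
    return nodes * OS_PRICE_PER_NODE + years * ANTIVIRUS_PER_YEAR


if __name__ == "__main__":
    print("OS alone, 32 nodes:", windows_premium(32, 0))
    for years in (1, 3):
        print(f"with antivirus for {years} year(s):", windows_premium(32, years))
```

With zero years of antivirus this reproduces the $15,008 figure for a 32 node cluster, and each additional year adds the antivirus cost on top.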