What high performance storage isn't ...

By joe

May 6, 2012

This happens often. We get a call from a user who’s seen my postings on the Gluster or other lists. They’ve set up a storage system, and the performance is terrible. Is there anything that can be done about this? We dig into it, and find out that they bought hardware, usually fairly low end/cheap brand name (e.g. tier 1) nodes with limited disk options, and are running 1 disk for the OS and another single, larger SATA or SAS disk for storage. Take, say, 8 of these nodes. Put Gluster on them. Make a replicated distributed system. Should be fast, right? No. Not even close. It’s going to be (very) slow. And, curiously, this is exactly what people rapidly discover. But then they start ascribing the wrong reasons to why it’s slow. And this is where they start calling us, asking if we can help them tune it.

One of the hard parts about being a techie is admitting when you’ve made a mistake. Like designing or implementing something poorly … not out of malice, but out of a lack of a thorough understanding of what the issues are, and how your needs mesh with the solution you are designing. I’ve seen this pattern repeated my entire professional life, first with computing systems, now with storage.

You can usually tell when people don’t understand the aggregate processing power of a system: they begin to sum the clock speeds of the processors within it. I touched on this in a recent post. This is a core and fundamental fallacy in understanding. If a vendor is quoting you performance this way, go find another vendor, preferably one with a clue. If they talk about “N cores running in aggregate at clock speed X” then you can be pretty sure they understand at least some of the nuances of parallel and high performance computing on complex architectures. If they start talking about instruction graduation rate comparisons, various SSE/AVX artifacts, etc., you should buy ’em a beer, and pump ’em for all they’re worth (quite a bit). And pay them while you are at it. Those are the folks with a good strong clue … they’ve likely built/programmed systems from a micro level on upwards.

Why no summing of clock speed? Simple. It’s simply wrong. The instruction graduation rate per core does not change whether you have 100 or 1000 or 1M cores. So taking an external property (number of cores) and trying to make it internal to the machine (clock frequency) is quite a bit of a fallacy, and would be exposed at the first microbenchmark. External properties are exactly that, external. Internal properties (size of caches, speed of clocks, size of RAM in a single computing name space) are internal. You don’t really have 1TB of RAM on your cluster, you have N x 32GB RAM machines, unless you run vSMP from ScaleMP. Then you have 1TB of RAM.
That is, mixing internal and external properties leads one to state incorrect things. vSMP lets you adjust those boundaries to a degree. Gluster (and most other parallel file systems) does not. This is not a rip on Gluster. This is a discussion of why it’s wrong to do this.

If I take a simple program with a simple loop and no memory access, and run it on a single core of a large cluster system, adding nodes to this system will NOT make that code go any faster. That is, the number of nodes, and all the properties that this brings in, are external to all aspects of single core performance. So pretending that N cores x M GHz yields a single core at N x M GHz is simply wrong. The speed of my program will be rate limited by the slowest element in the computing pathway.

And the same is true on the storage side. Take N (some large number of) machines. Put 1 or 2 “fast” disks in there. SATA, SAS, whatever. Say each one can read/write at 150 MB/s (0.15 GB/s). Does this mean you have a storage system with N x 0.15 GB/s performance? Hell no. That’s the theoretical _aggregate_ performance, not the actual single thread storage performance. Yet … yet … many people building Gluster systems seem to use this design pattern, and assume that the theoretical aggregate performance is the single thread storage performance. Simplest takeaway: Stop it. Don’t do that. It won’t work well.

Replication adds more overhead, as Gluster does not return to the writer until both copies have been written. So in the best case, replication is the same speed as non-replicated. In reality, it’s usually 1/2 as fast or worse.

Take these nodes, interconnect them with 1GbE. Your performance should be N x 1GbE for the cluster, right? Again, wrong. That 1GbE is suffering contention. You will get a fraction of it. Indeed, most of the people we’ve worked with on these designs insisted that they should be able to get line rate out of the network, not realizing everything else they were missing.
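To put rough numbers on the slowest-link argument above, here is a minimal sketch. Only the 150 MB/s disk, the 1GbE link, the 8 nodes, and the "1/2 as fast" replication figure come from the text; the function names, the even-split contention model, and the choice of 4 clients sharing a link are assumptions made purely for illustration, not anything Gluster reports or guarantees.

```python
GBE_MB_S = 125.0  # 1 Gb/s as MB/s (1000 Mb / 8 bits), ignoring protocol overhead


def theoretical_aggregate(nodes: int, disk_mb_s: float) -> float:
    """N x disk rate: a ceiling for many parallel streams, not for one writer."""
    return nodes * disk_mb_s


def single_stream_estimate(disk_mb_s: float,
                           link_mb_s: float,
                           clients_sharing_link: int = 1,
                           replication_penalty: float = 1.0) -> float:
    """Rough upper bound for one reader/writer: the slowest element wins.

    Assumptions (illustrative only): the link is split evenly among the
    clients contending for it, and replication is a flat multiplier
    (1.0 = best case, 0.5 = the "usually 1/2 as fast" case from the text).
    """
    link_share = link_mb_s / clients_sharing_link
    return min(disk_mb_s, link_share) * replication_penalty


if __name__ == "__main__":
    print(theoretical_aggregate(8, 150.0))                  # 1200.0
    print(single_stream_estimate(150.0, GBE_MB_S))          # 125.0
    print(single_stream_estimate(150.0, GBE_MB_S, 4, 0.5))  # 15.625
```

Under these assumptions, the 8-node system that "should" deliver 1200 MB/s hands a single writer at most 125 MB/s on an idle link, and something closer to 15 MB/s once the link is shared and replication is on. The model is crude, but it shows why the N x disk figure says nothing about what one user sees.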
Even with “fast” networks, you simply change out the slowest link in the chain for the next slowest link.

After something like the 10th call about this over the last month or so, it started to bug me. Were people really going the do-it-yourself route and building bad designs like this? Who, exactly, is encouraging this? Sadly, I think it might be Red Hat doing the encouragement. Looking at their web presence, I saw this:

They are encouraging people to get the software and plop it onto “cheap hardware”. And all that entails. Reality (especially performance issues) has a habit of slapping you hard when you make critical design mistakes.

Folks, if you want high performance storage systems, your systems have to be actually architected, from the ground up, for high performance. Putting a Ferrari engine in a VW Beetle will not turn the VW Beetle into a Ferrari. Likewise, a mass of 1000 VW Beetles won’t, any time soon, merge into a single super-Beetle with 1000 times the performance of a non-merged Beetle. If you start out with a crappy design, you are going to get, curiously enough, crappy performance.

I’ve heard enough stories of 1-10 MB/s (yes, you read that right) performance out of Gluster storage clusters with replicated distributed volumes, using “the latest gear” from “tier 1” brand names … Our storage clusters get far more than that. Starting at 500 MB/s per single IO user, on up. This has everything to do with design and implementation.

The parallel file system has to work, and work well, for the cluster part of the storage cluster or scale out NAS to work. But the storage piece has to be blazingly fast, as the storage cluster/scale out NAS software invariably has inefficiencies that cut down performance. So is it better to have 50% efficiency on a multi GB/s storage system designed for performance, or 50% efficiency over a shared/contended 1GbE link with a 0.1 GB/s storage system on the back end? That is the real choice, and sadly, the folks pushing the software in this case are doing their customers a disservice by suggesting that putting that software on a pile-o-cheap machines is going to provide good performance. They really need to qualify their statements.
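The 50% efficiency question above is easy to work through as arithmetic. This is a sketch only: the 0.1 GB/s back end, the 1GbE link, and the 50% efficiency are from the text, while the "designed" system's 2000 MB/s back end and 1250 MB/s (10GbE-class) link are hypothetical stand-ins for "multi GB/s", and `delivered_mb_s` is a made-up helper name.

```python
def delivered_mb_s(backend_mb_s: float, link_mb_s: float, efficiency: float) -> float:
    """Per-stream rate once the cluster software takes its cut of the slowest hardware element."""
    return min(backend_mb_s, link_mb_s) * efficiency


# Cheap build: 0.1 GB/s back end behind 1GbE (125 MB/s before any contention).
cheap = delivered_mb_s(backend_mb_s=100.0, link_mb_s=125.0, efficiency=0.5)

# Designed build (assumed figures): 2 GB/s back end behind a 10GbE-class link.
designed = delivered_mb_s(backend_mb_s=2000.0, link_mb_s=1250.0, efficiency=0.5)

print(cheap, designed, designed / cheap)  # 50.0 625.0 12.5
```

The same 50% software overhead leaves 50 MB/s in one case and 625 MB/s in the other, and the cheap number is optimistic because it ignores contention on the shared link. The software's efficiency is identical; the hardware underneath decides the result.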
Rather tellingly, the benchmarks I’ve seen handed out to others (up to and including the last several months) are the ones WE generated 3.75 years ago for a customer acceptance test … on PURPOSEFULLY DESIGNED FOR HIGH PERFORMANCE STORAGE UNITS. Our first siCluster installation. Unfortunately, people have read into this that the software is a silver bullet. It’s not. Hardware design and implementation matter. In this case, they matter far more than the software does. If you start out with a crappy design, you are going to get crappy performance. No kernel tuning will help. No file system changes, no new parameter on the boot line will make a difference. A poorly designed (hardware) system is going to suck, by every measure of the word “suck”.