More fast rabbits ...

By joe

March 14, 2007

The rebuild finished. Rebooted, not sure why we were getting the oddities we did. On the test track. Open it up, just a little.

As I am sitting here, I am watching it spill 500-700 MB/s to disk in writes. Our test case is 2x larger than physical memory. Caching isn’t relevant for reading and writing here. Now it is switching into “Reading intelligently…”. To understand why this is so interesting, here is some dstat output again.

Note those 900+ M numbers. That is 900+ MB/s being read from the platters, not from cache. At 2x memory, this stuff just is not sitting in cache. There just isn’t enough room.

For laughs, I am rerunning and watching with atop. According to atop, during that 600+ MB/s set of writes we were using only 30-35% of the bandwidth of each link. This means a ballpark of 2-3x more potential performance, though I am going to bet that we would see about 2x in the optimally tuned case. Of course, these things are also code dependent; quite a few codes will hit their own internal walls due to design before they hit the system’s walls. The reads are showing 40-45% utilization, so I think we are closer to what we should expect there, assuming about 65-70% efficiency.

The other issue we are probably running into is the buffer cache, and the speed at which the processor can interact with it. On this system we see 1235 MB/s reading from buffer cache. On other systems scattered about the lab we see 6778 MB/s, 1702 MB/s, and 2123 MB/s. On a different JackRabbit we see 2102 MB/s. It is my belief that this is impacting the performance. Not sure why we are seeing this, as our streams numbers on this platform are pretty nice:

This suggests something amiss in the kernel, and possibly some tuning bits to be done. Either that, or we cannot trust the hdparm test as a reliable measure of cache performance. If we switch kernels and see this change, then I expect that this is something we shouldn’t worry about.

Will run IOzone now, save the results, and use them for the benchmark report. Should be fun. Once this is done, we have some requests to load Windows x64. I will need to build IOzone for Windows; hope it is possible.

Note: this is worth a short discussion. Each RAID controller has 18 disks organized in a RAID6. That gives us 16 effective disks (18 - 2 for parity). Each disk can read at about 70 MB/s, and write at about 45 MB/s. 16 disks’ worth of reads will give you a rough bandwidth of 1.1 GB/s per RAID controller, operating at 100% efficiency. We are seeing about 40-45% of this.

The interface between each RAID controller and the machine is PCIe x8. This means that we have 4 GB/s duplex, or 2 GB/s of bandwidth available in each direction. Of that, the 16 equivalent disks could use about 1/2, and we are seeing less than 1/2 of that 1/2. That is, we are driving the reads on each controller at 1/4 or less of the pipe bandwidth.

What this suggests is that we should use more and smaller RAID controllers, so that we use more effective bandwidth per RAID card. We have more than enough backplane bandwidth for the IO; what we need is to distribute the load across more RAID units. Ballpark guess: about a doubling of the performance.

What’s nice is the amount of headroom in the design. We aren’t hitting the limits, as compared to other units which are badly oversubscribed on bandwidth and overall performance. We aren’t even near our limits, and this system roars. Our RAID6 numbers are quite a bit nicer than others’ RAID5 numbers (and in some cases, better than their RAID0 numbers). Imagine what would happen if we had all the controllers we needed.
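For anyone who wants to redo the back-of-the-envelope math from the note above, here is a rough sketch in Python. The constants are just the round figures quoted in this post (18 disks per controller, RAID6, about 70 MB/s read and 45 MB/s write per disk, 2 GB/s per direction on PCIe x8, 40-45% observed efficiency). The function and variable names are made up for illustration, and the model is deliberately naive: ideal streaming bandwidth is data disks times per-disk rate, nothing more.

```python
# Rough per-controller bandwidth arithmetic, using the round numbers
# quoted in the post. All names here are illustrative, not from any tool.

DISKS_PER_CONTROLLER = 18
PARITY_DISKS = 2                     # RAID6 spends two disks' worth on parity
DISK_READ_MB_S = 70                  # approximate per-disk streaming read
DISK_WRITE_MB_S = 45                 # approximate per-disk streaming write
PCIE_X8_MB_S = 2000                  # per direction, as stated in the post
OBSERVED_EFFICIENCY = (0.40, 0.45)   # observed fraction of the ideal read rate


def ideal_raid_mb_s(disks, parity, per_disk_mb_s):
    """Naive ideal streaming rate: data disks times per-disk rate."""
    return (disks - parity) * per_disk_mb_s


ideal_read = ideal_raid_mb_s(DISKS_PER_CONTROLLER, PARITY_DISKS, DISK_READ_MB_S)
ideal_write = ideal_raid_mb_s(DISKS_PER_CONTROLLER, PARITY_DISKS, DISK_WRITE_MB_S)

# What 40-45% of the ideal read rate works out to, per controller.
observed_read = tuple(ideal_read * e for e in OBSERVED_EFFICIENCY)

# Fraction of one direction of the PCIe x8 link that those reads occupy.
pipe_fraction = tuple(r / PCIE_X8_MB_S for r in observed_read)

if __name__ == "__main__":
    print(f"ideal read per controller:  {ideal_read} MB/s")
    print(f"ideal write per controller: {ideal_write} MB/s")
    print(f"observed read range:        {observed_read[0]:.0f}-{observed_read[1]:.0f} MB/s")
    print(f"share of PCIe x8 (one way): {pipe_fraction[0]:.0%}-{pipe_fraction[1]:.0%}")
```

The ideal read figure comes out to 1120 MB/s, which is the 1.1 GB/s quoted above, and the observed reads land at roughly a quarter of one direction of the link. The same model puts ideal writes at 720 MB/s per controller, which is not a number measured in this post, just what the per-disk write rate implies.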