Now that I (think I) understand most of the major issues here, and can be reasonably sure I have a good grasp of the tuning, I want to take it out on the test track and give it one final once-over.
Let's open the throttle. Wide.
I can tune the I/O scheduler, the number of outstanding I/O requests (for sorting), the various buffer cache settings, and the works. The clock is now left alone (I need to set it that way by default), so it is running at full speed.
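A minimal sketch of the block-layer knobs involved — scheduler, queue depth, readahead. The device name and the values here are illustrative guesses, not the actual tuning used on this box; the script defaults to a dry run that prints what it would do.

```shell
#!/bin/sh
# Sketch of block-layer tuning: I/O scheduler, outstanding requests, readahead.
# DEV and the values are hypothetical examples, not the real tuning.
DEV=${1:-sdb}          # substitute your array member device
DRYRUN=${DRYRUN:-1}    # set DRYRUN=0 to actually write the sysfs files

set_knob() {
    # $1 = sysfs path, $2 = value
    if [ "$DRYRUN" = "1" ]; then
        echo "echo $2 > $1"
    else
        echo "$2" > "$1"
    fi
}

set_knob "/sys/block/$DEV/queue/scheduler"     "deadline"
set_knob "/sys/block/$DEV/queue/nr_requests"   "512"
set_knob "/sys/block/$DEV/queue/read_ahead_kb" "16384"
```

Run it once in dry-run mode, sanity-check the output, then flip `DRYRUN=0`.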
Sanity check: How does buffer cache look?
root@jackrabbit1:~# hdparm -tT /dev/md0

/dev/md0:
 Timing cached reads:   4908 MB in  2.00 seconds = 2455.69 MB/sec
 Timing buffered disk reads:  2060 MB in  3.00 seconds = 685.95 MB/sec
Good-n-fast. None of this 1.1 GB/s stuff we see when powernow is on.
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
jackrabbit1  32096M           568841  85 279055  54           834536  82 491.6   0
jackrabbit1,32096M,,,568841,85,279055,54,,,834536,82,491.6,0,,,,,,,,,,,,,
Not bad. I saw as high as 950 MB/s sequential input in testing, and 600+ MB/s sequential output. Some additional I/O tuning is possible (by default, the deadline elevator favors reads over writes, and that bias is tunable).
Since we have 2 GB of RAID cache, 1 GB per card, we need to push the test past the RAID cache boundary. This was one of my objections to some other testing we had seen in the past: their test cases were entirely cache-bound, and therefore effectively meaningless as an indicator of performance (other than cache performance). Let's get out of the RAID cache regime and into the region where the I/O needs to spill to disk. This hits the power curve hard. If your performance falls off a ledge at the size of your RAID cache, you have … problems.
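The rule of thumb above can be reduced to a quick sanity check: an honest benchmark file has to be comfortably larger than the total controller cache. A small sketch, using this system's 2 × 1 GB caches and an assumed 2x safety margin:

```shell
# Check whether a benchmark size actually spills past RAID controller cache.
# Cache size matches this system (2 cards x 1 GB); the 2x margin is a guess.
CACHE_MB=$((2 * 1024))   # total RAID cache across both controllers
MARGIN=2                 # want >= 2x cache to be measuring disks, not cache

min_test_mb=$((CACHE_MB * MARGIN))
echo "minimum honest test size: ${min_test_mb} MB"

for size_mb in 2048 4096 8192; do
    if [ "$size_mb" -ge "$min_test_mb" ]; then
        echo "${size_mb} MB: past cache, measures disk"
    else
        echo "${size_mb} MB: cache-bound, measures cache"
    fi
done
```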
For laughs, I grabbed a snapshot of a few seconds of dstat, running in a window above iozone.
Each adapter can provide about 780 MB/s during writes. We are mostly filling 2 of these up. We can add more.
The corresponding line from IOzone is
2097152 1024 699383 938361 1333639 1344250 1342067 1147908 1343132 1154050 1342384 555308 708427 1328317 1340149
I would argue that we are still in cache. This is bursty, and still only 2GB of IO.
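For readers not fluent in raw IOzone report lines, a little awk makes the leading fields readable (values are in KB/s). The column order here assumes IOzone's standard full-auto report layout — verify it against the header your iozone run actually prints.

```shell
# Label the leading fields of an iozone report line (values in KB/s).
# Column order is assumed from iozone's standard auto-mode report.
line="2097152 1024 699383 938361 1333639 1344250 1342067 1147908 1343132 1154050 1342384 555308 708427 1328317 1340149"

echo "$line" | awk '{
    printf "file size: %d KB, record: %d KB\n", $1, $2
    printf "write: %d  rewrite: %d\n", $3, $4
    printf "read: %d  reread: %d\n", $5, $6
    printf "random read: %d  random write: %d\n", $7, $8
}'
```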
Looking at a few lines of dstat while we are at the 4 GB size (64k record size), I see
with a corresponding output line of
4194304 64 749795 1027169 2223807 2255197 2209095 1239858 2256997 3067745 2227245 754630 963514 2191593 2228561
This is outside of RAID controller cache. Running atop and looking at the I/O per controller, it is reporting 15-50% utilization for these cases. We have headroom.
All of this is running RAID6.
Will work on the benchmark report. This kernel is a step back in version, and we lose about 5-6% performance compared to the later-model kernels. But we are seeing sustained data boluses of 0.8-0.9 GB/s and better, with bursts to 1.3-1.5 GB/s.
Since we are using PCIe x8 controllers, we have 4 GB/s to work with bidirectionally, 2 GB/s in each direction. About 86% of that is actually available, due to the way PCIe works. This gives us a maximal bandwidth per controller of about 1.7 GB/s. Two controllers get us to 3.4 GB/s; four would get us to 6.8 GB/s. The problem is that DMA transfers to and from memory will likely become the bottleneck above this. At 1.7 GB/s we can support 24 disks per controller running at 70 MB/s (their current speed, and the current number of disks per controller). We can spread the disks across more controllers and lower the load/contention per controller. This looks like it will increase the speed, rather significantly.
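The arithmetic above, written out as a quick check (the 86% usable fraction and 70 MB/s per-disk figure are the numbers cited in the text):

```shell
# Back-of-envelope check of the PCIe bandwidth numbers above.
per_dir_gbs=2.0    # PCIe x8, one direction
efficiency=0.86    # usable fraction cited in the text
disk_mbs=70        # per-disk streaming speed cited in the text

awk -v bw="$per_dir_gbs" -v eff="$efficiency" -v d="$disk_mbs" 'BEGIN {
    per_ctrl = bw * eff                          # usable GB/s per controller
    printf "per controller: %.2f GB/s\n", per_ctrl
    printf "2 controllers:  %.2f GB/s\n", 2 * per_ctrl
    printf "4 controllers:  %.2f GB/s\n", 4 * per_ctrl
    printf "disks per controller at %d MB/s: %d\n", d, (per_ctrl * 1000) / d
}'
```

This reproduces the ~1.7 GB/s per controller and ~24 disks per controller figures.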
Overall, I am quite pleased with this. This unit does appear to gain from using an external log for XFS, as well as from better tuning of the number of requests per device, the I/O elevator, and other bits. I will update when I have graphs and analysis later this week.
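For reference, an external XFS log looks something like the fragment below. Device names are hypothetical and the sizes illustrative — this is a sketch of the shape of the setup, not the configuration used on this unit; the point is that metadata log writes land on a separate device instead of contending with the data streams on the main array.

```shell
# Sketch of an external-log XFS setup -- device names are hypothetical.
# The log lives on a separate spindle so journal writes do not contend
# with the big sequential streams on the RAID array.

# Create the filesystem with its log on a separate device:
#   mkfs.xfs -l logdev=/dev/sdc1,size=128m /dev/md0

# Mount it, pointing at the external log and bumping the log buffers:
#   mount -t xfs -o logdev=/dev/sdc1,logbufs=8 /dev/md0 /data
```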