A mystery within a puzzle ...
By joe
In some previous posts I had been discussing bonnie++ (not bonnie, sorry Tim) and its seeming inability to keep the underlying file system busy. So I hauled out something I wrote a while ago for precisely this purpose (I'll get it onto our external Mercurial repository soon): push the box(es) as hard as you can on IO. I built it with OpenMPI on the JackRabbit (JR5 96TB unit) and ran it.

Before I go into what I observed (I will show you snapshots of it below), I should explain what the theoretical maximum read speed of this unit should be, if everything is working in our favor. Each RAID is a 16 drive unit: 1 hot spare and 15 drives in RAID6, which gives us effectively 13 data drives. If each drive can read/write at, say, about 100 MB/s, we should see very nearly 1300 MB/s per RAID. With 3 RAID cards, we should be looking at about 3900 MB/s aggregate. Keep that in mind.

Here is the 'dstat -D sdc,sdd,sde' output while reading:
Our run
mpirun -np 8 ./io-bm.exe -n 32 -f /data/file -r -d
read 32 GB on 8 processes from /data/file using direct IO (no caching)
----total-cpu-usage---- --dsk/sdc-----dsk/sdd-----dsk/sde-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ: read  writ: read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 120B  460B|   0     0 | 334   242
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 120B  460B|   0     0 | 333   242
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 120B  460B|   0     0 | 328   233
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 240B 1288B|   0     0 | 339   245
  3   0  69  27   0   0|  47M    0 :  46M    0 :  45M    0 | 282B  562B|   0     0 | 765   330
 17   3  29  51   0   0|1036M    0 :1010M    0 :1033M    0 | 180B  460B|   0     0 |7867   938
 21   4  25  50   1   0|1195M    0 :1197M    0 :1200M    0 | 120B  460B|   0     0 |9105  1093
 22   2  28  47   1   0|1178M    0 :1176M    0 :1175M    0 | 120B  460B|   0     0 |8983  1081
 22   2  37  37   0   0|1204M    0 :1200M    0 :1205M    0 | 120B  460B|   0     0 |9136  1114
 21   3  34  42   0   0|1178M    0 :1184M    0 :1179M    0 | 120B  460B|   0     0 |9069  1322
 21   3  25  50   0   0|1177M    0 :1176M    0 :1178M    0 | 120B  460B|   0     0 |8960  1206
 21   3  25  51   0   0|1129M    0 :1130M    0 :1129M    0 | 120B  460B|   0     0 |8575  1065
 21   2  26  50   0   0|1160M    0 :1154M    0 :1161M    0 | 120B  460B|   0     0 |8826  1075
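For reference, the general shape of a test like this is straightforward. The sketch below is not the io-bm source (that isn't in the repository yet), just a minimal illustration of the pattern: each MPI rank opens the file with O_DIRECT so the page cache is bypassed, reads its own contiguous slice through an aligned buffer, and the aggregate rate comes from the total bytes moved divided by the slowest rank's elapsed time. The block size, argument handling, and file layout here are my illustrative assumptions, not io-bm's.

/* Minimal sketch of a direct-IO MPI read test (illustrative only, not
 * the io-bm source).
 * Build:  mpicc -O2 io_read_sketch.c -o io_read_sketch
 * Run:    mpirun -np 8 ./io_read_sketch /data/file 4    (4 GB per rank)
 */
#define _GNU_SOURCE               /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

#define BLOCK (4UL * 1024 * 1024) /* 4 MB reads, multiple of 4 KB for O_DIRECT */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const char *path = argv[1];
    long bytes = atol(argv[2]) * 1024L * 1024L * 1024L;  /* GB per rank */

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }

    void *buf;                                /* O_DIRECT wants aligned buffers */
    if (posix_memalign(&buf, 4096, BLOCK)) MPI_Abort(MPI_COMM_WORLD, 1);

    off_t offset = (off_t)rank * bytes;       /* each rank reads its own slice  */
    long done = 0;

    MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together       */
    double t0 = MPI_Wtime();
    while (done < bytes) {
        ssize_t n = pread(fd, buf, BLOCK, offset + done);
        if (n <= 0) break;                    /* end of file or error           */
        done += n;
    }
    double dt = MPI_Wtime() - t0;

    /* Aggregate rate: total bytes moved divided by the slowest rank's time. */
    long total = 0;
    double tmax = 0.0;
    MPI_Reduce(&done, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&dt, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("read %ld MB on %d ranks in %.2f s -> %.0f MB/s aggregate\n",
               total >> 20, nproc, tmax, (total / 1048576.0) / tmax);

    free(buf);
    close(fd);
    MPI_Finalize();
    return 0;
}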
It looks like I have some sort of bug in the code, because this much traffic really does appear to be transiting the RAID cards. We are sustaining about 3300 MB/s at the hardware level, but for reasons I don't quite understand, it appears we are throwing most of it away: the code reports a benchmark number about 1/8th of that 3300 MB/s. Running a single thread shows me numbers I believe, 804 MB/s, and the dstat output supports this. Running two threads (ok, processes) increases the traffic on the RAIDs to 1500 MB/s or so, but the read takes 3 seconds longer and the code reports a lower bandwidth. I am assuming I have a bug in this code (as well as contention issues); I'll look today.

This said, it does appear that it is possible to drive this unit near its theoretical maximum. Even if the code is broken in what it reports, the traffic at the hardware level is independent of the bugs in the code.
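One guess at where a clean factor of 8 could come from (and it is only a guess until I read through the code): if each of the 8 processes computes its own MB/s over roughly the same wall-clock window, then any report that prints a single per-process number, or averages them instead of summing, will land at about 1/8th of what dstat sees at the hardware. The toy arithmetic below uses made-up numbers purely to illustrate that relationship.

/* Toy arithmetic only -- all numbers below are made up for illustration. */
#include <stdio.h>

int main(void)
{
    const int    nproc       = 8;
    const double gb_per_rank = 4.0;   /* 32 GB total split over 8 ranks */
    const double elapsed_s   = 10.0;  /* hypothetical shared wall time  */

    double per_rank  = gb_per_rank * 1024.0 / elapsed_s;   /* MB/s */
    double aggregate = nproc * per_rank;                    /* MB/s */

    printf("per-process: %6.0f MB/s\n", per_rank);    /* ~ 410 MB/s  */
    printf("aggregate  : %6.0f MB/s\n", aggregate);   /* ~ 3277 MB/s */
    return 0;
}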