A mystery within a puzzle ...
By joe
In some previous posts I had been discussing bonnie++ (not bonnie, sorry Tim) and its seeming inability to keep the underlying file system busy. So I hauled out something I wrote a while ago for precisely this purpose (I'll get it onto our external Mercurial repository soon): push the box(es) as hard as you can on IO. I built it with OpenMPI on the JackRabbit (JR5 96TB unit) and ran it.

Before I go into what I observed (I will show you snapshots of it below), I should explain what the theoretical maximum read speed of this unit should be, if everything is working in our favor. Each RAID is a 16 drive unit: 1 hot spare and 15 drives in RAID6, which gives us effectively 13 data drives. If each drive can read/write at, say, about 100 MB/s, we should see very nearly 1300 MB/s per RAID. With 3 RAID cards, we should be looking at about 3900 MB/s aggregate. Keep that in mind.

Here is the 'dstat -D sdc,sdd,sde' output while reading:
Our run
mpirun -np 8 ./io-bm.exe -n 32 -f /data/file -r -d
read 32 GB on 8 processes from /data/file using direct IO (no caching)
----total-cpu-usage---- --dsk/sdc-----dsk/sdd-----dsk/sde-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ: read  writ: read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 120B  460B|   0     0 | 334   242
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 120B  460B|   0     0 | 333   242
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 120B  460B|   0     0 | 328   233
  0   0 100   0   0   0|   0     0 :   0     0 :   0     0 | 240B 1288B|   0     0 | 339   245
  3   0  69  27   0   0|  47M    0 :  46M    0 :  45M    0 | 282B  562B|   0     0 | 765   330
 17   3  29  51   0   0|1036M    0 :1010M    0 :1033M    0 | 180B  460B|   0     0 |7867   938
 21   4  25  50   1   0|1195M    0 :1197M    0 :1200M    0 | 120B  460B|   0     0 |9105  1093
 22   2  28  47   1   0|1178M    0 :1176M    0 :1175M    0 | 120B  460B|   0     0 |8983  1081
 22   2  37  37   0   0|1204M    0 :1200M    0 :1205M    0 | 120B  460B|   0     0 |9136  1114
 21   3  34  42   0   0|1178M    0 :1184M    0 :1179M    0 | 120B  460B|   0     0 |9069  1322
 21   3  25  50   0   0|1177M    0 :1176M    0 :1178M    0 | 120B  460B|   0     0 |8960  1206
 21   3  25  51   0   0|1129M    0 :1130M    0 :1129M    0 | 120B  460B|   0     0 |8575  1065
 21   2  26  50   0   0|1160M    0 :1154M    0 :1161M    0 | 120B  460B|   0     0 |8826  1075
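For reference, the general shape of a test like this is straightforward. The sketch below is not the io-bm source (that isn't in the repository yet), just a minimal illustration of the pattern: each MPI rank opens the file with O_DIRECT so the page cache is bypassed, reads its own contiguous slice through an aligned buffer, and the aggregate rate comes from the total bytes moved divided by the slowest rank's elapsed time. The block size, argument handling, and file layout here are my illustrative assumptions, not io-bm's.

/* Minimal sketch of a direct-IO MPI read test (illustrative only, not
 * the io-bm source).
 * Build:  mpicc -O2 io_read_sketch.c -o io_read_sketch
 * Run:    mpirun -np 8 ./io_read_sketch /data/file 4    (4 GB per rank)
 */
#define _GNU_SOURCE               /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

#define BLOCK (4UL * 1024 * 1024) /* 4 MB reads, multiple of 4 KB for O_DIRECT */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const char *path = argv[1];
    long bytes = atol(argv[2]) * 1024L * 1024L * 1024L;  /* GB per rank */

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }

    void *buf;                                /* O_DIRECT wants aligned buffers */
    if (posix_memalign(&buf, 4096, BLOCK)) MPI_Abort(MPI_COMM_WORLD, 1);

    off_t offset = (off_t)rank * bytes;       /* each rank reads its own slice  */
    long done = 0;

    MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together       */
    double t0 = MPI_Wtime();
    while (done < bytes) {
        ssize_t n = pread(fd, buf, BLOCK, offset + done);
        if (n <= 0) break;                    /* end of file or error           */
        done += n;
    }
    double dt = MPI_Wtime() - t0;

    /* Aggregate rate: total bytes moved divided by the slowest rank's time. */
    long total = 0;
    double tmax = 0.0;
    MPI_Reduce(&done, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&dt, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("read %ld MB on %d ranks in %.2f s -> %.0f MB/s aggregate\n",
               total >> 20, nproc, tmax, (total / 1048576.0) / tmax);

    free(buf);
    close(fd);
    MPI_Finalize();
    return 0;
}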
It looks like I have some sort of bug in the code, because this much traffic really does appear to be transiting the RAID cards. We are sustaining about 3300 MB/s at the hardware level, but for reasons I don't quite understand, it appears we are throwing most of it away: the code reports a benchmark number about 1/8th of that 3300 MB/s. Running a single thread shows me numbers I believe, 804 MB/s, and the dstat output supports this. Running two threads (ok, processes) increases the traffic on the RAIDs to 1500 MB/s or so, but the read takes 3 seconds longer and the code reports a lower bandwidth. I am assuming I have a bug in this code (as well as contention issues); I'll look today.

This said, it does appear that it is possible to drive this unit near its theoretical maximum. Even if the code is broken in what it reports, the traffic at the hardware level is independent of the bugs in the code.
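One guess at where a clean factor of 8 could come from (and it is only a guess until I read through the code): if each of the 8 processes computes its own MB/s over roughly the same wall-clock window, then any report that prints a single per-process number, or averages them instead of summing, will land at about 1/8th of what dstat sees at the hardware. The toy arithmetic below uses made-up numbers purely to illustrate that relationship.

/* Toy arithmetic only -- all numbers below are made up for illustration. */
#include <stdio.h>

int main(void)
{
    const int    nproc       = 8;
    const double gb_per_rank = 4.0;   /* 32 GB total split over 8 ranks */
    const double elapsed_s   = 10.0;  /* hypothetical shared wall time  */

    double per_rank  = gb_per_rank * 1024.0 / elapsed_s;   /* MB/s */
    double aggregate = nproc * per_rank;                    /* MB/s */

    printf("per-process: %6.0f MB/s\n", per_rank);    /* ~ 410 MB/s  */
    printf("aggregate  : %6.0f MB/s\n", aggregate);   /* ~ 3277 MB/s */
    return 0;
}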