Previously I told you about octobonnie: 8 simultaneous bonnies run locally to beat the heck out of our servers. If we are going to catch a machine-based problem, it will likely show up under this withering load.
But while that is a heavy load, it is nothing like what we have going on now.
I am sitting here in the office monitoring one of our boxes being tested by a customer before they put it into production (oil and gas market), as they load it from their cluster.
14 simultaneous bonnies running over gigabit. Oversubscribing the networks by more than 2x. Multiple mount points. Random load to the mounts … that is, the bonnies use the mounts at random.
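For anyone who wants to build a load of this shape, here is a minimal sketch. It is not the customer's harness. The mount paths, the `build_command` helper, and the plain `bonnie++ -d DIR` invocation are all assumptions for illustration, so swap in whatever benchmark binary and flags you actually use.

```python
import random
import subprocess


def assign_mounts(mounts, n_jobs, seed=None):
    """Pick a mount point at random (with replacement) for each job."""
    rng = random.Random(seed)
    return [rng.choice(mounts) for _ in range(n_jobs)]


def build_command(mount):
    """Hypothetical command line: run the benchmark in the given directory."""
    return ["bonnie++", "-d", mount]


def run_load(mounts, n_jobs=14, seed=None, dry_run=True):
    """Start n_jobs benchmark processes at once and wait for all of them.

    With dry_run=True, only the command lines are returned and nothing runs.
    """
    commands = [build_command(m) for m in assign_mounts(mounts, n_jobs, seed)]
    if dry_run:
        return commands
    procs = [subprocess.Popen(cmd) for cmd in commands]  # all start together
    return [p.wait() for p in procs]                     # exit codes
```

Calling `run_load(["/mnt/a", "/mnt/b"], dry_run=True)` with made-up mount paths just lists the 14 command lines, which is a cheap way to sanity check the mount spread before turning the real thing loose.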
We are doing this because the channel bonding driver on Linux doesn’t withstand heavy load in any mode other than mode 0. The unit was fine until they set up channel bonding. They used mode 6, which caused a soft-irq lockup. It is the same one I have seen since 2.6.9 and before (this is our 126.96.36.199 kernel). The problem is likely still in 2.6.27 and 2.6.28, though I haven’t tested for this specifically yet. Will do at some point.
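If you want to confirm which bonding mode a box is really running before loading it, the driver reports it through sysfs as a name and a number (mode 0 is `balance-rr`, mode 6 is `balance-alb`). A small sketch follows. The interface name `bond0` is an assumption, and the helper names are mine.

```python
from pathlib import Path


def parse_bonding_mode(text):
    """Parse sysfs mode text such as 'balance-alb 6' into (name, number)."""
    name, number = text.split()
    return name, int(number)


def read_bonding_mode(iface="bond0"):
    """Read the active mode of a bonding interface (assumed name: bond0)."""
    path = Path("/sys/class/net") / iface / "bonding" / "mode"
    return parse_bonding_mode(path.read_text())
```

On a machine with a bond configured, `read_bonding_mode()` returns a tuple you can check against the mode you meant to set.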
Of course, when a partial kernel crash takes out an interrupt handler, other things tied to that pin (the way PCI works is it multiplexes its interrupt pins … interrupts can be assigned specific ports, but this is done in a “soft” manner) could be compromised. Yeah … I am sure some other OSes can survive interrupt handler crashes. Microkernels probably. Just restart the user space service.
But the IRQ went away, which eventually took one of the RAID cards down.
And when a RAID card gets confused, your file systems go south, awful fast.
So this is more or less an extreme loading test. After setting up the clients and mounts, we fired up the load generator.
We’ve done this sort of sustained test before. But we are running this until next week. We will see how well it handles it.
For the record, the 8-processor, 16 GB RAM machine currently has a user load of about 30. And it is still quite responsive.