Finding needles in haystacks covered in a fallen down barn

By joe

October 17, 2015

OK … this one was very annoying. Imagine you are trying to diagnose a system crash on a production unit. The crash is quite specific in the subsystems involved … it’s one where the interrupt handler catches an exception, and then you start piling up softirq contexts. It’s on the network side of things.

You discover that the switch and the NIC are, somehow, incredibly, not quite compatible with each other. I can’t assign blame for this, as I don’t know who is at fault. And, you know, it happens to be a very popular card. It doesn’t seem logical that the card could be the problem. But you’ve seen crashes in some drivers for the next version of this NIC, so you hold that it could be the card or driver. But … it shouldn’t be.

OK. Resolve this by swapping out the card. Right? Well … not quite. Now the new NIC’s NAPI function is causing a different version of this. This is a downrev NIC compared with the ones in the other units in place with the same switch … it fixed the incompatibility between the NIC and switch … but … NAPI caused its own collection of softirq contexts to pile up. Ugh.

OK, there is a kernel bypass/offload capability you’ve used before with great success. Roll over to this. Wonderful! It largely solved the NAPI problem. Except you are still getting some skb_* GPFs. It’s a 10-year-old bug, if you believe Google. I don’t. Turn on more debugging symbols, and hope you get one you can trace.

And then today, a GPF, and a pointer to the culprit: an errant kernel module that likely doesn’t check its allocator correctly. But to get to this, you had to clear out the painful switch interaction, the screwy NAPI softirq contexts, and a few other annoying things.

This wasn’t really a small signal with lots of noise. It was simply too much signal: many signals, all mixed together. I now know what the problem is, and I think I know how to cure it. I’m working with the folks who make the module to debug it.
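
For anyone who wants to watch softirq activity on their own machine, here is a minimal sketch, assuming a Linux box where /proc/softirqs is readable. It is not the tooling used in the diagnosis above, and the function names are made up for illustration. It snapshots the counters and reports how much each one grew. A fast-growing NET_RX count is only a hint that the network receive path is busy, not proof of a pile-up.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs-style text into {softirq_name: total_across_cpus}."""
    totals = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # the first line is the CPU header row
        name, _, counts = line.partition(":")
        totals[name.strip()] = sum(int(c) for c in counts.split())
    return totals


def softirq_deltas(before, after):
    """Return how much each softirq counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def read_softirqs(path="/proc/softirqs"):
    """Read and parse the live counters (assumes Linux; path is the default)."""
    with open(path) as f:
        return parse_softirqs(f.read())
```

To use it, call read_softirqs() twice, a second or so apart, and pass both results to softirq_deltas(). Comparing the NET_RX and NET_TX deltas before and after a hardware or driver change gives a rough before/after picture.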