when failures stick out like a statistical sore thumb
By joe
Parts fail. Components fail. You have to operate assuming they will fail. A warranty is fundamentally a bet that parts will fail, and a willingness to place money (the price of the warranty) on that bet. Over time, with enough components, you get a feel for how often parts fail. You get historical data. When one subset of components has a high failure rate (e.g. Corsair SSDs), you know you can isolate the problem.

But what happens if you get a holistic failure? Say RAID cards, and disks, and power supplies, all on units you've burnt in for a while, so you know there shouldn't be any parts failures? It's hard to argue crappy parts when so many different subsystems have failures. It's easier to argue that something environmental is the problem. It simply fits the data much better. Otherwise you have to claim that subsystems X, Y, and Z all just happened to fail within the same time period. Which raises the question: what is common to all those subsystems? And if you see these failures not on a single unit, but on multiple units …
Yeah … having one of those now. It's obviously environmental. The end user has a problem, likely with the quality of their power source. Quite possibly a ground loop, or a floating AC neutral, or a large inductive load switching on and off on the same unfiltered line. Or a faulty UPS. Or a brownout. There are simply too many correlated failures across multiple units for us to believe that it is a coincidence. The coincidence explanation has gone from statistically improbable down to extremely unlikely. This isn't a bathtub curve analysis, this is a historical analysis. This one sticks out like a sore thumb.

Well, we could be wrong. If the end user were playing with the unit, say hot plugging PCIe cards, or hot plugging power supplies with the power cords still attached … or not grounding themselves before working on the units, or having a dirty/dusty environment, or … yeah, that would do it too. We are hoping that it's an environmental issue that is correctable.
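
To put rough numbers on the coincidence argument, here is a minimal Python sketch. Everything in it is an assumption for illustration: the annualized failure rates in `ASSUMED_AFR` are made up (not our historical data), each unit is simplified to one part per subsystem, failures are treated as independent with a constant hazard rate, and the function names are hypothetical. It estimates how likely it is that a RAID card, a disk, and a power supply all fail in one unit within a short window, and then how likely it is that several units out of a small fleet do so purely by chance.

```python
import math

# Hypothetical annualized failure rates (AFR) for burnt-in parts.
# Illustrative assumptions only, not measured data.
ASSUMED_AFR = {"raid_card": 0.01, "disk": 0.03, "power_supply": 0.02}


def p_fail_within(afr, days):
    """Probability that one part fails within `days`, assuming a constant hazard rate."""
    rate_per_day = -math.log(1.0 - afr) / 365.0
    return 1.0 - math.exp(-rate_per_day * days)


def p_all_subsystems_fail(afrs, days):
    """Probability that every listed subsystem in one unit fails in the window,
    assuming the failures are independent of each other."""
    p = 1.0
    for afr in afrs.values():
        p *= p_fail_within(afr, days)
    return p


def p_at_least_k_units(p_unit, n_units, k):
    """Binomial tail: probability that at least k of n independent units are hit."""
    return sum(
        math.comb(n_units, i) * p_unit**i * (1.0 - p_unit) ** (n_units - i)
        for i in range(k, n_units + 1)
    )


if __name__ == "__main__":
    window_days = 30   # assumed observation window
    fleet_size = 10    # assumed number of units at the site
    p_unit = p_all_subsystems_fail(ASSUMED_AFR, window_days)
    print(f"one unit, all three subsystems in {window_days} days: {p_unit:.3e}")
    print(f"at least 2 of {fleet_size} units: {p_at_least_k_units(p_unit, fleet_size, 2):.3e}")
```

Under independence the per-unit probability is a product of three small numbers, and requiring it on multiple units makes it smaller still. That is the point of the argument: when the observed failures are far more frequent than any such product allows, the independence assumption is what has to go, and a common cause such as bad power becomes the better fit.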