Posts
SSD failure issue explained
I think I understand most of it. The OEM went dark on communicating with us. This is unfortunate. I really wish they hadn’t. Suffice it to say we are replacing all of their units that we have in the field. We have documentation up on our documentation site on how to do the swap out. It appears to be a bad batch. I am guessing (until we learn otherwise) that they got some bad silicon at some point.
9/11 memoriam
Not an HPC topic, but one that all Americans can reflect upon … their thoughts, their experiences … and to resolve not to let this happen ever again, in any form. Never again. I had left SGI, and was working for a smaller engineering software company. I had lined up a bunch of interviews for an open position we had. My buddy Al was flying out from NY with his team to visit someone I know in Ann Arbor, and I was going to try to grab them for lunch or dinner.
Oracle and Netapp bury the ZFS patent hatchet
Color me skeptical. I don’t think there has been an actual resolution of the core issues, just an agreement to abstain from legal hostilities. See this article for some information. ZFS will continue as a patent-encumbered software stack, and since no resolution exists between Oracle and Netapp, Netapp could, potentially, press its claims against any other customer/user of ZFS. This isn’t a good thing for anyone using or contemplating using ZFS from any source other than Oracle.
... and the day job desktop disk blowed up ...
Well, the OS drive did anyway. The home directory data is happily living on the RAID unit. Curiously, we just got our BackupPC based backup unit set up, and it was backing this unit up … Oh well. No great loss, apart from setup time for the new OS drive(s). Will do a software RAID 1 this time. And likely make it an SSD pair.
Ceph updates
rbd is in testing. Have a look at the link, but here are some of the highlights:
* network block device backed by objects in the Ceph distributed object store (rados)
* thinly provisioned
* image resizing
* image export/import/copy/rename
* read-only snapshots
* revert to snapshot
* Linux and qemu/kvm clients
We are doing something like this now, to a degree, with a mashup of tools in our target.
Day job has a "Cash for Clunkers" program up
For those who don’t get the reference, “Cash for Clunkers” is a colloquialism meaning a hardware trade-in program for old gear. PR is here, and a direct link to the site itself is here. Basically, we’ll take old [HPC] storage gear and give you a discount in exchange. There are limitations on which old gear qualifies, and it must be operational and in working order, etc. Also, you are responsible for shipping costs.
We've come a long way in 13 years ...
Have a look at today’s Google home page, and you’ll see the 25th anniversary of buckyballs, aka fullerenes, which are particular structures made out of carbon. These fullerenes are very much related to graphite (pencil “lead”), and have some very interesting physics and chemistry of their own. They were discovered when I was in my sophomore/junior years as an undergraduate. Not feeling old. Nosiree. This isn’t what the post is about, and yes, there is a huge connection to HPC.
Passing of torches in the industry
One constant in the world is change. You can fear it, or learn to embrace it. This is part of all markets, HPC and many others. This morning we woke to news that Rich Brueckner bought InsideHPC. Rich is a good reporter and writer, has a marketing firm, and has been working in HPC for a while. We spent time as colleagues at SGI/Cray (yeah, it’s been a while). Rich now owns one of the brightest brands in HPC news and information.
Workaround for the SSD RAID1 dual drive failure mode
At least it can keep things operating while we get parts out. Shipped a number today, placed another order for the new vendor’s drives. I can confirm that heat is an issue with SSDs.
As root, on the unit, decide where you are going to place an image, then

    dd if=/dev/zero of=/path/to/loopback/raid.img bs=1 count=1 \
       seek=32G
    losetup /dev/loop0 /path/to/loopback/raid.img
    mdadm --grow /dev/md0 -n3
    mdadm /dev/md0 --add /dev/loop0

This will copy the OS to a file, and we can (later if need be) recover from problems.
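The dd invocation in the workaround writes a single byte after seeking far into the file, which on most Linux filesystems produces a sparse image rather than a fully allocated one. Here is a minimal sketch of that behavior, assuming GNU coreutils (dd, stat, du) and a filesystem that supports sparse files. The temporary path and the smaller 64M seek are stand-ins chosen for illustration, not the values from the workaround, and the script does not touch any md device.

```shell
#!/bin/sh
# Illustration only: scratch file instead of /path/to/loopback/raid.img
img=$(mktemp)

# Same pattern as the workaround, with a smaller seek (64M rather than 32G):
# skip 64 MiB worth of 1-byte blocks, then write exactly one zero byte.
dd if=/dev/zero of="$img" bs=1 count=1 seek=64M 2>/dev/null

# Apparent size in bytes (GNU stat); this is seek + 1 byte.
apparent=$(stat -c %s "$img")

# Space actually allocated on disk, in KiB; small if the file is sparse.
used_kb=$(du -k "$img" | cut -f1)

echo "apparent bytes: $apparent"
echo "allocated KiB:  $used_kb"

rm -f "$img"
```

With seek=32G the same pattern gives an image whose apparent size is 32 GiB plus one byte, and blocks are only allocated as the RAID resync writes into the loop device.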
Never seen anything like this before. Ever.
[update] I am starting to think some of these things are heat issues. Not generated heat, but ambient heat. I took a pair of SSDs from our lab (running quite warm, 5 ton AC unit ready to be hooked up) which were giving minor read errors (different vendor), and they operated flawlessly in my basement lab (quite cold). Hmmmm … I see some SSDs being sacrificed in the near future. I am starting to wonder if any of them have actually been tested in higher heat environments.