SuSE has been resisting me running it diskless. Actively resisting. The end result was that I had to build a custom kernel (we are using/supporting 18.104.22.168 right now), making sure that NFS and networking were built in rather than built as modules. What I learned was that even if you think you have built everything in, you can still leave important little things out. And Murphy’s law dictates that the things left out are the important ones.
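One way to catch the "left something out" problem before rebooting is to scan the kernel .config for options that ended up as modules or unset. The following is only a rough sketch in Python, not what I actually ran. The option list is my assumption of the usual NFS-root minimum (CONFIG_INET, CONFIG_IP_PNP, CONFIG_IP_PNP_DHCP, CONFIG_NFS_FS, CONFIG_ROOT_NFS). It does not include the network card driver, which is hardware specific and has to be added by hand. The function names are made up for illustration.

```python
import re

# Assumed minimum set for booting with root on NFS. The NIC driver option is
# hardware specific and is deliberately NOT listed here; add yours by hand.
REQUIRED_BUILTIN = [
    "CONFIG_INET",
    "CONFIG_IP_PNP",
    "CONFIG_IP_PNP_DHCP",
    "CONFIG_NFS_FS",
    "CONFIG_ROOT_NFS",
]

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kernel_config(text):
    """Map each option in a .config file to its value ('y', 'm', 'n', ...)."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def not_builtin(text, required=REQUIRED_BUILTIN):
    """Return the required options that are not '=y', with their actual state."""
    options = parse_kernel_config(text)
    return {
        name: options.get(name, "missing")
        for name in required
        if options.get(name) != "y"
    }
```

Running not_builtin(open(".config").read()) in the kernel source tree returns an empty dict when every listed option is built in. Anything it returns is a candidate for the kind of morning I describe next.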
I spent the morning fighting with a driver that said it was happy and then disabled itself. Finally, after I added in nearby drivers, it seemed to take. I don’t have a good understanding of why.
Also, I inadvertently built InfiniBand directly into the kernel, rather than using OFED, so I had to back that out. iSER/iSCSI are built in. NFS patches, the whole nine yards. A 4-processor node with 16 GB RAM goes from power off to fully booted with all services in under 45 seconds (including InfiniBand, queuing, NFS, …). I am not sure where some of that delay comes from.
Also, I saw that some of the drivers do crash. I think I see why. I also question the wisdom of including nscd in any distribution. I have never seen it do anything of value, and it has caused pain.
As a test, I ran our PTB test, and sure enough, it ran across all nodes via InfiniBand. It was neat to set up IPoIB and ping across the IB fabric. The MPI stacks will use native IB or DAPL, but it is still nice to have IPoIB, at least as a diagnostic.
With SuSE now functional, we can diskless boot SuSE, Ubuntu, and Red Hat on these nodes. We are looking into PXE booting Windows XP (though with 4 cores and 16 GB RAM, I am not sure this would work with XP … maybe XP 64, or Server 2003 x64).
Atop this machine, the DragonFly is landing. Hopefully more news later.