DragonFly milestone
By joe
A long time ago, on a computer not so far away, we built a program called “SICE”. Yeah, I am not known for naming things well. SICE’s entire purpose in life was to be a user-centric interface to HPC systems. When users wanted to run jobs, they filled out a web form that described the job, and off it went. This was not like other things on the market: it was not designed to sell queuing systems or other bits. SICE was all about making people’s lives easier in using their systems. Odd concept, that.
Of course, the first version wasn’t that good. It was basically a fancy CGI script, with some additional bits to drive a script which drove a program. Each new program required a new script and a new web page. To say it was unwieldy would be, to be quite honest, an understatement.

Second version. Oddly enough, I called it SICE v1. One of the day job’s larger customers said “hey cool” and “can you integrate these programs into it”. Which we did. There is a long story behind getting it up and running, and an object lesson in not believing people when they say “oh yes, it is a good input deck”. I have taken to saying “prove it”. My skepticism, as it turns out, is quite well founded. Had I had that skepticism in place earlier, I could have saved months of effort (I kid you not).

Said customer wanted support, though as with many things, the money side of that never showed up. We did provide at least baseline support, and we learned quite a bit about what was wrong with version 2 (or v1) in the process. A script per code is horrible: each script is different, and trying to wrap everything to fit our idea of a framework turned out to be a bad design decision for a number of reasons. Most folks are still doing exactly this. However, some of the technology we had developed (starting in 2002!) which made its way into this system turned out to be spectacularly good.

So we started planning for version 3, which I started calling SICE v2. I finally tired of that name and called it DragonFly. This is not DragonFly BSD; there shouldn’t be any confusion whatsoever.

We started working on DragonFly in early 2006. Set some things up, tore them down. During this time we made a few technological shifts that helped make the coding a great deal saner and easier. We also decided in this interval to make this one dual licensed (the previous version was open source). This planning and testing went on for 1.5 years, until, finally, said customer indicated that they were interested. Nothing focuses you like the need to deliver product, so we accelerated the coding.

The major design goals: adding new codes should be simple. Very simple. Usage is web based. It should run everywhere; the acid test is whether I can submit jobs from my cell phone. There is so much more to this; this is just the tip of the iceberg. More soon, I promise.

But the milestone: DragonFly generated its first working job tonight. It is not yet submitted into a queue, as we are waiting on some other development to complete for that, probably around Thursday.
landman@dualcore:~/build_job$ rm -f batch* ; ./build_job.pl --job=job_entry.xml --program=ptb.xml --debug
D[27790]: os = 'linux'
D[27790]: directory = '/home/landman/build_job'
D[27790]: opening temp file in directory ....
D[27790]: project='testing1'
D[27790]: keeping default environment
D[27790]: delete environmental variable 'PGI_BITS' [current value = '64']
D[27790]: add environmental variable 'alpha' [current value = '']
D[27790]: add environmental variable 'beta' [current value = '']
D[27790]: add environmental variable 'gamma' [current value = '3']
D[27790]: param substitution: Parameter '_NCPU_' = '5'
D[27790]: Using MPI: stack = 'openmpi123'
D[27790]: - MPI mpibin = '/apps/openmpi123/bin'
D[27790]: - MPI mpirun = 'mpirun'
D[27790]: - MPI mpiargs= '-np _NCPU_'
D[27790]: - MPI runcmd = '/apps/openmpi123/bin/mpirun -np 5 '
D[27790]: executable = '/home/landman/bin/ptb.exe'
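To give a rough idea of what drives a run like this, here is a minimal sketch of the kind of program metadata a descriptor along the lines of ptb.xml could carry. The element names and structure here are illustrative assumptions, not the actual DragonFly schema; the values are simply the ones visible in the debug output above.

<!-- illustrative sketch only: element names are assumed,
     values are taken from the debug output above -->
<program name="ptb">
  <executable>/home/landman/bin/ptb.exe</executable>
  <environment default="keep">
    <delete name="PGI_BITS"/>
    <add name="alpha" value=""/>
    <add name="beta"  value=""/>
    <add name="gamma" value="3"/>
  </environment>
  <mpi stack="openmpi123">
    <mpibin>/apps/openmpi123/bin</mpibin>
    <mpirun>mpirun</mpirun>
    <mpiargs>-np _NCPU_</mpiargs>
  </mpi>
</program>

The _NCPU_ placeholder would presumably be filled in from the job description side (it was 5 in this run), which is how one descriptor can drive runs of different sizes without editing anything per job.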
And yes, the resulting script did in fact run …
D[tid=0]: arg[0] = /home/landman/bin/ptb.exe
D[tid=0]: arg[1] = -n
D[tid=0]: n found to be = 1000
D[tid=0]: should be 1000
D[tid=0]: arg[2] = 1000
D[tid=1]: arg[0] = /home/landman/bin/ptb.exe
D[tid=1]: arg[1] = -n
D[tid=1]: n found to be = 1000
D[tid=1]: should be 1000
D[tid=1]: arg[2] = 1000
D[tid=2]: arg[0] = /home/landman/bin/ptb.exe
D[tid=2]: arg[1] = -n
D[tid=2]: n found to be = 1000
D[tid=2]: should be 1000
D[tid=2]: arg[2] = 1000
D[tid=4]: arg[0] = /home/landman/bin/ptb.exe
D[tid=4]: arg[1] = -n
D[tid=4]: n found to be = 1000
D[tid=4]: should be 1000
D[tid=4]: arg[2] = 1000
D[tid=3]: arg[0] = /home/landman/bin/ptb.exe
D[tid=3]: arg[1] = -n
D[tid=3]: n found to be = 1000
D[tid=3]: should be 1000
D[tid=3]: arg[2] = 1000
0 [tock: tid = 4 on dualcore]: next_tid= 1
0 [tick]: tag = 0 next_tid = 1
0 [tock: tid = 0 on dualcore]: next_tid= 1
0 [The Buck tid = 0, machine= dualcore] I have _the_buck_, passing to tid = 1
0 [tock: tid = 1 on dualcore]: next_tid= 1
0 [Receiver tid = 1, machine = dualcore] waiting for the _the_buck_
...
989 [Receiver tid = 0, machine = dualcore] recieved _the_buck_
990 [tick]: tag = 0 next_tid = 991
990 [tock: tid = 0 on dualcore]: next_tid= 1
990 [The Buck tid = 0, machine= dualcore] I have _the_buck_, passing to tid = 1
991 [tick]: tag = 0 next_tid = 992
991 [tock: tid = 0 on dualcore]: next_tid= 2
992 [tick]: tag = 0 next_tid = 993
992 [tock: tid = 0 on dualcore]: next_tid= 3
993 [tick]: tag = 0 next_tid = 994
993 [tock: tid = 0 on dualcore]: next_tid= 4
994 [tick]: tag = 0 next_tid = 995
994 [tock: tid = 0 on dualcore]: next_tid= 0
994 [Receiver tid = 0, machine = dualcore] waiting for the _the_buck_
994 [Receiver tid = 0, machine = dualcore] recieved _the_buck_
995 [tick]: tag = 0 next_tid = 996
995 [tock: tid = 0 on dualcore]: next_tid= 1
995 [The Buck tid = 0, machine= dualcore] I have _the_buck_, passing to tid = 1
996 [tick]: tag = 0 next_tid = 997
996 [tock: tid = 0 on dualcore]: next_tid= 2
997 [tick]: tag = 0 next_tid = 998
997 [tock: tid = 0 on dualcore]: next_tid= 3
998 [tick]: tag = 0 next_tid = 999
998 [tock: tid = 0 on dualcore]: next_tid= 4
999 [tick]: tag = 0 next_tid = 1000
999 [tock: tid = 0 on dualcore]: next_tid= 0
999 [Receiver tid = 0, machine = dualcore] waiting for the _the_buck_
999 [Receiver tid = 0, machine = dualcore] recieved _the_buck_
Last: The Buck = 1.000 has stopped here ... @ tid = 0, machine = dualcore
As a sanity check, I copied the metadata to a different machine with a different MPI path and installation, altered the metadata to reflect this, and ran the same job after the program generated the script. It ran correctly in both cases. The code above is my “Pass The Buck” MPI code. It ran with OpenMPI 1.2.3 on one platform and OpenMPI 1.2.4 on the other, installed into different paths, … This is a good thing.

There are many exciting things about this, not the least of which is that it is designed to be cross platform (Windows guys, are you listening?). As I said, more later. This is an important milestone for DragonFly, and one of the critical elements of its emergence into the world.
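For the curious: “Pass The Buck” is just a token ring. One rank holds “the buck”, hands it to the next rank, and the buck makes its way around the ring, which is roughly what the tick/tock trace above is showing. Below is a minimal sketch of that pattern in MPI C. It is not the actual ptb.exe source; the argument handling and output are simplified assumptions, and it expects at least two ranks.

/* pass_the_buck.c: minimal sketch of a token-ring "pass the buck" pattern.
 * NOT the actual ptb.exe source; just an illustration of the idea.
 * Build: mpicc -o ptb_sketch pass_the_buck.c
 * Run:   mpirun -np 5 ./ptb_sketch -n 1000
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, laps = 1000;   /* default matches the -n 1000 run above */
    double buck = 1.0;             /* the token being passed around */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (argc == 3 && strcmp(argv[1], "-n") == 0)
        laps = atoi(argv[2]);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* step i: the rank currently holding the buck sends it to the next
       rank in the ring; that rank receives it and holds it for step i+1 */
    for (int i = 0; i < laps; i++) {
        int holder = i % size;
        if (rank == holder) {
            printf("%d [The Buck tid = %d] passing to tid = %d\n", i, rank, next);
            MPI_Send(&buck, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
        } else if (rank == (holder + 1) % size) {
            MPI_Recv(&buck, 1, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    /* after 'laps' hand-offs the buck sits on rank laps % size */
    if (rank == laps % size)
        printf("Last: The Buck = %.3f has stopped here @ tid = %d\n", buck, rank);

    MPI_Finalize();
    return 0;
}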