Topical #HPC project at the $dayjob
By joe
- 9 minutes read - 1739 words (caution: I get personal at the end of this; you can see my motivation for working on this)
I can’t talk about it in depth yet, though. I can talk in broad brush strokes for now.
Imagine for a moment that you have a combination of available high performance supercomputers, an urgent problem to be solved, and a collection of people, computing tools, and data. Imagine that you are one of many stages in this process, but that you are, for the moment, a bottleneck. You are gating others’ work.
Did I mention that this is an urgent problem, needing a solution? Yeah, I think so.
Ok.
My small portion of this is, at the moment, to take some of these tools and adapt them to the fire-breathing super that they are going to run on. To explore every opportunity to have these things hit warp speed. To paraphrase Seymour Cray, to turn the problem from a computational one into something else.
So you start with a code. It’s not a particularly HPC-like code; it doesn’t take a long time to run in most cases. A few minutes on a single CPU. But, as part of this process, you have to run many … many runs of this. And the code is not written with this in mind. To make matters more complex, the super is accustomed to runs that last a long time.
The challenge is to teach the code about the super. To help the super understand that it can be a throughput engine. And to help make many other portions of this just work the way you need them to.
Because this is an urgent problem. And while you are working on a small part of this problem, your part is a bottleneck.
Supercomputers are different sorts of beasts than clusters or cloud-based clusters. They are designed to push limits, to minimize time to completion. Code running on them has tremendous available resources in CPU, IO, and networking. To get good performance, you need to make sure the code and the system are well adapted to each other.
The first task is to measure the performance of the code with test cases, and to understand where it spends its time. There are tools for the latter, and they work quite well. The former is a fairly standard process, though if you are not running in an exclusive bare-metal environment, you will likely see additional noise from the system hypervisor.
Scaling testing has gone well: you ran an ensemble of sample runs, and you have approximate speedup curves. You see how to allocate resources to keep efficiency reasonably high, while getting better-than-single-core performance.
This code is statically compiled, so you need to rebuild it to add profiling capability. It turns out that this is not trivial, for a number of reasons. The code was built with Boost, a C++ library, and coded specifically against a 12-year-old version of that library. You can either build that old library, or forward port the code to the newer Boost. Starting with the old library, you discover quickly how much the C++ language and compilers have changed. You were able to get it to build (finally), but now it seems that a setting, triggered by the use of a later-model compiler, disabled threading in the library. And the code uses Boost threads.
Oh. Yay.
So … plan B, forward port.
Turns out, this one was the faster/easier one. Only a few specific fields changed names in some of the methods. I was able to get this done in about an hour, with appropriate googling … well … duckduckgo-ing… I switched search engines about a month ago, to see how well it does. Quite well, but that is a post for another time.
Now, let’s redo the computations. Same results, so I’m happy. Compile with a modern compiler, and add in the right options to retain symbols and save profiling data.
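To make “the right options” a bit more concrete, here is roughly the kind of rebuild I mean, driven from a tiny perl wrapper. This is only a sketch: the GCC-style flags, the source file name, and the boost link line are illustrative placeholders, not the actual build.

    #!/usr/bin/env perl
    # Sketch of a rebuild that keeps symbols and emits profiling data.
    # GCC-style toolchain assumed; 'solver.cpp' and the link line are placeholders.
    use strict;
    use warnings;

    my @cxxflags = qw(-O2 -g -pg -fno-omit-frame-pointer);   # symbols + gprof instrumentation
    system('g++', @cxxflags, '-o', 'solver_prof', 'solver.cpp',
           '-lboost_system', '-lboost_thread') == 0 or die "build failed: $?";

    # Run one test case, then read the flat profile that -pg writes to gmon.out.
    system('./solver_prof', 'testcase.in') == 0 or die "run failed: $?";
    system('gprof', './solver_prof', 'gmon.out') == 0 or die "gprof failed: $?";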
A fairly flat profile. And as we add more threads, the thing spends more time in library overhead.
I could talk about why pthreads are not a great way to parallelize here. Or I could simply note that we don’t have the time to pursue this.
Remember, urgent.
So we play the game with the cards we are dealt. Some minor code tweaks, some compilation-option tweaking. That took an originally 100-ish second run down to 9 seconds on a single node.
Now we explore supercomputer environment optimization. Our scheduler is configured for a different use case. And this is fine. But we need to adjust it for a few specific use cases. So the team works on that.
With the updated scheduler, now let’s submit runs. My enqueuing system is brute force, and naive. I figured I’d fix that part later.
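To give a feel for what “brute force and naive” means, it was the moral equivalent of the sketch below: one scheduler submission per run, in a loop. I’m assuming a Slurm-like scheduler for illustration; the script and input names are placeholders.

    #!/usr/bin/env perl
    # Naive fan-out: one sbatch call per input case.
    # Slurm-like scheduler assumed; 'run_one.sbatch' and the input glob are placeholders.
    use strict;
    use warnings;

    for my $case (glob 'inputs/case_*.in') {
        # Every iteration pays a full scheduler round trip -- simple, but slow and fragile.
        system('sbatch', 'run_one.sbatch', $case) == 0
            or warn "submit failed for $case: $?";
    }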
Using the scheduler to load balance, fire off 120 jobs. Ok performance, noticeably better than before. But the individual job times bug me. There is something off. Still, though, 450 seconds for 120 runs is easily a factor of nearly 10 over where it was before. This is good.
After getting a good night’s sleep, I hop back on. I walk through each stage. I watch the job submit. I watch it run and …
Why does it behave as if it were a single CPU run? I mean, seriously … we have all the right options on job launch …
Ok, so supers use a set of mechanisms to start programs. Most programs are assumed to be MPI-based, and thus there are tools specific to that case for running these processes on nodes. To get the best performance, these processes are very tightly controlled.
They do some things to control memory layout, NUMA node placement, and … CPU binding.
Yeah, I’d forgotten to control for this. Let me fix that.
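Roughly, the fix is to tell the scheduler how many cores each run actually owns, and to be explicit about binding when the binary is launched. Again, this is a sketch assuming a Slurm-like setup; the thread count and flags are illustrative, not the exact settings used.

    #!/usr/bin/env perl
    # Same naive loop, but the submission now declares how many cores each run owns.
    # Slurm-like scheduler assumed; the thread count and names are illustrative.
    use strict;
    use warnings;

    my $threads = 8;    # threads the code actually spins up
    for my $case (glob 'inputs/case_*.in') {
        system('sbatch', "--cpus-per-task=$threads", 'run_one.sbatch', $case) == 0
            or warn "submit failed for $case: $?";
    }
    # Inside run_one.sbatch the binary is launched with explicit binding,
    # e.g.   srun --cpu-bind=cores ./solver "$1"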
Do my 120 job run.
It finished in about 20 seconds.
Wait. Wut … huh?
I must have messed something else up. Ok, let’s look at the output.
Hmm. It looks like it worked. Let me try that again.
(now gentle reader, picture your intrepid author with a dumbfounded expression on his face as he tries his run again and again, amazed that this is now … much faster … and giving the same results as before)
Ok, we had added a JATO unit to the code via some of the optimizations before. This change added a SCRAMJet. This thing was screaming fast. The super was a fire breathing monster tearing through the work at ludicrous speed.
I could stop here. Our team was suitably impressed. We were able to push quite a bit of work through this system.
Remember, urgent problem. Needs supercomputing to help with solution. We have this. We’re done. Right?
Right?
Well, no. Something was bothering me about all of this. My naive mechanism for job submission had become the bottleneck, and I was worried about it. It was automated, but it was naive. And brute force. And slow.
We were already like 20x better than before these changes. I should be happy.
But … I wasn’t. Naive, brute force things tend to be fragile. They tend to be hard to control.
So I went back to some assumptions I had made, and did some thinking. There’s this great programmatic feature I could leverage which would make the scheduler submit its own jobs based on a template. This way, my submission of the template job happens quickly, and the runs themselves get enqueued by the scheduler.
This would require some work to pre-process a number of things. My first attempt at this pushed some of the pre-processing to the scheduler. Turns out that didn’t work well.
Fine. Build more logic into the pre-processor so that all that happens with the template job is setting up variables and running. I know how to do this. I did this 21 years ago with another code named SGI GenomeCluster. Same basic process.
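The shape of that final setup, sketched below. I’m deliberately vague about the actual scheduler; a Slurm-style job array is the closest public analogue to the feature I’m describing, and every name here is a placeholder.

    #!/usr/bin/env perl
    # One submission, many scheduler-managed runs: the template job pulls its own
    # arguments from a manifest line keyed by the array index.
    # Slurm-style array shown as the analogue; every name here is a placeholder.
    use strict;
    use warnings;

    my $n = 120;    # number of runs in this batch (illustrative)

    open my $fh, '>', 'template.sbatch' or die "cannot write template: $!";
    print $fh <<~'EOF';
        #!/bin/bash
        #SBATCH --cpus-per-task=8
        # Each manifest line holds the arguments for one run.
        ARGS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest.txt)
        srun --cpu-bind=cores ./solver $ARGS
        EOF
    close $fh;

    # A single submission; the scheduler enqueues the individual runs itself.
    system('sbatch', "--array=1-$n", 'template.sbatch') == 0
        or die "array submit failed: $?";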
So, I worked on this. I got the pre-processing working beautifully. I find sometimes I can’t think through algorithms without writing them down so I can see the mathematical structure. There was a specific functional mapping I needed to make work for this. Fairly simple transformation.
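In spirit, the mapping looks something like this: expand the parameter combinations into a manifest, one line per run, so that array task k just reads line k. The sample names and sweep values below are made up for illustration.

    #!/usr/bin/env perl
    # Preprocessing sketch: turn a parameter sweep into a manifest, one line per run.
    # Sample inputs and sweep values are made-up placeholders.
    use strict;
    use warnings;

    my @samples = glob 'inputs/sample_*.in';
    my @params  = (0.1, 0.5, 1.0, 2.0);    # illustrative sweep values

    open my $out, '>', 'manifest.txt' or die "cannot write manifest: $!";
    my $k = 0;
    for my $s (@samples) {
        for my $p (@params) {
            $k++;   # line number doubles as the array task id
            print $out join("\t", $s, $p), "\n";
        }
    }
    close $out;
    print "wrote $k work items\n";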
The preprocessing step took about 5 seconds total for the full run (many tens of thousands of jobs; #perl is amazing as a data processing engine). The job submission after that took about 2 seconds. And then I watched.
This was the functional equivalent of attaching a warp drive to the system. Something that took the better part of a day to run tens of thousands of jobs previously … was complete in something over 8 minutes. In the broader scheme of things, we removed a major aspect of the computational bottleneck.
I’ve been telling people for decades that our goals in HPC should always be to move bottlenecks to be outside of our domain. Seymour Cray was one of the first to say that a supercomputer was a machine to turn a computing problem into an IO problem.
Using such a system, we did turn a computing problem into an IO problem of sorts.
Because this is urgent. All hands on deck. This is important work.
And I’ve loved every moment of working on this.
In January we lost my mother-in-law Ofelia to an acute leukemia. It was horrible in all aspects: a very rapid onset from the Thanksgiving time frame until terminal decline at the end of last year. I was ill over the new year, so I didn’t get to see her until the 7th of January. The next time was at the service. This hit my family hard. A terrible, rapid-onset disease takes a toll on more than just those who suffer from it. It takes a toll on families and loved ones.
Back to this work. Because it is urgent. I think I mentioned this a few times.
My new boss at $dayjob asked me in late January what I wanted to do at the newly absorbed company. What I wanted to say was “I want to help researchers attack cancer. Relentlessly, fearlessly, with all the computational power that I can help muster.” That’s what I wanted to say. I didn’t say that.
I still would like to do that someday.
But this need is more urgent. And I can help.
And I think Ofelia would be proud of what we’ve done.