I thought there might be some interest in this, given some of the posts we did last year. If I hit article length (I suspect I will), then I’ll submit it for publication.
As a teaser, the baseline version of rzf with arguments -l 1000000000 -n 2 takes 34.83s on my laptop CPU, while the SSE2 version of the same code takes 7.62s, a speedup of roughly 4.6x. As this is a CUDA-enabled laptop, I intend to get a CUDA version going as well shortly.
[Update] I would be remiss in not noting that the Nehalem processor in our lab gets the code done in 3.3s, while an old Opteron 275 unit takes 8.39s.
You get the idea.