Quite a few people asked offline for a Fortran version of the tests I described in the last post on this subject. So here is the basic code and its performance. What is interesting is that, for the same inputs as the C code, and with no heroic loop unrolling/unwinding, the Fortran is about 3x faster than the C. Well, except for ifort. But we will get into that more later on.
program rzf
  integer i,j,k,milestone,n,rc,inf
  double precision total,p_sum,sum,delta_t,pi,pi_exact

  n = 0
  sum = 0.0d0
  pi_exact = 4.0d0*atan(1.0d0)

  print *,"Enter the order of the Riemann zeta function"
  read *,n
  print *,"Enter the limit of the summation (effectively infinity)"
  read *,inf

  do k=inf-1,1,-1
    sum = sum + dble(k)**(-n)
  enddo

  print *,"zeta(",n,") = ",sum
  if (n .eq. 2) then
    pi = sqrt(sum*6.0d0)
    print *,"pi = ",pi
    print *,"error in pi = ",pi_exact-pi
    print *,"relative error in pi = ",(pi_exact-pi)/pi_exact
  endif
end
So I created a file named “in” containing these two lines.
These are the same inputs I used with the C code.
With basic -O optimization
This is about 3x the performance of the C version (for gfortran 4.2.1 and pgf90 7.1-1), and a little better still with ifort 10.0.023.
Hmm. Ok, let's turn on maximum optimization and see what happens.
Ok, I won’t dig into the generated assembly language that much, but I am willing to bet that we would see what we saw before: SSE instructions operating on only half of each register (scalar rather than packed operation).
Before we do that though, note that I exploited Fortran's intrinsic power operator with an integer exponent. My guess is that the compiler replaced the function call to pow with a simple multiply loop, much like my hand-optimized C version from the previous example.
Well, it is a little more complex than that. Notice how I used k**(-n), or more precisely, dble(k)**(-n). This effectively avoids a division in the loop body. Divisions are expensive. And you might note that I used one in the C code. Ok, let's change that line back to
sum = sum + 1.0d0/dble(k)**(n)
What happens to the optimized performance?
Wow. A ~10% performance improvement for letting a “more expensive” operation stay in.
Ok, what about loop unrolling? What happens if I do the same thing I did with the C code? Not a heroic effort, but a reasonable one?
Here is what the code looks like now:
program rzf
  integer i,j,k,milestone,n,rc,inf
  double precision total,sum,delta_t,pi,pi_exact
  double precision l(4),p_sum(4),d_pow(4),unroll_array(4)
  integer start_index,end_index,unroll

  n = 0
  sum = 0.0d0
  pi_exact = 4.0d0*atan(1.0d0)
  unroll = 4

  print *,"Enter the order of the Riemann zeta function"
  read *,n
  print *,"Enter the limit of the summation (effectively infinity)"
  read *,inf

  start_index = inf - mod(inf,unroll)
  end_index = unroll
  print *," start index = ",start_index
  print *," end index = ",end_index

  do i=1,4
    p_sum(i) = 0.0d0
    unroll_array(i) = dble(unroll)
  enddo

  do k=inf-1,start_index+1,-1
    sum = sum + dble(k)**(-n)
  enddo

  do i=1,4
    l(i) = dble(start_index-i+1)
  enddo

  do k=start_index,end_index+1,-unroll
    do i=1,4
      d_pow(i) = l(i)
    enddo
    do j=n,2,-1
      do i=1,4
        d_pow(i) = d_pow(i) * l(i)
      enddo
    enddo
    do i=1,4
      p_sum(i) = p_sum(i) + 1.0d0 / d_pow(i)
    enddo
    do i=1,4
      l(i) = l(i) - unroll_array(i)
    enddo
  enddo

  do i=1,4
    sum = sum + p_sum(i)
  enddo

  do k=end_index,1,-1
    sum = sum + dble(k)**(-n)
  enddo

  print *,"zeta(",n,") = ",sum
  if (n .eq. 2) then
    pi = sqrt(sum*6.0d0)
    print *,"pi = ",pi
    print *,"error in pi = ",pi_exact-pi
    print *,"relative error in pi = ",(pi_exact-pi)/pi_exact
  endif
end
And this is the performance:
Ok, this is showing that additional (possibly heroic) efforts at non-algorithmic optimization in Fortran (tuning that does not change the underlying algorithm) are going to give diminishing returns at best. It also shows that most of the Fortran compilers do a pretty good job of optimizing simple loops.
It also shows that you have to work fairly hard (heroic efforts) to get the C compilers we tried to produce well-optimized code.
It is not wise to draw general conclusions about relative language performance from such a simple test. It does suggest, though, that some serious thought should go into the choice of implementation language: dismissing one simply because it is 52 years old and no longer the “hot” thing may have a (potentially significant) negative performance impact.
What this does show me, though, is that none of the compilers make particularly efficient use of the system's resources. Why this is, I am not sure. Maybe a compiler designer/builder can correct my understanding.
SSE is a remarkable development, albeit not as well designed as one might hope. Compilers, while able to emit SSE code, don’t seem to do an exceptionally good job of mapping problems onto the SIMD (“vector”) registers.
As we increase the number of cores per physical processor, it would be good to make sure that we are using them more efficiently. This is unfortunately a simple matter of programming (SMOP). That is, it is a problem that is pretty nearly unsolvable, which is what SMOP really means. Sort of like the joke about the professor and the student, where the student asks for an explanation of an example that the professor called “trivial”. Hours into the effort of understanding the work, the student says, “yes, it is trivial”.
Compiler code generation efficiency is critical to good performance, but so is an understanding of the underlying processor architecture, and how to map code into it.
If I have time, I will try to rewrite these with SSE macros for C/Fortran. Also, my nVidia 8800 GTX came in, and it is meant for CUDA work. I would like to see if we can do something there. Yeah, I know it is single precision… Maybe on the Cell as well. Would be fun.