Compiler quality
By joe
One of the comments to the previous post got me thinking about testing code on the same machine under various compilers. It is fairly well known that the Intel compilers emit code that doesn't select reasonable operational paths on AMD processors, which usually results in identical binaries having vast performance differences on very similar platforms. This doesn't make a great deal of sense from a technological point of view ... you want to test for SSE* support, not use processor strings to select code paths. Keying paths off the processor string is self defeating anyway: a path that is deselected on your own CPUs by the string check stays deselected even after a microcode upgrade enables or fixes the feature. Hopefully Intel will resolve that at some point.

Ok, enough of that for the moment. The real purpose of this post is to talk about some very surprising performance differences among 3 compilers running identical code on the same machine(s). This code is not hard, and frankly should not be giving any of the compilers any problem.
The three compilers are Intel icc v10.0 (if you bug me for the exact release version I can look it up), Portland Group pgcc 7.1-1 (I know 7.1-3 is out, and I have pulled it down), and gcc 4.1.3.
The platforms are an AMD Opteron 275 with 4 GB RAM, and our Pegasus Intel workstation with a Xeon 5310 and 8 GB RAM. All OSes are Ubuntu 7.10 x86_64, and all binaries are 64 bit. For those who don't know why: 64 bit code generally runs faster on a 64 bit OS than 32 bit code built from the same source. This has been our experience over the last several years (about 4 years now). We will show this in a minute.
The mission is simple, not to find out which platform is faster, but which compiler generates “better” code for simple problems.
Here is our loop. It consumes most of the time in the program.
```
for(k=inf-1;k>=1;k--)
 {
  sum += 1.0/pow(k,(double)n);
 }
```
This is our Riemann Zeta Function code, which computes the function for a particular cutoff (using the -l value flag), and the order of the function (using the -n value flag).
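For reference, the program around that loop is nothing exotic. Here is a minimal standalone sketch of the same computation (my reconstruction, not the actual benchmark source, which also parses -n and -l and brackets the loop with ftime() calipers):

```
/* Minimal standalone sketch of the zeta benchmark's core: sum the series
 * sum_{k=1}^{inf-1} 1/k^n.  The real code parses -n and -l and times the
 * loop with ftime() calipers; this is just the computational kernel.
 * Build with something like: gcc -O3 zeta_sketch.c -o zeta_sketch -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    long   inf = 100000000L;   /* cutoff, the -l argument in the real code */
    int    n   = 2;            /* order,  the -n argument */
    double sum = 0.0;
    long   k;

    for (k = inf - 1; k >= 1; k--)
        sum += 1.0 / pow((double)k, (double)n);

    printf("zeta(%i) = %-.15f \n", n, sum);   /* approaches pi^2/6 for n=2 */
    return 0;
}
```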
First off: we do a sanity check to make sure all compilers generate working code. No optimization, just build, run, check the results.
All work. Makes me happy.
Ok, let's set some reasonable optimization, so that we can run the same binary on all systems.
With gcc, use “-O3 -I. -Bstatic”. With icc, use “-O3 -xW -I. -Bstatic”. With pgcc, use “-fastsse -tp x64 -I. -Bstatic”.
Then run the code (3 times and average) using “-n 2 -l 100000000” as my arguments. What do we see from the calipers?
Here is what I am so surprised about: gcc is generating somewhat more efficient loop code than icc. But what is more surprising is that pgcc trails so badly.
This is fairly simple code, and the Intel compiler indicates that it has vectorized it. Over the 10^8 iterations, this extra optimization appears to have cost about 1 second of run time, call it roughly another 10ns per iteration. If it really vectorized the loop, it should be packing two doubles into each SSE register and cutting the number of loop iterations (and loop-end checks) roughly in half.
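To make that concrete, here is a rough sketch of my own, in plain C (not anything these compilers emit), of what a genuinely 2-wide version of the loop would look like if written out by hand: two lanes per trip, half the loop-end checks, and a final merge of the partial sums.

```
/* Hand-written 2-wide sketch of the zeta loop (my illustration, not
 * compiler output): lanes k and k-1 are handled in each trip, so there
 * are half as many loop-end checks.  Assumes inf-1 is even; a real
 * version needs a small cleanup loop for the leftover term.
 */
double sum0 = 0.0, sum1 = 0.0;
for (k = inf - 1; k >= 2; k -= 2) {
    sum0 += 1.0 / pow((double)k,       (double)n);   /* lane 0: k   */
    sum1 += 1.0 / pow((double)(k - 1), (double)n);   /* lane 1: k-1 */
}
sum = sum0 + sum1;
```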
Well, wanting to dig into this a little further, I used the -S -dA -dp options in gcc to look at the assembler for this loop. It is pretty simple:
```
.L13:
        # basic block 9
        cvtsi2sd  %ebx, %xmm0        # 169 *floatsidf2_sse/1    [length = 4]
        movq      %rbp, 8(%rsp)      # 398 *movdi_1_rex64/4     [length = 5]
        movsd     8(%rsp), %xmm1     # 385 *movdf_integer/8     [length = 6]
        call      pow                # 170 *call_value_0_rex64  [length = 5]
        movsd     .LC9(%rip), %xmm1  # 386 *movdf_integer/8     [length = 8]
        subl      $1, %ebx           # 178 *addsi_2/1           [length = 3]
        divsd     %xmm0, %xmm1       # 174 *fop_df_1_sse        [length = 4]
        movapd    %xmm1, %xmm0       # 387 *movdf_integer/7     [length = 4]
        addsd     48(%rsp), %xmm0    # 175 *fop_df_comm_sse     [length = 6]
        movsd     %xmm0, 48(%rsp)    # 388 *movdf_integer/9     [length = 6]
        jne       .L13               # 179 *jcc_1               [length = 2]
```
The computational portion uses SSE as expected, the %ebx register is used as the loop variable “k” … and curiously enough there is a function call in the middle to the math library “pow” routine. I would have thought this would be inlined.
But basically, this is a straight/simple conversion of the loop innards to SSE without really vectorizing it. That is, the cvtsi2sd simply converts the value of k to floating point and stores it in one half (the “lower” half) of the SSE register xmm0. The rest of the SSE operations are done on this single value without exploiting the benefit of SSE (vectorization); only the registers are used.
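One small thing that jumps out: the int-to-double conversion does not need to happen every trip through the loop. A hand-carried double counter (my own sketch, not what gcc emits) removes the cvtsi2sd from the loop body entirely:

```
/* Sketch: carry the loop index as a double alongside the integer counter,
 * so no per-iteration int->double conversion (cvtsi2sd) is needed.
 */
double dk = (double)(inf - 1);
for (k = inf - 1; k >= 1; k--) {
    sum += 1.0 / pow(dk, (double)n);
    dk  -= 1.0;
}
```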
Hmmm. What does the Intel compiler do? Remember, it vectorized it … (note that the Intel compiler assembly output with “-S -fsource-asm” is nice to see … )
```
;;;   {
;;;     sum += 1.0/pow(k,(double)n);
        cvtsi2sd  %ebp, %xmm11                      #92.36
        movdqa    _2il0floatpacket.2(%rip), %xmm10  #90.5
        movl      %eax, %r15d                       #90.5
        andl      $1, %r15d                         #90.5
        negl      %r15d                             #90.5
        addl      %eax, %r15d                       #90.5
        lea       -2(%r13), %ecx                    #90.5
        lea       -3(%r13), %edx                    #90.5
        lea       -4(%r13), %esi                    #90.5
        movd      %eax, %xmm9                       #90.5
        movd      %edx, %xmm8                       #90.5
        punpckldq %xmm8, %xmm9                      #90.5
        xorl      %r14d, %r14d                      #90.5
        movl      %eax, %r12d                       #92.36
        movd      %ecx, %xmm1                       #90.5
        movd      %esi, %xmm0                       #90.5
        punpckldq %xmm0, %xmm1                      #90.5
        punpckldq %xmm1, %xmm9                      #90.5
        pxor      %xmm8, %xmm8                      #90.5
        unpcklpd  %xmm11, %xmm11                    #92.36
                                # LOE rbx ebp r12d r13d r14d r15d xmm8 xmm9 xmm10 xmm11
..B1.22:                        # Preds ..B1.47 ..B1.21
        cvtdq2pd  %xmm9, %xmm0                      #92.26
        movaps    %xmm11, %xmm1                     #92.36
        call      __svml_pow2                       #92.36
                                # LOE rbx ebp r12d r13d r14d r15d xmm0 xmm8 xmm9 xmm10 xmm11
..B1.47:                        # Preds ..B1.22
        movaps    _2il0floatpacket.3(%rip), %xmm1   #92.22
        divpd     %xmm0, %xmm1                      #92.22
        addpd     %xmm1, %xmm8                      #92.11
        paddd     %xmm10, %xmm9                     #92.11
        addl      $2, %r14d                         #90.5
        cmpl      %r15d, %r14d                      #90.5
        jb        ..B1.22       # Prob 99%          #90.5
                                # LOE rbx ebp r12d r13d r14d r15d xmm8 xmm9 xmm10 xmm11
```
Basically, it looks like this code sets up the SSE registers, using the pack instructions to move data into them. That is, it spends quite a bit of time on setup before ever getting to the computing portion. The computing portion is not quite the same as the gcc version, but it isn't terribly different either: basically the same structure, SSE registers for computing and a power function call (here the vectorized __svml_pow2). My guess is that the major difference is in the pow function. It is evaluated 10^8 times, so all you need is a 10ns difference per evaluation, and that will add 1 second to your loop time.
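If you want to put a number on that guess, a crude microbenchmark (a sketch of mine, nothing to do with the post's benchmark code) times the same kind of loop with a fixed exponent and backs out the per-iteration cost, which is dominated by the pow call; differences between compilers/libm mostly show up here.

```
/* Crude pow() cost microbenchmark sketch: time 10^8 trips through the same
 * loop body and report the average cost per iteration.  The running sum
 * keeps the compiler from optimizing the calls away.
 * Build with something like: gcc -O3 powbench.c -o powbench -lm
 */
#include <math.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    const long N = 100000000L;        /* 10^8 iterations, like the zeta loop */
    double sum = 0.0, secs;
    struct timeval t0, t1;
    long k;

    gettimeofday(&t0, NULL);
    for (k = N; k >= 1; k--)
        sum += 1.0 / pow((double)k, 2.0);
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1.0e-6;
    printf("sum = %.15f, total = %.3f s, per iteration = %.1f ns\n",
           sum, secs, secs * 1.0e9 / (double)N);
    return 0;
}
```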
Ok, what about the pgcc version? Adding “-Manno -Mkeep-asm -S” generates this code:
```
.LB1936:
lineno: 90
 for(k=inf-1;k>=1;k--)
  {
   sum += 1.0/pow(k,(double)n);
  }
 milestone++;
 rc=ftime(&caliper[milestone]);
 printf("zeta(%i) = %-.15f \n",n,sum);
 if (n == 2)
  {
        cvtsi2sd  %ebx, %xmm0
        movsd     48(%rsp), %xmm1
        .p2align  4,,1
        call      __fmth_i_dpowd
        movsd     56(%rsp), %xmm1
        subl      $1, %ebx
        testl     %ebx, %ebx
        divsd     %xmm0, %xmm1
        addsd     72(%rsp), %xmm1
        movsd     %xmm1, 72(%rsp)
        jg        .LB1936
```
which, again, doesn't look all that different from the other two. Again, my guess is that the __fmth_i_dpowd call is quite expensive, likely far more expensive than in the other cases. Since it is called 10^8 times, and this binary is about 8 seconds slower, all you need is an extra 80ns in each function call. 80ns is about 160 instruction cycles. This isn't, relatively speaking, much. Though it does suggest a library implementation quality issue more than a compiler issue.

In all cases, it does appear that the loops are not being "vectorized" in the sense that I would assume. That is, I don't see a vector of 2 values [k, k-1] being used as floating point for computing the loop innards, which would halve the number of iterations. In fact, setting this up should be fairly simple. We should be able to load inf, inf-1 into a register, use it in the computation, and then decrement by two. There is no need to convert an existing integer variable every iteration, so the cvtsi2sd may not be needed inside the loop at all. The real issue does appear to be the pow function, though. It isn't inlined; it is used as a function call.

Compilers need to be conservative by necessity. You don't. So I rewrote that interior loop. Remember, I want to replace that pow function. Note that my original usage of pow passed an integer value of k. That is, as I found out from the man page, a no-no ... a bad thing ... incorrect usage. Implementing pow(integer, double) as a simple multiply loop in the middle of the computation is fairly easy. So I did that. While I was at it, to help the compiler vectorize, I unrolled the loop by 4, and created setup and teardown loops before and after the unrolled loop. Technically the compilers *should* be doing this. Should be. Here is the modified inner loop:
```
d_n[0] = d_n[1] = d_n[2] = d_n[3] = (double)n;

/* pre-vector loop */
for(k=inf-1;k>start_index;k--)
 {
  sum += one/pow((double)k,(double)n);
 }

l[0]=(double)(inf-1 - 0);
l[1]=(double)(inf-1 - 1);
l[2]=(double)(inf-1 - 2);
l[3]=(double)(inf-1 - 3);
p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;

for(k=start_index;k>end_index;k-=unroll)
 {
  d_pow[0] = l[0];
  d_pow[1] = l[1];
  d_pow[2] = l[2];
  d_pow[3] = l[3];
  for (m=n;m>1;m--)
   {
    d_pow[0] *= l[0];
    d_pow[1] *= l[1];
    d_pow[2] *= l[2];
    d_pow[3] *= l[3];
   }
  p_sum[0] += one/d_pow[0];
  p_sum[1] += one/d_pow[1];
  p_sum[2] += one/d_pow[2];
  p_sum[3] += one/d_pow[3];
  l[0]-=four;
  l[1]-=four;
  l[2]-=four;
  l[3]-=four;
 }

sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;

/* post-vector loop */
for(k=end_index-1;k>=1;k--)
 {
  sum += one/pow((double)k,(double)n);
 }
```
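The snippet above leaves out how the constants and the loop bounds are set up; those live elsewhere in the source. For completeness, here is one plausible set of declarations, purely my own assumption about how the missing pieces fit together, chosen so that every term from 1 to inf-1 is counted exactly once:

```
/* Hypothetical glue for the unrolled loop above -- my guess at the pieces
 * not shown in the post, not the author's actual declarations.  With these
 * choices the pre-vector loop does no work, the unrolled loop covers terms
 * inf-1 down to end_index, and the post-vector loop mops up terms
 * end_index-1 down to 1.
 */
const double zero = 0.0, one = 1.0, four = 4.0;
const int    unroll = 4;
double l[4], p_sum[4], d_pow[4], d_n[4];
int    k, m;
int    start_index = inf - 1;                             /* unrolled section starts at the top   */
int    end_index   = inf - unroll * ((inf - 1) / unroll); /* leave < 4 terms for the tail loop    */
```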
Ok, redo the compilation, and what do we see:
I won’t include the gcc assembler output, other than to note that it does not appear to be doing the computation significantly differently than before … that is, it is still only using half of each SSE register. This code was fairly well set up for SSE, so this is surprising. It suggests that we left performance on the table, as it were. Looking at the pgcc assembler output, I see a similar thing: it is using just the lower half (the *sd instructions). The Intel compiler goes through and packs everything into the SSE registers and … in the loops … actually uses mulpd !!! Woo hoo! We are using SSE … for the inlined pow replacement … and parts of the main loop.

The issue appears to be the packing/unpacking. We want to do that as infrequently as possible (see the hand-written intrinsics sketch at the end of this post for what I mean). My fervent hope is that for SSE5 and beyond, Intel/AMD finally decide to make getting data into and out of the registers go faster, without needing to resort to packing/unpacking. In fact, a nice simple construct would be a set of address registers with an associated set of increment registers that could be used to load each portion of the SSE register. Make it easier and less costly to get data in and out of the SSE system. It just bugs me seeing so much performance left on the table.

That, and I have to admit amazement at the gcc compiler. In the 3.4 and earlier days, it wasn’t that good for HPC; Intel/PGI/PathScale regularly whipped it. Now, on a relatively simple test, it is generating good code. Not great code, but good code. What annoys me is that the compilers I (and others) pay for don’t seem to be doing better … or in these cases, even as well … as gcc. I have made the argument in the past that they have value in that they do better. I am not convinced right now that this is really the case. FWIW, I have seen this recently, in the last 3 months, with other codes I have worked on, where gcc was generating excellent code compared to the expensive compilers.

If anyone wants the test cases, let me know. Full C source, makefiles, etc. I would love to speak with the compiler folks so they can explain to me what I am doing wrong (if anything). I am not convinced I am, so they have somewhat of a harder task ahead of them.
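For the curious, here is roughly what "pack once, compute packed, unpack once" looks like when written by hand with SSE2 intrinsics. This is my own sketch under the same assumptions as the unrolled C above (it assumes the iteration count works out so no remainder handling is needed here), not output from any of these compilers; the point is simply that the data is packed into registers before the loop, the inner power loop uses packed multiplies (mulpd), and nothing is unpacked until the very end.

```
/* SSE2 intrinsics sketch of the unrolled inner loop: pack once before the
 * loop, compute with packed multiply/divide/add, unpack once at the end.
 */
#include <emmintrin.h>

double zeta_core_sse2(int inf, int n, int start_index, int end_index)
{
    __m128d l01   = _mm_set_pd((double)(inf - 2), (double)(inf - 1)); /* lanes: inf-1 (low), inf-2 (high) */
    __m128d l23   = _mm_set_pd((double)(inf - 4), (double)(inf - 3)); /* lanes: inf-3 (low), inf-4 (high) */
    __m128d psum0 = _mm_setzero_pd();
    __m128d psum1 = _mm_setzero_pd();
    const __m128d one  = _mm_set1_pd(1.0);
    const __m128d four = _mm_set1_pd(4.0);
    double tmp[4];
    int k, m;

    for (k = start_index; k > end_index; k -= 4) {
        __m128d p0 = l01, p1 = l23;
        for (m = n; m > 1; m--) {              /* l^n by repeated packed multiply (mulpd) */
            p0 = _mm_mul_pd(p0, l01);
            p1 = _mm_mul_pd(p1, l23);
        }
        psum0 = _mm_add_pd(psum0, _mm_div_pd(one, p0));
        psum1 = _mm_add_pd(psum1, _mm_div_pd(one, p1));
        l01 = _mm_sub_pd(l01, four);
        l23 = _mm_sub_pd(l23, four);
    }

    /* unpack and combine the four partial sums only once, at the end */
    _mm_storeu_pd(&tmp[0], psum0);
    _mm_storeu_pd(&tmp[2], psum1);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```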