# Playing with AVX

By **
joe
**

I **finally** took some time from a busy schedule to play with AVX. I took my trusty old rzf code (Riemann Zeta function) and rewrote the time expensive inner loop in AVX primatives hooked to my C code.
As a reminder, this code is a very simple sum reduction, and can be trivially parallelized (sum reduction). Vectorization isn’t as straightforward, and I found that compiler auto-vectorization doesn’t work well for it. So you have to hand vectorize, and think through the algorithm more carefully.
The baseline code inner loop looks like this
`for(k=inf-1;k>=1;k--) { sum += 1.0/pow(k,(double)n); }`

Remarkable how trivial this is. You can unroll it, play with a number of things, but you don’t start seeing serious performance until you vectorize it. Between standard -O3 level gcc optimization and the by-hand vectorization, we get a factor of 10 better performance.
So I thought, hmmm … why not transform the sse2 into avx and see how it runs?
The sse2 inner loop …
`__m128d __P_SUM = _mm_set_pd1(0.0); // __P_SUM[0 ... VLEN] = 0 __m128d __ONE = _mm_set_pd1(1.); // __ONE[0 ... VLEN] = 1 __m128d __DEC = _mm_set_pd1((double)VLEN); __m128d __L = _mm_load_pd(l); /* parallel sum loop */ for(k=start_index;k>end_index;k-=unroll) { __D_POW = __L; for (m=n;m>1;m--) { __D_POW = _mm_mul_pd(__D_POW, __L); } __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW)); __L = _mm_sub_pd(__L, __DEC); } _mm_store_pd(p_sum,__P_SUM); /* sum reduction loop */ for(k=0;k=1;k--) { sum += one/pow((double)k,(double)n); }`

There is a wind down loop at the end of the reduction to handle any final number of terms that are not integer multiples of the unroll amount (2 for sse2, 4 for avx).
Of course, AVX is the “new shiny hotness”. And we got 10x by vectorizing by hand. Might we get more with AVX?
My first pass is this
`sum = 0.0; /* pre-vector loop */ for(k=inf-1;k>start_index;k--) { sum += one/pow((double)k,(double)n); } // __P_SUM[0 ... VLEN] = 0 */ register __m256d __P_SUM asm ("ymm2") = _mm256_broadcast_sd(__zero); // __ONE[0 ... VLEN] = 1 register __m256d __ONE asm ("ymm3") = _mm256_broadcast_sd(__one); // __DEC[0 ... VLEN] = VLEN register __m256d __DEC asm ("ymm4") = _mm256_broadcast_sd(__inc); register __m256d __L asm ("ymm5") = _mm256_load_pd(l); register __m256d __INV asm ("ymm6") ; for(k=start_index;k>end_index;k-=2*unroll) { __D_POW = __L; for (m=n;m>1;m--) { __D_POW = _mm256_mul_pd(__D_POW, __L); } __INV = _mm256_div_pd(__ONE, __D_POW); __P_SUM = _mm256_add_pd(__P_SUM, __INV); __L = _mm256_sub_pd(__L, __DEC); __D_POW = __L; for (m=n;m>1;m--) { __D_POW = _mm256_mul_pd(__D_POW, __L); } __INV = _mm256_div_pd(__ONE, __D_POW); __P_SUM = _mm256_add_pd(__P_SUM, __INV); __L = _mm256_sub_pd(__L, __DEC); } _mm256_store_pd(p_sum,__P_SUM); /* sum reduction loop */ for(k=0;k=1;k--) { sum += one/pow((double)k,(double)n); }`

Performance gives a very minor regression relative to sse2. Same answers. And yes, I did unroll the core loop.
I think with some hand tweaking I could probably get more. Will look. But enjoy in the meanwhile.