
Vectorization Opportunities for Improved Performance with Intel® AVX-512

17 Apr 2017 · CPOL · 8 min read
Examples of How Intel® Compilers Can Vectorize and Speed Up Loops


Martyn Corden, Software Technical Consulting Engineer, Intel Corporation

The days of easy performance improvements from steadily increasing clock speeds are long gone. Moore’s Law instead provides increased parallelism through additional cores and more and wider SIMD registers. Besides the increased SIMD vector width of 512 bits, Intel® Advanced Vector Extensions 512 (Intel® AVX-512) includes new instructions that enable vectorization of some loops that could not be vectorized with older instruction sets and more efficient vectorization of others.

In this article, we’ll explain, with examples, how Intel® C/C++ and Fortran compilers can vectorize and speed up loops that compress or expand arrays and loops that fill histograms or perform scatters with potential address conflicts. We will also show how the compiler can convert strided loads or gathers into more efficient unit-stride SIMD loads for certain types of loops over arrays of structures. Finally, we’ll show how to recognize this conversion using new features of the optimization report in the Intel® Compiler version 17.0 from Intel® Parallel Studio XE 2017 (https://software.intel.com/en-us/intel-parallel-studio-xe). These optimizations can benefit a broad range of applications running not only on the Intel® Xeon Phi™ x200 processor, but also on future Intel® Xeon® processors that support the Intel AVX-512 instruction set.

A Vectorization Refresher

Figure 1 illustrates a simple, double-precision floating-point loop. In scalar mode, one instruction produces one result. After vectorization, a single Intel AVX-512 instruction can produce eight results, compared to four for Intel® AVX, or two for Intel® Streaming SIMD Extensions (Intel® SSE). The Intel Xeon Phi x200 processor supports Intel AVX-512 instructions for a wide variety of operations on 32- and 64-bit integer and floating-point data. This may be extended to 8- and 16-bit integers in future Intel Xeon processors.
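Figure 1 is not reproduced here, but a loop of the kind it describes might look like the following minimal sketch (the array names are illustrative, not taken from the figure):

```c
#include <stddef.h>

/* Scalar code issues one double-precision add per instruction.
   With Intel AVX-512, the compiler can vectorize this loop so that a
   single instruction on 512-bit zmm registers produces eight results. */
void vec_add(double *a, const double *b, const double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```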

Automatic vectorization is enabled by default in the Intel® compiler. However, one of the switches in Table 1 must be set in order to target the Intel AVX-512 instructions.

Linux* and OS X*     Windows*             Functionality
-xmic-avx512         /Qxmic-avx512        Intel® Xeon Phi™ x200 processor family only
-xcore-avx512        /Qxcore-avx512       Future Intel® Xeon® processor only
-xcommon-avx512      /Qxcommon-avx512     Intel® AVX-512 subset common to both; not a fat binary
-axmic-avx512        /Qaxmic-avx512       Fat binary; targets Intel® Xeon Phi™ x200 and also other Intel® Xeon® processors
-xhost               /Qxhost              Targets the compilation host

Table 1. Compiler switches to enable Intel® AVX-512 instructions

Compress and Expand Loops

The Fortran example in Figure 2A illustrates array compression. Only those elements of a large source array that satisfy a condition are copied into a smaller target array. The C example in Figure 2B illustrates the inverse operation (i.e., array expansion), where elements of a smaller source array are copied back into the larger, sparse array.

nb = 0
do ia=1, na               ! line 23
  if (a(ia) > 0.) then
    nb = nb + 1          ! dependency
    b(nb) = a(ia)         ! compress
  endif
enddo
Figure 2A. Array compression
int k = 0;
for (int i = 0; i < N; i++) {
    if (c[i] > 0) {
        c[i] = a[k++];   // expand
    }
}
// Cross-iteration dependency via k
Figure 2B. Array expansion

The conditional increment of the dense array index introduces a dependence between loop iterations. In the past, this would have prevented automatic vectorization. For example, when compiling the loop in Figure 2A targeting Intel® AVX2, the optimization report shows:

ifort -c -xcore-avx2 -qopt-report-file=stderr -qopt-report=3 compress.f90
…
LOOP BEGIN at compress.f90(23,3)
  remark #15344: loop was not vectorized: vector dependence prevents vectorization.  …                              
  remark #15346: vector dependence: assumed ANTI dependence between nb (25:7) and nb (25:7)
LOOP END

The C example in Figure 2B behaves in the same way. An OpenMP* SIMD directive cannot be used either, because the dependency would lead to incorrect results.

Intel AVX-512 overcomes this dependency with the new vcompress instructions that write selected elements of one SIMD register into contiguous elements of another, or to memory. Likewise, the vexpand instructions write contiguous elements from a source register or memory to selected (sparse) elements of a destination SIMD register. These new instructions allow the compiler to vectorize the compression example (Figure 2A) when Intel AVX-512 is enabled (Figure 3).
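The effect of a compress store can be modeled in scalar C. This is a sketch of the instruction's semantics only; `compress_store` is a hypothetical helper, and AVX-512 performs the equivalent work for all 16 single-precision lanes of a register in one vcompressps:

```c
#include <stddef.h>

/* Model of a compress store: lanes whose mask bit is set are written
   to consecutive positions in dst. Returns the number of elements
   stored, i.e., how far the destination pointer advances. */
size_t compress_store(float *dst, const float *src,
                      const unsigned char *mask, size_t lanes) {
    size_t k = 0;
    for (size_t i = 0; i < lanes; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}
```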

ifort -c -xmic-avx512 -qopt-report-file=stderr -qopt-report=3 compress.f90
…
LOOP BEGIN at compress.f90(23,3)
   remark #15300: LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15457: masked unaligned unit stride stores: 1
…
   remark #15497: vector compress: 1
LOOP END
Figure 3. Vectorization when Intel AVX-512 is enabled

All elements of the source array are loaded, so the load is unmasked. Only selected elements are stored, so the effective store is masked. The optimization report contains remark #15497 to show that a compress idiom was recognized and vectorized. An assembly listing, obtained with the -S option, shows instructions such as:

vcompressps %zmm4, -4(%rsi,%rdx,4){%k1}

Similar results are obtained for the array expansion loop in Figure 2B.
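The expand direction can be sketched the same way. Again, `expand_load` is an illustrative scalar model of the vexpand semantics, not a real intrinsic:

```c
#include <stddef.h>

/* Model of an expand load: consecutive elements of src are written to
   the lanes of dst whose mask bit is set; other lanes are untouched.
   Returns the number of source elements consumed. */
size_t expand_load(float *dst, const float *src,
                   const unsigned char *mask, size_t lanes) {
    size_t k = 0;
    for (size_t i = 0; i < lanes; i++)
        if (mask[i])
            dst[i] = src[k++];
    return k;
}
```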

Compressing a single-precision array of 1,000,000 random values by a factor of two, repeated 1,000 times on an Intel Xeon Phi 7250 processor, resulted in an approximately 16x speedup of Intel AVX-512 over Intel AVX2, which corresponds to the width of the SIMD registers and instructions.

Intel AVX-512 Conflict Detection Instructions

Loops containing a store with indirect addressing have a potential dependency that typically prevents vectorization. For example:

for (i=0; i<n; i++)  a[index[i]] = …

If index[i] has the same value for two different values of i, the stores conflict and cannot be executed safely at the same time. The vpconflict instruction from Intel AVX-512 resolves this conflict by providing a mask for those SIMD lanes (values of i) that are conflict-free (i.e., no values of index[i] are duplicated). After the SIMD computation has been safely performed for these lanes, the loop is re-executed for any lanes that were masked out.
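The comparison vpconflict performs can be modeled in scalar C. This is a sketch of the idea for an 8-lane (or smaller) batch of 32-bit indices; `conflict_free_mask` is a hypothetical helper, not an intrinsic:

```c
#include <stdint.h>

/* Model of conflict detection: each lane is compared with all earlier
   lanes. A lane is conflict-free in the first SIMD pass iff no earlier
   lane holds the same index; the returned mask has one bit per such lane. */
uint8_t conflict_free_mask(const int32_t *index, int lanes) {
    uint8_t mask = 0;
    for (int i = 0; i < lanes; i++) {
        int clash = 0;
        for (int j = 0; j < i; j++)
            if (index[j] == index[i])
                clash = 1;
        if (!clash)
            mask |= (uint8_t)(1u << i);
    }
    return mask;
}
```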

Histogramming

Filling a histogram is a common operation in many applications (e.g., image processing). The code fragment in Figure 4 fills a histogram of sin(x) in the array h for an array x of input values. With Intel AVX2, this does not vectorize because of a dependency, since two input values could contribute to the same histogram bin, ih:

for (i=0; i<n; i++) {
    y     = sinf(x[i]*twopi);
    ih    = floor((y-bot)*invbinw);
    ih    = ih > 0     ? ih : 0;
    ih    = ih < nbin-1 ? ih : nbin-1;
    h[ih] = h[ih] + 1;                       //  line 25
 }
Figure 4. Loop to fill a histogram of sin(x)
icc -c -xcore-avx2 histo.c -qopt-report-file=stderr -qopt-report-phase=vec
…
LOOP BEGIN at histo2.c(20,4)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization…
      remark #15346: vector dependence: assumed FLOW dependence between h[ih] (25:7) and h[ih] (25:7)
LOOP END
Figure 5. Optimization report: the histogram loop does not vectorize with Intel AVX2

Intel AVX-512 conflict detection instructions allow this loop to be vectorized safely:

icc -c -xmic-avx512 histo2.c -qopt-report-file=stderr -qopt-report=3 -S
…
LOOP BEGIN at histo2.c(20,4)
   remark #15300: LOOP WAS VECTORIZED
   remark #15458: masked indexed (or gather) loads: 1 
   remark #15459: masked indexed (or scatter) stores: 1
   remark #15499: histogram: 2
LOOP END
Figure 6. Optimization report: the histogram loop vectorizes with Intel AVX-512

In the assembly code (not shown), x is loaded and ih is calculated for all SIMD lanes. This is followed by a vpconflict instruction and a masked gather of the unique bins (i.e., elements of h) into a register, which are then incremented. If some values of ih were duplicated, the code loops back and reincrements the corresponding bins. Finally, there is a masked store (scatter) of the incremented bins back into h.
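The loop-back scheme described above can be modeled in scalar C for one SIMD batch of bin indices. This is a sketch of the algorithm only (the real code operates on registers and masks), and `histo_update_batch` is an illustrative name:

```c
#include <stdint.h>

/* Model of conflict resolution for one batch of bin indices ih[0..lanes):
   increment the bins of conflict-free lanes, then loop back over the
   lanes that were masked out until every lane has been processed. */
void histo_update_batch(int *h, const int32_t *ih, int lanes) {
    uint32_t pending = (1u << lanes) - 1;   /* lanes still to process */
    while (pending) {
        uint32_t done = 0;
        for (int i = 0; i < lanes; i++) {
            if (!(pending & (1u << i)))
                continue;
            int conflict = 0;               /* earlier pending lane, same bin? */
            for (int j = 0; j < i; j++)
                if ((pending & (1u << j)) && ih[j] == ih[i])
                    conflict = 1;
            if (!conflict) {
                h[ih[i]]++;                 /* safe: unique bin in this pass */
                done |= 1u << i;
            }
        }
        pending &= ~done;
    }
}
```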

A simple test that filled a histogram of 200 bins from a single-precision array of 100,000,000 random values between 0 and 1, repeated 10 times, was compiled first for Intel AVX2 and then for Intel AVX-512 and run on an Intel Xeon Phi 7250 processor. A nearly 9x speedup was observed. The speedup depends heavily on the problem details. It comes mostly from vectorization of the loop computations, such as the sine function in this example, not from the gather and scatter themselves. The speedup can be large in the common case where conflicts are relatively infrequent. However, it may be smaller if there are many conflicts, as in a histogram containing a singularity or narrow spike.

Gather-to-Shuffle Optimization

Large arrays of small structures or of short vectors occur frequently but present a challenge to efficient vectorization. Vectorization over the structure content or short vectors may be impossible or inefficient due to the low trip count. Vectorization over the large array index is also inefficient because data used by consecutive iterations are not adjacent in memory, so efficient SIMD loads of contiguous data cannot be used. This is illustrated in Figure 7, which simply calculates the sum of squares of the components of an array of 3-vectors.

struct Point { float x; float y; float z; };

float sumsq( struct Point *ptvec, int n) {
    float   t_sum = 0;

    for (int i=0; i<n; i++) {
        t_sum += ptvec[i].x * ptvec[i].x;
        t_sum += ptvec[i].y * ptvec[i].y;
        t_sum += ptvec[i].z * ptvec[i].z;
    }
    return  t_sum;
}
Figure 7.

The older version 15 compiler vectorizes this loop using strided loads or gathers to load each component of the struct:

icc -c -std=c99 -xmic-avx512 -qopt-report=4 -qopt-report-file=stderr -qopt-report-phase=vec,cg g2s.c
…
LOOP BEGIN at g2s.c(6,5)
   remark #15415: vectorization support: gather was generated for the variable ptvec:  strided by 3   
…
   remark #15300: LOOP WAS VECTORIZED
   remark #15460: masked strided loads: 6
Figure 8.

However, the compiler can do better by noting that the x, y, and z components of the struct are adjacent in memory. The compiler can perform SIMD loads of the components, and then perform permutations and shuffles to transpose the data so it is laid out for vectorization of the loop over i, as illustrated in Figure 9.

Figure 9. Data transpose to enable vectorization
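Conceptually, the transpose in Figure 9 converts an array-of-structures layout into structure-of-arrays form. A minimal scalar sketch of the idea, using a hypothetical SIMD width of 4 (AVX-512 works on 16 floats at a time):

```c
#define W 4   /* illustrative SIMD width; AVX-512 uses 16 float lanes */

/* Model of gather-to-shuffle: load 3*W consecutive floats (W structs
   of {x,y,z}) with contiguous SIMD loads, then transpose them into
   three W-wide component vectors so the loop over points is unit stride. */
void transpose_xyz(const float *aos, float x[W], float y[W], float z[W]) {
    for (int i = 0; i < W; i++) {
        x[i] = aos[3*i + 0];
        y[i] = aos[3*i + 1];
        z[i] = aos[3*i + 2];
    }
}
```

In the generated code, this transpose is done with shuffle and permute instructions on registers rather than a scalar loop.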

When compiled with the newer version 17 compiler, the optimization report contains a "code generation" component:

Report from: Code generation optimizations [cg]
sumsq.c(10,22):remark #34030: adjacent sparse (strided) loads optimized for speed. Details: stride { 12 }, types { F32-V512, F32-V512, F32-V512 }, number of elements { 16 }, select mask { 0x000000007 }.
Figure 10.

This shows that the compiler was able to convert the original strided loads into contiguous SIMD loads followed by shuffles and permutes. A simple test loop over 10,000 random points, repeated 100,000 times, was compiled for Intel AVX-512 and run on an Intel Xeon Phi 7250 processor. The observed speedup using the version 17 compiler, compared to the version 15 compiler, was more than a factor of 2. The same optimization is performed if ptvec is a two-dimensional array ptvec[10000][3], or, in Fortran, if ptvec is a two-dimensional array ptvec(3,10000) or an array of derived types.

The compiler can sometimes perform this "gather to shuffle" optimization when targeting other instruction sets. However, the powerful new shuffle and permute instructions result in it being generated much more frequently when targeting Intel AVX-512.

Summary

Powerful new instructions in Intel AVX-512 enable improved application performance through the vectorization of some loops that could not be vectorized previously, and more efficient vectorization of others. The compiler optimization report shows when and where these optimizations are performed.


License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States United States
You may know us for our processors. But we do so much more. Intel invents at the boundaries of technology to make amazing experiences possible for business and society, and for every person on Earth.

Harnessing the capability of the cloud, the ubiquity of the Internet of Things, the latest advances in memory and programmable solutions, and the promise of always-on 5G connectivity, Intel is disrupting industries and solving global challenges. Leading on policy, diversity, inclusion, education and sustainability, we create value for our stockholders, customers and society.
This is a Organisation

42 members

Comments and Discussions

 
-- There are no messages in this forum --