
The future is here―at least when it comes to next-generation hardware from Intel. Hardware accelerators have long been touted as the next big thing in high-performance computing, with many vendors offering their products and services. However, the programming models associated with taking full advantage of accelerators have rather steep learning curves, and often result in the programmer being locked into proprietary code.

Taking a leap forward to alleviate such issues, the new Intel® Xeon Phi™ processor (code-named Knights Landing, or KNL) is Intel’s first processor to deliver the performance of an accelerator with the benefits you’ve come to expect from a standard CPU. Intel Xeon Phi processors do not need to be explicitly programmed, nor do they require code offload. This allows your programs, whether or not you’re already using standard OpenMP*, to run seamlessly on the new hardware. Powerful software development tools from Intel, part of the Intel® Parallel Studio XE 2017 suite, help you modernize your code and leverage multilevel scalable parallel performance.

What’s more, because the Intel Xeon Phi processor is binary compatible with Intel® Xeon® processors, any workload that you run today on x86 architecture can be tuned and optimized for the high parallelism offered by Intel Xeon Phi processors.

A Quick Primer on Code Modernization

The Intel Xeon Phi processor supports Intel® Advanced Vector Extensions 512 (Intel® AVX-512), which programs can use to pack eight double-precision or 16 single-precision floating-point numbers into each 512-bit vector. In general, the 512-bit vector extensions deliver a more comprehensive set of functionality and higher performance than the AVX and AVX2 families of instructions. Efficiently packing the vectors to minimize waste is easier said than done. On top of vector instructions, the runtime environment has several threads available to it, which adds another level of parallelism and complicates matters. Modern code design takes into consideration different levels of parallelism:

Vector parallelism: Within a core, identical computations are performed on different chunks of data―so-called SIMD (single instruction, multiple data) vectorization. Both scalar and parallel portions of code will benefit from the efficient use of SIMD vectorization.
Thread parallelism: Multiple threads communicate via shared memory and collectively cooperate on a given task.

Fortunately, the latest OpenMP standards (4.0 and beyond) provide constructs to easily add both explicit SIMD vectorization and thread parallelism capability to your C/C++ and Fortran codes. The Intel® compilers include support for OpenMP.
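As a minimal sketch of how the two levels combine in C++, the loop below uses a single OpenMP 4.0 directive to request both thread parallelism and SIMD vectorization. The function name scaled_add and its arguments are illustrative assumptions, not code from the workloads discussed in this article.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the article): y[i] = a * x[i] + y[i].
// "parallel for" divides the iterations among threads, and "simd" asks the
// compiler to vectorize the chunk each thread receives. If OpenMP is not
// enabled at compile time, the pragma is ignored and the loop runs serially.
void scaled_add(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const long n = static_cast<long>(x.size());

    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The directive takes effect only when OpenMP is enabled at compile time (for example, -qopenmp with the Intel compiler or -fopenmp with GCC); the results are the same either way.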

When properly utilized, SIMD vectorization and threading lead to enormous potential speedups. Figure 1 shows up to a 179x speedup over serial code on the latest Intel Xeon processor. Similar performance advancements can be expected on Intel Xeon Phi processors.

Figure 1 - Effective use of SIMD vectorization and thread parallelism gives impressive performance improvement over sequential code on nontrivial binomial options workloads

Intel Parallel Studio XE and the Code Modernization Process

The code modernization optimization framework has been well documented as a systematic approach to application performance improvement. This framework, equally applicable to Intel Xeon Phi processors, takes an application through various optimization stages, each stage iteratively improving the application performance:

Profile your current code and workload: Use Intel® VTune™ Amplifier to identify hotspots and Intel® Advisor to identify vectorization and threading opportunities and headroom. Intel® compilers then generate optimal code and apply optimized libraries such as Intel® Math Kernel Library (Intel® MKL), Intel® Threading Building Blocks (Intel® TBB), and Intel® Integrated Performance Primitives (Intel® IPP).
Optimize your scalar code: Identify whether you are working with the correct data type precision, are using appropriate functions, and are setting the precision flags during compilation.
Explicitly SIMD vectorize your code: Utilize OpenMP-based SIMD vectorization features in conjunction with data layout optimizations. Apply correct data structures, and convert C++ code from arrays of structures to structures of arrays using Intel® SIMD Data Layout Templates (see Parallel Universe magazine issue 24).
Thread-parallelize your code: Use OpenMP and available environment variables to set the appropriate affinity of threads to cores. Use Intel® Inspector to debug and make sure no threading errors are causing scaling issues, which are typically a result of thread synchronization or inefficient memory utilization.
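To make the array-of-structures to structure-of-arrays conversion in the vectorization step concrete, here is a small sketch in plain C++. It deliberately uses std::vector rather than the Intel SIMD Data Layout Templates, and the type and function names (PointAoS, PointsSoA, to_soa, sum_x) are assumptions chosen for illustration.

```cpp
#include <vector>

// Array-of-structures layout: the x, y, and z of one point sit next to each
// other, so a loop over all x values reads memory with a stride of 3 floats.
struct PointAoS {
    float x, y, z;
};

// Structure-of-arrays layout: each field is its own contiguous array, so a
// loop over one field reads unit-stride data, which suits SIMD loads.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Convert from the AoS layout to the SoA layout (one-time copy).
PointsSoA to_soa(const std::vector<PointAoS>& points) {
    PointsSoA soa;
    soa.x.reserve(points.size());
    soa.y.reserve(points.size());
    soa.z.reserve(points.size());
    for (const PointAoS& p : points) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// Sum one field; the explicit SIMD reduction is an OpenMP 4.0 construct and
// is ignored when OpenMP is not enabled.
float sum_x(const PointsSoA& pts) {
    const float* x = pts.x.data();
    const long n = static_cast<long>(pts.x.size());
    float total = 0.0f;
    #pragma omp simd reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```

Whether the conversion pays off depends on how often the data is traversed field by field; the copy itself has a cost.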

Figure 2 - Code modernization on Intel® Xeon Phi™ processors

The various components of Intel Parallel Studio XE―compilers, optimized libraries, and performance analyzers/profilers―aid in modernizing your code to leverage the full potential of Intel Xeon Phi processors. Figure 2 breaks down the code modernization process on Intel Xeon Phi processors and shows how these Intel tools help you get there. Not running on a cluster? Just ignore the top two boxes and start with "effective threading."

Intel compilers offer support for the latest AVX-512 instructions with Intel Xeon and Xeon Phi processors. The compilers also support the latest C++ and Fortran standards with backwards compatibility. Additionally, OpenMP 4 and 4.5 allow you to be well on your way to achieving supercharged performance as offered by explicit SIMD vectorization and threading (Figure 1). (Look for compiler specifics for Intel Xeon Phi processors later in this article.)

Intel VTune Amplifier 2017 offers several optimizations critical for Intel Xeon Phi processors. You need to decide how best to use MCDRAM―a key feature of Xeon Phi processors. Intel VTune Amplifier’s pipeline/cache, memory analysis, and scalability analysis help you:

Decide which data structures to place in MCDRAM
See performance problems by memory hierarchy
Measure DRAM and MCDRAM bandwidths

Figure 3 - Sample Intel® Advisor Summary output on a real-world workload executed on the Intel® Xeon Phi™ processor

Additionally, you can address OpenMP scalability issues by measuring serial versus parallel time and by detecting imbalance and overhead costs. Finally, Intel VTune Amplifier exposes microarchitecture efficiency on the Intel Xeon Phi processor by showing how efficiently your code uses the core pipeline.

Intel Advisor is a central pillar of code modernization. You need Intel Advisor to easily identify headroom and then optimize for AVX-512 instructions specific to Intel Xeon Phi processors. For example, the Intel Advisor Summary for Intel Xeon Phi on a real workload (Figure 3) indicates that a lot of kernels were vectorized, but there is still huge room for further improvement.

Providing accelerated mathematical functionality for Intel Xeon and Xeon Phi processors, the Intel MKL is an indispensable tool to achieve automatic performance scaling from within the single core (explicit vectorization) to multicore (threading)―key components of code modernization. Additionally, the new Intel® Distribution for Python* is optimized using Intel MKL under the covers. Intel MKL can determine the best load balancing between the host CPU and the Intel Xeon Phi coprocessor, in case you are running on the host-offload model. (See detailed coverage of the Intel Distribution for Python.)

Recently celebrating 10 years in the industry, Intel TBB is a widely used, award-winning C++ library for creating high-performance, scalable parallel applications. (You can find many of the optimizations Intel TBB has to offer in a special issue of Parallel Universe magazine.)

Intel IPP provides you with ready-to-use, processor-optimized building blocks to accelerate image, signal, and data processing and cryptography computation tasks. These routines are optimized for the latest instruction sets, including AVX-512, which allow you to get performance beyond what an Intel Xeon Phi processor-optimized compiler produces alone.

Intel Compiler 17.0 Support on the Intel Xeon Phi Processor

We now do a deeper analysis and cover the vectorization opportunities afforded to you by Intel compilers. Pertaining specifically to the Intel Xeon Phi processor, the general features of the Intel Compiler apply to standard C/C++ and Fortran language applications. The major enhancements deal with the extended AVX-512 instruction sets on Intel Xeon Phi processors.

Intel Compiler 17.0 extends its support for vectorization with the following enhancements:

Indirect vector function calls including virtual functions (see issue 25 of Parallel Universe magazine)
OpenMP 4.5 support on reductions over arrays and array sections
Prefetch for indirect memory references
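As an illustration of the OpenMP 4.5 reduction over an array section mentioned above, the sketch below counts occurrences of bin indices. The function count_bins and its parameters are hypothetical; the example assumes every input value is a valid bin index.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: count how often each bin index occurs in `bins`.
// Assumes nbins > 0 and that every entry of `bins` lies in [0, nbins).
std::vector<int> count_bins(const std::vector<int>& bins, int nbins) {
    assert(nbins > 0);
    std::vector<int> counts(nbins, 0);
    int* c = counts.data();
    const int* b = bins.data();
    const long n = static_cast<long>(bins.size());

    // OpenMP 4.5 array-section reduction: each thread accumulates into its
    // own private copy of c[0] .. c[nbins-1], and the copies are summed when
    // the loop ends. Without OpenMP the pragma is ignored.
    #pragma omp parallel for reduction(+ : c[0:nbins])
    for (long i = 0; i < n; ++i) {
        c[b[i]] += 1;
    }
    return counts;
}
```

Before OpenMP 4.5, the same effect in C/C++ typically required hand-written per-thread arrays and a manual merge step.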

The Intel Compiler 17.0 also strengthens its existing support of AVX-512 on Intel Xeon Phi processors with compiler option -xMIC-AVX512.

Vectorization for AVX-512

The base of the 512-bit SIMD instruction extensions is referred to as the Intel AVX-512 Foundation instructions. They include extensions of the AVX and AVX2 family of SIMD instructions but are encoded using a new encoding scheme with support for 512-bit vector registers, up to 32 vector registers in 64-bit mode, and conditional processing using opmask registers.
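Conditional processing matters for loops whose body contains a branch. The sketch below shows such a loop; when it is compiled for AVX-512, the compiler may turn the comparison into an opmask and apply the square root and store only to the selected lanes, although the exact code generated depends on the compiler and options. The function name sqrt_positive is an assumption made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical example: write sqrt(in[i]) to out[i] for positive inputs and
// leave out[i] untouched otherwise. The two vectors must be distinct objects
// of equal size, because the simd directive asserts independent iterations.
void sqrt_positive(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    const float* src = in.data();
    float* dst = out.data();
    const long n = static_cast<long>(in.size());

    #pragma omp simd
    for (long i = 0; i < n; ++i) {
        if (src[i] > 0.0f) {          // candidate for a per-lane mask
            dst[i] = std::sqrt(src[i]);
        }
    }
}
```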

The Intel AVX-512 family provides several additional 512-bit extensions in groups of instructions that target specific application domains:

Intel AVX-512 exponential and reciprocal instructions (Intel AVX-512ER) for certain transcendental mathematical computations
Intel AVX-512 prefetch instructions for specific prefetch operations (Intel AVX-512PF)
Other instructions targeting SHA, MPX, etc.

As shown in Figure 4, some of these instructions are supported by the latest Intel Xeon Phi processors (starting from KNL), and others will be supported by future Intel Xeon processors.

Figure 4 - Intel® Xeon Phi™ and future Intel® Xeon® processors share a large set of instructions

1. Intel AVX-512

Intel AVX-512 Foundation instructions are a natural extension to AVX and AVX2. They cover both instructions that operate on 512-bit vectors and instruction set extensions that use the EVEX prefix encoding scheme to operate at vector lengths smaller than 512 bits. Therefore, loops vectorized previously on Intel Xeon processors will naturally vectorize on Intel Xeon Phi processors. The vector length may be extended depending on loop counts and array lengths.

To enable an existing application on Intel Xeon processors to run on Intel Xeon Phi processors with AVX-512, simply compile with the option -xMIC-AVX512 and copy the generated binary to an Intel Xeon Phi processor-based system.

2. AVX-512PF

AVX-512PF and AVX-512ER are two unique instruction sets only available on Intel Xeon Phi processors. AVX-512PF is a new set of prefetch instructions for gather/scatter and PREFETCHWT1. To enable this with Intel Compiler 17.0, you need the following combination of options:

-O3 -xmic-avx512 -qopt-prefetch=<n>

n=0 is the default if you omit the -qopt-prefetch option (no prefetches will be issued)
n=2 is the default if you specify -qopt-prefetch with no explicit "n" argument. This inserts prefetches only for direct references where the compiler thinks the hardware prefetcher may not be effective
n=3 turns on prefetching for all direct memory references without regard to the hardware prefetcher
n=5 turns on prefetching for all direct and indirect memory references. Indirect prefetches use the AVX-512PF gatherpf instructions.

You can also use pragmas of the form #pragma prefetch var:hint:distance to force the compiler to issue prefetches for particular memory references. This will issue the prefetch for pragma-specified direct or indirect references even at -qopt-prefetch=2 or 3 setting.

As an example, we compile the following C++ loop with -O3 -xmic-avx512 -qopt-prefetch=5 -qopt-report5:

// Syntax: #pragma prefetch var:hint:distance
#pragma prefetch A:1:3
#pragma vector aligned
#pragma simd
for (int i = 0; i < n; i++) {
    C[i] = A[B[i]];
}

You will get remarks on generation of gather/scatter prefetches for indirect memory reference, similar to Figure 5.

Figure 5 - Gather/scatter prefetch remarks shown in compiler optimization report
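To show what prefetching an indirect reference means, the sketch below issues the prefetch by hand for the C[i] = A[B[i]] loop above. It uses the GCC/Clang builtin __builtin_prefetch, not the Intel prefetch pragma or the AVX-512PF instructions, so it only illustrates the idea. The function name gather_with_prefetch and the distance parameter are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring C[i] = A[B[i]].
// Assumes every value in B is a valid index into A.
void gather_with_prefetch(const std::vector<float>& A,
                          const std::vector<int>& B,
                          std::vector<float>& C,
                          std::size_t distance) {
    const std::size_t n = B.size();
    C.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        // Request the element needed `distance` iterations from now.
        // The address depends on B, which is why this reference is
        // "indirect". Arguments: address, 0 = read, 2 = moderate locality.
        if (i + distance < n) {
            __builtin_prefetch(&A[B[i + distance]], 0, 2);
        }
#endif
        C[i] = A[B[i]];
    }
}
```

A prefetch is only a hint, so the results are identical with or without it; a suitable distance depends on memory latency and the cost of the loop body, and has to be found by measurement.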

3. AVX-512ER

AVX-512ER provides fast and high-precision approximation instructions for exponential, reciprocal, and reciprocal square root functions. The AVX-512ER exponential instructions (VEXP*) approximate the exponential 2^x of packed double-precision or single-precision floating-point values with less than 2^-23 relative error. These instructions are only available for the Intel Xeon Phi processor.

Currently, these instructions only support the ZMM registers, which are 512 bits long. The exp function must therefore be evaluated on at least eight double-precision or 16 single-precision floating-point numbers at a time to fill the vector. Otherwise, the Short Vector Math Library (SVML) version of exp is called. For example, the Fortran loop shown in Figure 6 will not generate VEXP instructions if arrayB(i) is single precision, because eight single-precision values fill only 256 bits.

do i =1,8
    arrayC(i) = arrayA(i)*exp(arrayB(i))
end do

Figure 6 - Fortran loop

To overcome this limitation, we merge small arrays with similar exp calls into a bigger array. The loop count is thereby extended to fill a ZMM vector, and vectorization benefits from the hardware implementation of the VEXP instructions, which is much more efficient on Intel Xeon Phi processors. Figure 7 shows an example in Fortran.

Module testER
implicit none
real, parameter, dimension(8) :: arrayA1 = &
(/2.0,1.5,1.37,2.4,3.3,4.9,5.1,0.0/)
real, parameter, dimension(8) :: arrayA2 = &
(/0.3,7.1,4.1,3.8,9.1,0.5,0.0,1.2/)
real, parameter, dimension(8) :: arrayB1 = &
(/8.0,1.2,1.4,1.7,2.58,3.4,5.0,7.1/)
real, parameter, dimension(8) :: arrayB2 = &
(/0.6,1.3,2.8,9.6,2.3,1.5,0.2,0.3/)
real, parameter, dimension(16) :: arrayA = &
(/2.0,1.5,1.37,2.4,3.3,4.9,5.1,0.0,0.3,7.1,4.1,3.8,9.1,0.5,0.0,1.2/)
real, parameter, dimension(16) :: arrayB = &
(/8.0,1.2,1.4,1.7,2.8,3.4,5.0,7.1,0.6,1.3,2.8,9.6,2.3,1.5,0.2,0.3/)
!DIR$ ATTRIBUTES ALIGN : 64 :: arrayA, arrayB, arrayA1, arrayB1, arrayA2, arrayB2

contains
subroutine cal_exp_16(arrayC)
real, dimension(16) :: arrayC
integer i

!DIR$ VECTOR ALIGNED
do i=1,16
    arrayC(i) = exp(arrayA(i))*arrayB(i)
enddo
end subroutine

subroutine cal_exp_2x8(arrayC)
real, dimension(16) :: arrayC
integer i

!DIR$ VECTOR ALIGNED
do i=1,8
    arrayC(i) = exp(arrayA1(i))*arrayB1(i)
end do

do i=1,8
    arrayC(i+8) = exp(arrayA2(i))*arrayB2(i)
enddo

end subroutine
end module

Figure 7 - Fortran example

Array arrayA is a combination of arrayA1 and arrayA2. Routine cal_exp_16 calculates exp for the 16 single-precision numbers in arrayA and multiplies the results by another 16 single-precision numbers. Routine cal_exp_2x8 performs the same calculation separately on arrayA1 and arrayA2 in two small loops. Experimental measurements show that cal_exp_16 runs over 5x faster than cal_exp_2x8 on the Intel Xeon Phi processor, as shown in Figure 8. Compiler options are "-xCORE-AVX2 -O2" for Haswell-EP (HSW-EP) and "-xMIC-AVX512 -O2" for Intel Xeon Phi. Note that this is single-core performance and that the HSW-EP clock frequency is 1.6x that of KNL. The cal_exp_16 binary with VEXP enabled outperforms the AVX2 binary on HSW-EP by 12 percent.

This example shows that calling the SVML version of exp with a 256-bit vector is not as efficient on Intel Xeon Phi processors as on Intel Xeon processors. Merging small arrays to use 512-bit vectors on an Intel Xeon Phi processor allows the compiler to generate fast AVX-512ER instructions, which greatly improve performance.

HSW-EP Intel® Xeon® CPU E5-2699 v3 @ 2.30 GHz
KNL Intel® Xeon Phi™ CPU 7250 @ 1.40 GHz

Figure 8 - Unit test performance (speedup) measurements
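The merging idea from Figure 7 carries over to C++. The sketch below concatenates two short arrays and then evaluates exp in a single loop over the merged data, so that the trip count can reach a full 512-bit vector of 16 single-precision values. The function names merge_arrays and exp_times are assumptions, and whether VEXP instructions are actually generated depends on the compiler, the target options, and the precision settings.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: concatenate two short arrays into one longer array.
std::vector<float> merge_arrays(const std::vector<float>& first,
                                const std::vector<float>& second) {
    std::vector<float> merged(first);
    merged.insert(merged.end(), second.begin(), second.end());
    return merged;
}

// One loop over the merged data: c[i] = exp(a[i]) * b[i].
// With 16 or more floats, a 512-bit vector can be filled in one iteration
// of the vectorized loop.
std::vector<float> exp_times(const std::vector<float>& a,
                             const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    const float* ap = a.data();
    const float* bp = b.data();
    float* cp = c.data();
    const long n = static_cast<long>(a.size());

    #pragma omp simd
    for (long i = 0; i < n; ++i) {
        cp[i] = std::exp(ap[i]) * bp[i];
    }
    return c;
}
```

In real code the merge is best done once, when the data is laid out, rather than before every call; copying the arrays each time would offset part of the gain.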

4. AVX-512CD

AVX-512CD is a newly introduced instruction set starting with the Intel Xeon Phi processor, and will also be available in future Intel Xeon processors. It can be used to detect conflicts within a vector, which will be helpful to resolve dependencies in specific cases within a loop. A common scheme can be characterized as the "histogram update." This is when a memory location is read, operated on, and then stored to. A sample piece of C code with the access pattern is shown in Figure 9.

for (i=0; i < 512; i++)
    histo[key[i]] += 1;

Figure 9 - Access pattern

Compiler auto-vectorization will generate conflict detection instructions and vectorize such a loop, even though there are potential dependencies: when key[n] and key[m] are the same, array histo must be read and written in the correct order.
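To show why conflict detection makes this loop safe to vectorize, the sketch below emulates the idea in scalar C++. It processes the keys in chunks of 16 (the number of 32-bit lanes in a 512-bit vector), computes for each lane a bitmask of earlier lanes holding the same key, and then performs one combined update per distinct key. This is a conceptual model only, loosely following the semantics of the VPCONFLICTD instruction; it is not the code the compiler generates, and the function names are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 16;  // 32-bit lanes in one 512-bit vector

// Scalar emulation of per-lane conflict detection: bit j of masks[i] is set
// when an earlier lane j (j < i) holds the same key as lane i.
std::vector<std::uint32_t> conflict_masks(const int* keys, std::size_t count) {
    std::vector<std::uint32_t> masks(count, 0u);
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            if (keys[i] == keys[j]) {
                masks[i] |= (1u << j);
            }
        }
    }
    return masks;
}

// Histogram update in chunks of kLanes keys.
// Assumes every key is a valid index into histo.
void histogram_update(const std::vector<int>& key, std::vector<int>& histo) {
    for (std::size_t base = 0; base < key.size(); base += kLanes) {
        const std::size_t count = std::min(kLanes, key.size() - base);
        const std::vector<std::uint32_t> masks =
            conflict_masks(&key[base], count);

        for (std::size_t i = 0; i < count; ++i) {
            if (masks[i] != 0u) {
                continue;  // an earlier lane already accounts for this key
            }
            // Lane i is the first occurrence of its key in this chunk:
            // fold in every later lane that conflicts with it.
            int increment = 1;
            for (std::size_t j = i + 1; j < count; ++j) {
                if (masks[j] & (1u << i)) {
                    ++increment;
                }
            }
            // The remaining updates all target distinct keys, so they could
            // be carried out together without a read/write hazard.
            histo[key[base + i]] += increment;
        }
    }
}
```

Lanes with an empty mask never collide with each other, which is the property a vectorized scatter needs.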

Such an access pattern can be extended to more complex cases with multiple histogram updates (Figure 10).

for (int j = 0; j < 512; j++) {
    const int j1 = (int)ind1[j];
    const int j2 = (int)ind2[j];

    for (int i = 0; i < 512; i++) {
        const int k1 = (int)key1[i];
        const int k2 = (int)key2[i];
        histo[j1][k1] += a[i];
        histo[j2][k1] += b[i];
        histo[j1][k2] += c[i];
        histo[j2][k2] += d[i];
    }
}

Figure 10 - Complex access pattern

This loop with four histogram updates and 2D indices can be auto-vectorized by the compiler with the support of AVX-512CD.

The Final Word

With the tools overview and the compiler deep dive, we have illustrated that there are several new evolutionary, even revolutionary, features in Intel Parallel Studio XE 2017. These features can help you vectorize and multithread your legacy and new C/C++/Fortran code to leverage the power of the Intel Xeon Phi processor, Intel’s newest processor for delivering deeper insight. Start modernizing your code today.

Resources