
Optimizing Application Performance with Roofline Analysis

14 Jun 2017 · CPOL · 5 min read
NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Parallel Studio and Intel® Advisor

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Click here to register and download your free 30-day trial of Intel® Parallel Studio XE

The National Energy Research Scientific Computing Center (NERSC) is the primary scientific computing facility for the U.S. Department of Energy’s Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 6,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines.

To meet its goals, NERSC needs to optimize its diverse applications for peak performance on Intel® Xeon Phi™ processors. To do that, it uses a roofline analysis model based on Intel® Advisor, a tool in the Intel® Parallel Studio software suite. The roofline model was originally developed by Sam Williams, a computer scientist in the Computational Research Division at Lawrence Berkeley National Laboratory. Using the model increased application performance up to 35%.

Roofline Analysis

"Optimizing complex applications demands a sense of absolute performance," explained Dr. Tuomas Koskela, postdoctoral fellow at NERSC. "There are many potential optimization directions. It’s essential to know which direction to take, what factors are limiting performance, and when to stop."

Roofline analysis quantifies the gap between an application's measured performance and the peak performance of the computing platform. This visually intuitive performance model bounds the performance of numerical methods and operations running on multi-core, many-core, or accelerator processor architectures.

Instead of simply using percent-of-peak estimates, the model assesses the quality of performance by combining locality, bandwidth, and different parallelization paradigms into a single performance figure. The roofline chart reveals both implementation-specific and inherent performance limitations (Figure 1).

A classic roofline model includes three measurements:

  • The number of floating point operations per second (FLOP/s)
  • The number of bytes from DRAM
  • Computation time
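Combining these three measurements gives the classic roofline bound: attainable FLOP/s is the minimum of the machine's peak compute rate and the kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) times the DRAM bandwidth. The following minimal C sketch illustrates that bound; the peak and bandwidth numbers are illustrative assumptions, not NERSC measurements.

/* Classic roofline bound: attainable GFLOP/s is limited either by the
 * machine's peak compute rate or by DRAM bandwidth times the kernel's
 * arithmetic intensity (FLOPs per byte moved from DRAM).
 * The platform numbers below are illustrative placeholders. */
#include <stdio.h>

static double roofline_gflops(double ai_flops_per_byte,
                              double peak_gflops,
                              double dram_bw_gbytes_per_s)
{
    double memory_bound = ai_flops_per_byte * dram_bw_gbytes_per_s; /* GFLOP/s */
    return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
}

int main(void)
{
    /* Hypothetical platform: 2000 GFLOP/s peak, 400 GB/s DRAM bandwidth. */
    double peak = 2000.0, bw = 400.0;

    /* A stream-like kernel with AI = 0.1 FLOP/byte sits far below the ridge
     * point (peak / bw = 5 FLOP/byte) and is therefore bandwidth bound. */
    printf("AI = 0.1  -> %.1f GFLOP/s attainable\n", roofline_gflops(0.1, peak, bw));
    printf("AI = 10.0 -> %.1f GFLOP/s attainable\n", roofline_gflops(10.0, peak, bw));
    return 0;
}

On this hypothetical machine, any kernel with an AI below the ridge point of 5 FLOP/byte is bandwidth bound, which is why raising arithmetic intensity is often the first optimization target.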

All-in-One Tool

The Intel Advisor roofline implementation provides even more insights than a standard roofline analysis by plotting more rooflines:

  • Cache rooflines illustrate performance if all the data fits into the respective cache.
  • Vector usage rooflines show the maximum achievable performance levels if vectorization is used effectively.

In a classic roofline model, bytes are measured out of a given level of the memory hierarchy. Arithmetic intensity (AI) therefore depends on the problem size and on cache effects, and memory optimizations will change the AI.

Intel Advisor is based on a cache-aware roofline model, where bytes are measured into the CPU from all levels of the memory hierarchy. The AI is then independent of problem size and platform, and is consistent for a given algorithm.
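As a simple illustration (not taken from the article's codes), the cache-aware AI of a triad loop can be read directly from the source, and it does not change with the array size:

/* Cache-aware arithmetic intensity of a triad kernel. Bytes are counted as the
 * traffic the CPU issues to the memory subsystem (all cache levels), so the AI
 * is a property of the algorithm and does not change with the array size. */
#include <stddef.h>

void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Per iteration: 2 FLOPs (one multiply, one add) and
         * 24 bytes requested by the CPU (load b[i], load c[i], store a[i]).
         * Cache-aware AI = 2 / 24, roughly 0.083 FLOP/byte, for any n.
         * A classic, DRAM-based AI would rise if the arrays fit in cache,
         * because fewer bytes would then be measured out of DRAM. */
        a[i] = b[i] + s * c[i];
    }
}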

Figure 2 shows how a point moves in a classic versus a cache-aware roofline model.

Figure 3 shows Intel Advisor’s cache-aware roofline model report. The red loop is the most time-consuming, while the green loops are insignificant in terms of computing time. Loops that consume more time have a bigger impact when optimized, and the loops farthest below a roof have the most potential for improvement.

Cache-Aware Roofline Model in Action

NERSC used Intel Advisor’s cache-aware roofline model to optimize two of its key applications:

  • PICSAR*, a high-performance particle-in-cell (PIC) library for many integrated core (MIC) architectures
  • XGC1*, a PIC code for fusion plasmas

The PICSAR application was designed to interface with the existing PIC code WARP*. It provides high-performance PIC routines to the community and is planned for release as an open source project.

The application is used for projects in plasma physics, laser-matter interaction, and conventional particle accelerators. Its optimizations include:

  • L2 field cache blocking, where the MPI domain decomposes into tiles (see the sketch after this list)
  • Hybrid parallelization, with OpenMP* handling tiles (inner-node parallelism)
  • New data structures to enable efficient vectorization (current/charge deposition)
  • An efficient parallel particle exchange algorithm between tiles
  • Parallel optimized pseudo spectral Maxwell solver
  • Particle sorting algorithm (memory locality)
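The cache-blocking idea behind the first item can be sketched as follows. This is a hypothetical illustration of L2 blocking with OpenMP threads working on whole tiles, not PICSAR source code; the array dimensions, tile size, and stencil are assumptions chosen for brevity.

/* A minimal sketch of the tiling idea (not PICSAR's actual code): the field
 * grid is split into tiles small enough that each tile's working set fits in
 * the L2 cache, and OpenMP threads work on whole tiles. Names and sizes here
 * are illustrative assumptions. */
#include <stddef.h>

#define NX   1024
#define NY   1024
#define TILE 64    /* chosen so a TILE x TILE block of doubles fits in L2 */

void smooth_tiled(double field[NY][NX], const double src[NY][NX])
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ty = 0; ty < NY; ty += TILE) {
        for (size_t tx = 0; tx < NX; tx += TILE) {
            /* All accesses below stay inside one tile, so data loaded into
             * L2 is reused before it is evicted. */
            for (size_t y = ty; y < ty + TILE && y < NY - 1; ++y) {
                for (size_t x = tx; x < tx + TILE && x < NX - 1; ++x) {
                    if (y == 0 || x == 0) continue;   /* skip boundary points */
                    field[y][x] = 0.25 * (src[y-1][x] + src[y+1][x]
                                        + src[y][x-1] + src[y][x+1]);
                }
            }
        }
    }
}

Letting each OpenMP thread own whole tiles also matches the hybrid MPI + OpenMP parallelization listed above, where OpenMP handles inner-node parallelism over tiles.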

NERSC applied the roofline model to three configurations:

  • No tiling and no vectorization
  • Tiling (L2 cache blocking) and no vectorization
  • Tiling (L2 cache blocking) and vectorization
Figure 1. Roofline visual performance model

Figure 2. How a point moves in a classic versus a cache-aware roofline model

Figure 3. Intel Advisor report

The XGC1 application is a PIC code for simulating plasma turbulence in tokamak (edge) fusion plasmas. Its complicated geometry includes:

  • An unstructured mesh in 2D (poloidal) planes
  • A nontrivial, field-following (toroidal) mapping between meshes
  • Typical simulations with 10,000 particles per cell, 1,000,000 cells per domain, and 64 toroidal domains

Most of the computation time in XGC1 is spent in electron subcycling. Bottlenecks included:

  • Field interpolation to the particle position in the field gather
  • Element search on the unstructured mesh after the push
  • Computation of high-order terms in the gyrokinetic equations of motion in the push

With a single Intel Advisor survey, NERSC was able to discover most of the bottlenecks (Figure 4).

The optimizations included:

  • Enabling vectorization by inserting loops over blocks of particles inside short-trip-count loops
  • Data structure reordering to store field and particle data in structure-of-arrays (SoA) format, which is best for accessing multiple components with a gather instruction (see the sketch after this list)
  • Algorithmic improvements including reducing the number of unnecessary calls to the search routine and sorting particles by the element index instead of local coordinates.
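A minimal sketch of the first two ideas, a blocked particle loop over SoA data, is shown below. The structure, field names, and block size are illustrative assumptions and not XGC1 source.

/* A minimal sketch (not XGC1 source) of blocked particle loops over
 * structure-of-arrays data. All names and the block size are assumptions. */
#include <stddef.h>

#define BLOCK 16   /* particles processed together; a multiple of the SIMD width */

/* Structure of arrays: each coordinate is contiguous in memory, so unit-stride
 * (or gather-friendly) vector loads are possible. */
typedef struct {
    double *x, *y, *z;     /* positions  */
    double *vx, *vy, *vz;  /* velocities */
    size_t  n;
} particles_soa;

void push_particles(particles_soa *p, double dt)
{
    for (size_t start = 0; start < p->n; start += BLOCK) {
        size_t end = (start + BLOCK < p->n) ? start + BLOCK : p->n;

        /* Short, data-parallel inner loop over one block of particles;
         * with the SoA layout the compiler can turn it into SIMD instructions. */
        for (size_t i = start; i < end; ++i) {
            p->x[i] += p->vx[i] * dt;
            p->y[i] += p->vy[i] * dt;
            p->z[i] += p->vz[i] * dt;
        }
    }
}

Sorting particles by element index, the third item, keeps the particles in each block close together on the mesh, which is aimed at improving the locality of the field gather.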

Figure 4. Discovering XGC1 bottlenecks

Conclusions

Intel Advisor’s roofline analysis showed NERSC how close its applications were to the peak performance of its computing platform, providing an all-in-one tool for cache-aware roofline analysis and helping to boost application performance by up to 35%.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States
You may know us for our processors. But we do so much more. Intel invents at the boundaries of technology to make amazing experiences possible for business and society, and for every person on Earth.

Harnessing the capability of the cloud, the ubiquity of the Internet of Things, the latest advances in memory and programmable solutions, and the promise of always-on 5G connectivity, Intel is disrupting industries and solving global challenges. Leading on policy, diversity, inclusion, education and sustainability, we create value for our stockholders, customers and society.