OpenMP Is Turning 20!

Intel

Aug 31, 2017

CPOL

Making Parallel Programming Accessible to C/C++ and Fortran Programmers

Bronis R. de Supinski, Chief Technology Officer, Livermore Computing

I suppose that 20 years makes OpenMP* a nearly full-grown adult. In any event, this birthday provides us with a good opportunity to review the origins and evolution of the OpenMP specification. As the chair of the OpenMP Language Committee (a role that I have filled for the last nine years, since shortly after the release of OpenMP 3.0), I am happy to have the opportunity to offer this retrospective, along with a glimpse into the health of the organization that owns OpenMP and into our plans to keep it relevant. I hope you will agree that OpenMP remains true to its roots while evolving sufficiently to have reasonable prospects for another 20 years.

OpenMP arose in an era in which many compiler implementers supported unstandardized directives to guide loop-based, shared-memory parallelization. While these directives were often effective, they failed to meet a critical user requirement: portability. Not only did the syntax (i.e., spelling) of those implementation-specific directives vary, but their semantics also often differed in subtle ways. Recognizing this deficiency, Mary Zosel from Lawrence Livermore National Laboratory worked closely with the implementers to reach agreement on common syntax and semantics that all would provide in OpenMP 1.0. In addition, they created the OpenMP organization to own and to maintain that specification.

Many view OpenMP’s roots narrowly as exactly the loop-based, shared-memory directives that were standardized in that initial specification. I prefer to look at them more broadly as a commitment to support portable directive-based parallelization and optimization, possibly limited to within a process, through an organization that combines compiler implementers with user organizations who require their systems to deliver the highest possible portable performance. Ultimately, I view OpenMP as a mechanism for programmers to express key performance features of their applications that compilers would find difficult or even impossible to derive through (static) analysis. Organizationally, OpenMP is vibrant, with membership that includes representatives of all major compiler implementations (at least in the space relevant to my user organization) and an active and growing set of user organizations.

Technically, the OpenMP specification met its original goal of unifying the available loop-based parallelization directives. It continues to provide simple, well-defined constructs for that purpose. Further, OpenMP has fostered ongoing improvements to those directives, such as the ability to collapse loops and to control the placement of threads that execute the parallelized code. Because these constructs achieve consistently strong performance across shared-memory architectures and a complete range of compilers, OpenMP provides the portability that motivated its creation.
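As a rough illustration of the loop directives mentioned above, the following minimal sketch combines the collapse clause with the proc_bind clause for thread placement. The function name scale_matrix and the row-major layout are assumptions made for this example, not anything prescribed by OpenMP. If the compiler is not invoked with OpenMP enabled, the pragma is ignored and the loops simply run serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiply every element of a rows x cols matrix,
   stored in row-major order, by a constant factor. */
void scale_matrix(double *a, int rows, int cols, double factor)
{
    assert(a != NULL && rows >= 0 && cols >= 0);

    /* collapse(2) merges the two perfectly nested loops into a single
       iteration space of rows * cols iterations before it is divided
       among the threads. proc_bind(close) asks the runtime to place the
       threads near the thread that encountered the parallel region. */
    #pragma omp parallel for collapse(2) proc_bind(close)
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            a[(size_t)i * cols + j] *= factor;
        }
    }
}
```

Collapsing helps most when the outer loop alone has too few iterations to keep every thread busy. The set of places that proc_bind refers to is typically described through the OMP_PLACES environment variable.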

OpenMP’s evolution has led it to adopt several additional forms of parallelism. I am often annoyed to hear people in our community say that we need a standardized mechanism for task-based parallelism. OpenMP has provided exactly that for the last nine years! In 2008, the OpenMP 3.0 specification was adopted with a complete task-based model. While I acknowledge that OpenMP tasking implementations could still be improved, we face a chicken-and-egg problem. I often hear users state that they will use OpenMP tasking when implementations consistently deliver strong performance with the model. However, implementers also frequently state that they will optimize their tasking implementation once they see sufficient adoption of the model. Another issue for the OpenMP tasking model derives from one of OpenMP’s strengths: it provides consistent syntax and semantics for a range of parallelization strategies that can all be used in a single application. Supporting that generality necessarily implies inherent overheads. However, model refinements can alleviate some of that overhead. For example, OpenMP 3.1 added the final clause (and its related concepts) to facilitate specifying a minimum task granularity, which otherwise requires complex compiler analysis or much more complex program structure. As we work toward future OpenMP specifications, we are continuing to identify ways to reduce the overhead of OpenMP tasking.
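To make the role of the final clause concrete, here is a minimal sketch of a recursive Fibonacci computation written with OpenMP tasks. The function names and the cutoff value FIB_TASK_CUTOFF are hypothetical and chosen only for illustration; a sensible cutoff depends on the implementation and the machine. Without OpenMP enabled, the pragmas are ignored and the code runs as ordinary serial recursion.

```c
#include <assert.h>

/* Hypothetical granularity threshold: at or below this value of n,
   tasks are created as final tasks. */
#define FIB_TASK_CUTOFF 20

static long fib_task(int n)
{
    if (n < 2) {
        return n;
    }

    long x, y;

    /* When the final expression is true, the generated task is a final
       task: every task created inside it is included, meaning it is
       executed immediately rather than deferred, which bounds the
       tasking overhead for small subproblems. */
    #pragma omp task shared(x) final(n <= FIB_TASK_CUTOFF)
    x = fib_task(n - 1);

    #pragma omp task shared(y) final(n <= FIB_TASK_CUTOFF)
    y = fib_task(n - 2);

    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return x + y;
}

long fib(int n)
{
    assert(n >= 0);
    long result = 0;

    /* One thread creates the root of the task tree; the other threads
       in the team pick up the tasks it generates. */
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);

    return result;
}
```

Calling fib(30), for instance, generates deferred tasks only near the top of the recursion tree, while the small subproblems below the cutoff run without further deferral.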

OpenMP added support for additional forms of parallelism with the release of OpenMP 4.0. In particular, OpenMP now includes support for accelerator-based parallelization through its device constructs and for SIMD (or vector) parallelism. The support for the latter is particularly reminiscent of OpenMP’s origins. Support for SIMD parallelism through implementation-specific directives had become widespread, most frequently in the form of ivdep. However, the supported clauses varied widely and, in some cases, the spelling of the directives was also different. More importantly, the semantics often varied in subtle ways that could lead directives to be correct with one compiler but not with another. The addition of SIMD constructs to OpenMP largely solves these problems, just as the creation of OpenMP 20 years ago solved them for loop-based directives. Of course, the initial set of SIMD directives is not perfect and we continue to refine them. For example, OpenMP 4.5 added the simdlen clause, which allows the user to specify the preferred SIMD vector length to use.
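A minimal sketch of the SIMD construct with the simdlen clause might look like the following. The saxpy routine and the preferred length of 8 are assumptions for this example; simdlen on the simd construct requires a compiler that supports OpenMP 4.5, and it states a preference rather than a guarantee. The caller is assumed to pass arrays that do not overlap.

```c
#include <assert.h>

/* Hypothetical example routine: y = alpha * x + y for n elements.
   The arrays x and y are assumed not to overlap. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    assert(n >= 0);

    /* The simd directive tells the compiler that the loop iterations may
       be executed together in SIMD lanes. simdlen(8) expresses a
       preference for 8 iterations per SIMD chunk. */
    #pragma omp simd simdlen(8)
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Unlike an implementation-specific hint, the same directive is expected to carry the same meaning with every conforming compiler.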

The OpenMP target construct specifies that computation in a structured block should be offloaded to a device. Additional constructs that were added in the 4.0 specification support efficient parallelization on devices such as GPUs. Importantly, all OpenMP constructs can be used within a target region, which means that the years of advances in directive-based parallelization are available for a range of devices. While one still must consider which forms of parallelism will best map to specific types of devices, as well as to the algorithms being parallelized, the orthogonality of OpenMP constructs greatly increases the portability of programs that need to target multiple architectures.
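The following minimal sketch shows one common way to combine the target construct with the other device constructs. The vector_add function is a hypothetical example, and the particular combination of teams, distribute, and parallel for is only one of several reasonable mappings. If no device is available, implementations generally fall back to running the region on the host, and without OpenMP enabled the pragma is ignored entirely.

```c
#include <assert.h>

/* Hypothetical example routine: c = a + b for n elements. */
void vector_add(int n, const double *a, const double *b, double *c)
{
    assert(n >= 0);

    /* target offloads the loop to a device. The map clauses copy the
       inputs to the device and the result back to the host. teams and
       distribute spread chunks of iterations across teams of threads,
       and parallel for shares each chunk among the threads of a team. */
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The loop body is the same code one would write for a host-only parallel loop; only the directive changes.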

We are actively working on the OpenMP 5.0 specification, with its release planned for November 2018. We have already adopted several significant extensions. OpenMP TR4 documents the progress on OpenMP 5.0 through last November. We will also release a TR this November that will document our continued progress. The most significant addition in TR4 is probably OMPT, which extends OpenMP with a tool interface. I anticipate that OpenMP 5.0 will include OMPD, an additional tool interface that will facilitate debugging OpenMP applications. TR4 included many other extensions, perhaps most notably the addition of task reductions.

We expect that OpenMP 5.0 will provide several major extensions that were not included in TR4. Perhaps most notably, OpenMP will greatly enhance support for complex memory hierarchies. First, I expect it to include some form of the memory allocation mechanism that TR5 documents. This mechanism will provide an intuitive interface for programmers to portably indicate which memory should be used for particular variables or dynamic allocations in systems with multiple types of memory (e.g., DDR4 and HBM). Further, I expect that OpenMP will provide a powerful serialization and deserialization interface that will support deep copy of complex data structures. This interface will eventually enable data-dependent layout transformations and associated optimizations with device constructs.

As I have already mentioned, the OpenMP organization is healthy. We have added several members in the past few years, both implementers and user organizations. We are currently discussing changes to the OpenMP bylaws to support even wider participation. If you are interested in shaping the future of OpenMP, we are always happy to have more participation! Visit openmp.org for more information about OpenMP.

Bronis R. de Supinski is the chief technology officer of Livermore Computing, the supercomputing center at Lawrence Livermore National Laboratory. He is also the chair of the OpenMP Language Committee.

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes. This work, LLNL-MI-732158, was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.