Begin Parallel Programming With OpenMP

uusheikh

4.70/5 (65 votes)

5 Jun 2007CPOL7 min read

Improve program performance using OpenMP on multi-core systems

Introduction

Parallel processing has always been an interesting method to improve program performance. Lately, there are more computers with multi processors or multi-core CPUs thus making parallel processing available to the masses. However, having multiple processing units on the hardware does not make existing programs take advantage of it. Programmers must take the initiative to implement parallel processing capabilities to their programs to fully utilize the hardware available. OpenMP is a set of programming APIs which include several compiler directives and a library of support functions. It was first developed for use with Fortran and now it is available for C and C++ as well.

Types of Parallel Programming

Before we begin with OpenMP, it is important to know why we need parallel processing. In a typical case, a sequential code will execute in a thread which is executed on a single processing unit. Thus, if a computer has 2 processors or more (or 2 cores, or 1 processor with HyperThreading), only a single processor will be used for execution, thus wasting the other processing power. Rather than letting the other processor sit idle (or process other threads from other programs), we can use it to speed up our algorithm.

Parallel processing can be divided into two groups, task based and data based.

Task based: Divide different tasks to different CPUs to be executed in parallel. For example, a Printing thread and a Spell Checking thread running simultaneously in a word processor. Each thread is a separate task.
Data based: Execute the same task, but divide the work load on the data over several CPUs. For example, to convert a color image to grayscale. We can convert the top half of the image on the first CPU, while the lower half is converted on the second CPU (or as many CPUs as you have), thus processing in half the time.

There are several methods to do parallel processing:

Use MPI: Message Passing Interface - MPI is most suited for a system with multiple processors and multiple memory. For example, a cluster of computers with their own local memory. You can use MPI to divide workload across this cluster, and merge the result when it is finished. Available with Microsoft Compute Cluster Pack.
Use OpenMP: OpenMP is suited for shared memory systems like we have on our desktop computers. Shared memory systems are systems with multiple processors but each are sharing a single memory subsystem. Using OpenMP is just like writing your own smaller threads but letting the compiler do it. Available in Visual Studio 2005 Professional and Team Suite.
Use SIMD intrinsics: Single Instruction Multiple Data (SIMD) has been available on mainstream processors such as Intel's MMX, SSE, SSE2, SSE3, Motorola's (or IBM's) Altivec and AMD's 3DNow!. SIMD intrinsics are primitive functions to parallelize data processing on the CPU register level. For example, the addition of two unsigned char will take the whole register size, although the size of this data type is just 8-bit, leaving 24-bit in the register to be filled with 0 and wasted. Using SIMD (such as MMX), we can load 8 unsigned chars (or 4 shorts or 2 integers) to be executed in parallel on the register level. Available in Visual Studio 2005 using SIMD intrinsics or with Visual C++ Processor Pack with Visual C++ 6.0.

Getting Started

Setting Up Visual Studio

To use OpenMP with Visual Studio 2005, you need:

Visual Studio 2005 Professional or better
A multi processor or multi core system to see speed improvement
An algorithm to parallelize!

Visual Studio 2005 uses OpenMP standard 2.0 and is located in the vcomp.dll (Located in Windows>WinSxS> (ia64/amd64/x86)Microsoft.VC80.(Debug)OpenMP.(*) directory) dynamic library.

To use OpenMP, you must do the following:

Include <omp.h>
Enable OpenMP compiler switch in Project Properties
That's all!

Understanding the Fork-and-Join Model

OpenMP uses the fork-and-join parallelism model. In fork-and-join, parallel threads are created and branched out from a master thread to execute an operation and will only remain until the operation has finished, then all the threads are destroyed, thus leaving only one master thread.

The process of splitting and joining of threads including synchronization for end result are handled by OpenMP.

How Many Threads Do I Need?

A typical question is how many threads do I actually need? Are more threads better? How do I control the number of threads in my code? Are number of threads related to number of CPUs?

The number of threads required to solve a problem is generally limited to the number of CPUs you have. As you can see in the Fork-and-Join figure above, whenever threads are created, a little time is taken to create a thread and later to join the end result and destroy the threads. When the problem is small, and the number of CPUs are less than the number of threads, the total execution time will be longer (slower) because more time has been spent to create threads, and later switch between the threads (due to preemptive behaviour) then to actually solve the problem. Whenever a thread context is switched, data must be saved/loaded from the memory. This takes time.

The rule is simple, since all the threads will be executing the same operation (hence the same priority), 1 thread is sufficient per CPU (or core). The more CPUs you have, the more threads you can create.

Most compiler directives in OpenMP use the Environment Variable OMP_NUM_THREADS to determine the number of threads to create. You can control the number of threads with the following functions:

C++

// Get the number of processors in this system
int iCPU = omp_get_num_procs();

// Now set the number of threads
omp_set_num_threads(iCPU);

Of course, you can put any value for iCPU in the code above (if you do not want to call omp_get_num_procs), and you can call the omp_set_num_threads functions as many times as you like for different parts of your code for maximum control. If omp_set_num_threads is not called, OpenMP will use the OMP_NUM_THREADS Environment Variable.

Parallel for Loop

Let us start with a simple parallel for loop. The following is a code to convert a 32-bit Color (RGBA) image to 8-bit Grayscale image.

C++

// pDest is an unsigned char array of size width * height
// pSrc is an unsigned char array of size width * height * 4 (32-bit) 
// To avoid floating point operation, all floating point weights 
// in the original grayscale formula
// has been changed to integer approximation 

// Use pragma for to make a parallel for loop 
omp_set_num_threads(threads); 

#pragma omp parallel for
for(int z = 0; z < height*width; z++) 
{ 
    pDest[z] = (pSrc[z*4+0]*3735 + pSrc[z*4 + 1]*19234+ pSrc[z*4+ 2]*9797)>>15; 
}

The #pragma omp parallel for directive will parallelize the for loop according to the number of threads set. The following is the performance gained for a 3264x2488 image on a 1.66GHz Core Duo system (2 Cores).

Thread(s) : 1 Time 0.04081 sec 
Thread(s) : 2 Time 0.01906 sec 
Thread(s) : 4 Time 0.01940 sec 
Thread(s) : 6 Time 0.02133 sec 
Thread(s) : 8 Time 0.02029 sec

As you can see, by executing the problem using 2 threads on a dual-core CPU, the time has been cut by half. However, as the number of threads is increased, the performance does not due to increased time to fork and join.

Parallel double for Loop

The same problem above (converting color to grayscale) can also be written in a double for loop way. This can be written like this:

C++

for(int y = 0; y < height; y++) 
    for(int x = 0; x< width; x++) 
     pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 + 
	pSrc[x*4 + y*4*width + 1]*19234+ pSrc[x*4 + y*4*width + 2]*9797)>>15;

In this case, there are two solutions.

Solution 1

We have made the inner loop parallel using the parallel for directive. As you can see, when 2 threads are being used, the execution time has actually increased! This is because, for every iteration of y, a fork-join operation is performed, and at the end contributed to the increased execution time.

C++

for(int y = 0; y < height; y++) 
{ 
    #pragma omp parallel for
    for(int x = 0; x< width; x++) 
    { 
    pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 + 
	pSrc[x*4 + y*4*width + 1]*19234+ pSrc[x*4 + y*4*width + 2]*9797)>>15; 
    } 
}

Thread(s): 1 Time 0.04260 sec 
Thread(s): 2 Time 0.05171 sec

Solution 2

Instead of making the inner loop parallel, the outer loop is the better choice. Here, another directive is introduced - the private directive. The private clause directs the compiler to make variables private so multiple copies of a variable do not execute. Here, you can see that the execution time did indeed get reduced by half.

C++

int x = 0; 
#pragma omp parallel for private(x) 
for(int y = 0; y < height; y++) 
{ 
    for(x = 0; x< width; x++) 
    { 
    pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 + 
	pSrc[x*4 + y*4*width + 1]*19234+ pSrc[x*4 + y*4*width + 2]*9797)>>15; 
    } 
}

Thread(s) : 1 Time 0.04039 sec 
Thread(s) : 2 Time 0.02020 sec

Get to Know the Number of Threads

At any time, we can obtain the number of OpenMP threads running by calling the function:

C++

int omp_get_thread_num();

Summary

By using OpenMP, you can gain performance on multi-core systems for free, without much coding other than a line or too. There is no excuse not to use OpenMP. The benefits are there, and the coding is simple.

There are several other OpenMP directives such as firstprivate, critical sections and reductions which are left for another article. :)

Points of Interest

You can also perform similar operation using Windows threads, but it will be more difficult to implement, because of issues such as synchronization and thread management.

Additional Notes

The original RGB to Grayscale formula is given as Y = 0.299 R + 0.587 G + 0.114 B.

References

Parallel Programming in C with MPI and OpenMP (1st ed.), Michael J. Quinn, McGraw Hill, 2004
Scientific Parallel Computing, L. Ridgway Scott, Terry Clark, Babak Bagheri, Princeton University Press, 2005
Introduction to Parallel Computing: A Practical Guide with Examples in An W.P. Petersen, P. Arbenz, Oxford University Press, 2004.
Digital Color Imaging Handbook, Gaurav Sharma, CRC Press, 2003

History

June 6, 2007 - Added - MPI capabilities available with Microsoft Compute Cluster Pack
June 6, 2007 - Initial article (This is my first article on CodeProject!)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)