
Parallel Computing and Data Parallelism

2 Feb 2010 · CPOL · 14 min read
An article that focuses on Data Parallelism based on the Multi-Core Processor Technology.

Abstract

There is a trend in the microprocessor industry away from ever-increasing clock speeds and toward multi-core chips with separate execution units, and exploiting those cores requires sound multithreading practices. The future of microprocessor programming and development may well standardize on multi-core technology. Multi-core is the term used to describe two or more CPUs working together on the same chip: a single physical processor contains the core logic of two or more processors, packaged into a single integrated circuit (IC), or die. Multi-core can also refer to multiple dies packaged together. Multi-core designs enable the system to perform more tasks with greater overall system performance, and they are used in desktops, mobile PCs, servers, and workstations. Dual-core is simply the special case of a single chip containing two separate processors (execution cores) in the same IC.

In this parallel computing movement, developers have had to look at the various ways a sequential algorithm can be decomposed into sub-problems suitable for parallel execution. This can get confusing because documentation often uses the terms concurrency and data parallelism interchangeably. When writing an algorithm to use concurrency, the most important design choice the developer must make is how to partition the original problem into individual sub-parts. There are three broad approaches to this design issue: data, task, and message parallelism. The purpose of this article is to focus on the first approach: data parallelism.

In order to run the sample code, the user must have either Microsoft Visual Studio 2005 or Visual Studio 2008 (although Visual Studio 2010 is better suited to parallel computing because of the Parallel Extensions included with .NET 4). The library of choice is the Intel Threading Building Blocks (TBB) performance library. This means that the reader must download an evaluation of the Intel C++ Compiler for Windows, install it so that it integrates with Microsoft Visual Studio, and then install the Intel TBB performance library. The components of TBB are defined in the namespace tbb. The reason I am choosing the Intel TBB is that Microsoft and Intel have worked collaboratively to examine the best usage of these cores to provide a sturdier platform for Windows 7; if they are among the companies perfecting this multi-core technology, then it makes sense to use one of their threading libraries to learn about parallel computing. Because the focus of this paper is implicit threading, we will examine the OpenMP compiler directives as well. We'll conclude with a brief description of the parallel computing mechanisms Microsoft provides in Visual Studio 2010 and .NET 4.

A Quick Note on Installing the Intel TBB

On Windows systems, you have to modify your Visual Studio project build configurations (Debug, Release) as follows: under C/C++ Properties, General, add the include directory $(TBB22_INSTALL_DIR)\include. Under Linker Properties, General, add the additional library directory $(TBB22_INSTALL_DIR)\ia32\vc8\lib. Under Linker Properties, Input, add tbb_debug.lib (Debug) or tbb.lib (Release) as an additional dependency.

What is Data Parallelism?

Data parallelism uses the input data to some operation as the means of partitioning work into smaller pieces. Data is divided up among the available processors in order to achieve parallelism. This partitioning step is typically followed by replicating a mostly independent program operation and executing it across the partitions; usually, the same operation is applied concurrently to the elements of the dataset. Consider simple loop parallelism: when data parallelism is used, the first thing to consider is how to break the iteration space into independent units of work. This is why this article focuses on the Intel Threading Building Blocks performance library as a way to achieve better performance by exploiting multi-core technology, and why it touches on data parallelism by examining some of the generic algorithm templates for loop parallelism provided by the Intel TBB. The reader should note that not every part of a program consists of loops, but for the parts that can use it, loop parallelism should be the first choice. But what exactly is the Intel TBB?

In short, the Intel TBB is a C++ template-based library, modeled on the C++ Standard Template Library (STL), for loop-level parallelism that concentrates on defining tasks rather than explicit threads (such as POSIX threads). Generic parallel algorithms, concurrent containers, low-level synchronization primitives, and a task scheduler comprise the Intel TBB. Developers using TBB can parallelize the execution of loop iterations by treating chunks of iterations as tasks and allowing the TBB task scheduler to determine the task sizes, the number of threads to use, the assignment of tasks to those threads, and how those threads are scheduled for execution. This data parallelism methodology increases scalability. The upper limit on parallelism is typically much larger because loop counts are often large and depend on the dynamic size of the data being operated on. The amount of data programs must operate on normally grows over time, and while growth in processor clock speeds has slowed, growth in data has not. Growth in data size in a data parallel program translates into more parallelism opportunities that can scale to use additional processors as they become available. This is why the Intel TBB performance library should play a vital role in developing applications that scale.

The Intel TBB is a performance library geared toward achieving scalability and data parallelism by taking advantage of multi-core processors. It is a threading library designed to be thread-safe, and it uses iterators and algorithms modeled on the ISO C++ Standard Template Library. Intel TBB also defines concurrent containers for hash tables, vectors, and queues. The C++ STL containers are not thread-safe; the TBB containers are designed for safe use when multiple threads attempt concurrent access. Not only can you use these containers in conjunction with the TBB parallel algorithms, but you can also use them within concurrent code implemented with other threading libraries.
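
As a brief, hedged sketch (the container choice and the even-number filter here are purely illustrative, not part of the TBB samples), tbb::concurrent_vector can accept push_back calls from several tasks at once, something std::vector cannot safely do:

C++
#include "tbb/concurrent_vector.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Illustrative body object: concurrent_vector::push_back is thread-safe,
// so each task can append its results without any external locking.
class CollectEvens {
    tbb::concurrent_vector<int>& my_out;
public:
    CollectEvens( tbb::concurrent_vector<int>& out ) : my_out(out) {}
    void operator()( const tbb::blocked_range<int>& r ) const {
        for( int i=r.begin(); i!=r.end(); ++i )
            if( i % 2 == 0 )
                my_out.push_back(i);   // safe under concurrent access
    }
};

// Usage (element order is unspecified because chunks run concurrently;
// create a tbb::task_scheduler_init in main if your TBB version requires it):
//   tbb::concurrent_vector<int> evens;
//   tbb::parallel_for( tbb::blocked_range<int>(0, 1000), CollectEvens(evens) );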

Learning the Intel Threading Building Blocks involves understanding recursion, task-stealing, and algorithm templates. Algorithm templates play the most visible role in TBB. TBB offers the following generic parallel algorithms for loop parallelism: the parallel_for and parallel_reduce algorithm templates and the parallel_scan algorithm template. The latter, known as scan, is a template function that computes a prefix computation in parallel (y[i] = y[i-1] op x[i]). Having said that, let's now consider the parallel_for template of TBB, starting with a simple loop example. Suppose you want to apply a function Foo to each element of an array, and it is safe to process each element concurrently. Below is the original sequential loop:

C++
void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i<n; ++i )
        Foo(a[i]);
}

The iteration space here is of type size_t, and it goes from 0 to n-1. The type size_t corresponds to the unsigned integral type returned by the language operator sizeof and is defined in the cstddef header file (among others). The template function tbb::parallel_for breaks this iteration space into chunks and runs each chunk on a separate thread. The first step in parallelizing this loop is to convert the loop body into a form that operates on a chunk. The form is a Standard Template Library (STL)-style function object, called the body object, whose operator() processes a chunk. The class shown below declares the body object.

C++
#include "tbb/blocked_range.h"
class ApplyFoo {
float *const my_a;
public:
    void operator( )( const blocked_range<size_t />& r ) const {
   float *a = my_a;
   for( size_t i=r.begin(); i!=r.end( ); ++i )
 Foo(a[i]);
      }
  ApplyFoo( float a[] ) :
  my_a(a)
     {}
};

Notice the iteration space argument to operator(). blocked_range<T> is a template class provided by the library. It describes a one-dimensional iteration space over type T. The class parallel_for works with other kinds of iteration spaces too; the library provides blocked_range2d for two-dimensional spaces. An instance of ApplyFoo needs member fields that remember all the local variables that were defined outside the original loop but used inside it. Usually, the constructor for the body object initializes these fields, though parallel_for does not care how the body object is created.
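
To complete the picture, here is a minimal sketch of the call that ties the range and the body object together (it follows the pattern used in the TBB tutorial and assumes using namespace tbb as in the other listings, plus a tbb::task_scheduler_init object in main; auto_partitioner asks the library to choose the chunk sizes):

C++
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

// Parallel counterpart of SerialApplyFoo: parallel_for splits the
// half-open range [0, n) into chunks and applies ApplyFoo to each chunk.
void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for( blocked_range<size_t>(0, n), ApplyFoo(a), auto_partitioner() );
}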

The parallel_for template parallelizes tasks that are contained within a for loop. The template requires two parameters: a range type over which to iterate, and a body type that executes iterations over the range or subrange. The range class must define a copy constructor and a destructor, the methods empty() (which returns true if the range is empty) and is_divisible() (which returns true if the range can be split), and a splitting constructor (to divide the range in half). A partitioner class object can be used to heuristically find the smallest number of iterations that should be assigned to a chunk. Again, the Intel TBB library provides two predefined range types: blocked_range and blocked_range2d. The body class must define a copy constructor and a destructor as well as operator(). The operator() contains a copy of the original serial loop, modified to run over a subrange of values supplied by the range type.
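
As a hedged illustration of the second predefined range type (the row-major matrix and the ZeroMatrix body are hypothetical, not taken from the TBB samples), a blocked_range2d hands the body nested row and column subranges:

C++
#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"

// Hypothetical body that zeroes a row-major rows x cols matrix in parallel.
class ZeroMatrix {
    float *const my_m;
    const size_t my_cols;
public:
    ZeroMatrix( float m[], size_t cols ) : my_m(m), my_cols(cols) {}
    void operator()( const tbb::blocked_range2d<size_t>& r ) const {
        for( size_t i=r.rows().begin(); i!=r.rows().end(); ++i )
            for( size_t j=r.cols().begin(); j!=r.cols().end(); ++j )
                my_m[i*my_cols + j] = 0.0f;
    }
};

// Usage:
//   tbb::parallel_for( tbb::blocked_range2d<size_t>(0, rows, 0, cols),
//                      ZeroMatrix(m, cols) );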

The next template involved with loop parallelism is parallel_reduce. Both templates provide load-balanced, parallel execution of a fixed number of independent loop iterations, and parallel_reduce has the same range requirements as parallel_for, but it iterates over a range and combines the partial results computed by each task. The body type therefore also needs a splitting constructor and a join method. The splitting constructor in the body copies the read-only data required to run the loop body and assigns the identity element of the reduction operation, which initializes the reduction variable. The join method combines the partial results of tasks based on the reduction operation being used. Consider this example that uses the parallel_reduce algorithm to launch threads and compute a numerical integration. The task scheduler breaks the iteration range into smaller chunks, and each chunk of loop iterations is treated as a separate task that can be executed by a thread. Once the partialHeight variables calculated within each task are added together through the join method, the final sum of heights is multiplied by the width and the computed approximation of PI is printed:

C++
#include <stdio.h>
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

using namespace std;
using namespace tbb;
static long num_rects = 100000;
class MyPi {
    double *const my_rects;
public:
    double partialHeight;
    MyPi(double *const width) : my_rects(width), partialHeight(0) {}
    MyPi(MyPi& x, split) : my_rects(x.my_rects), partialHeight(0) {}

    void operator()(const blocked_range<size_t>& r) {
        double rectangleWidth = *my_rects;
        double x;
        for (size_t i = r.begin(); i != r.end(); ++i) {
            x = (i + 0.5) * rectangleWidth;
            partialHeight += 4.0/(1.+ x*x);
        }
    }

    void join(const MyPi& y) {partialHeight += y.partialHeight;}
};

int main(int argc, char* argv[])
{
    double area;
    double width = 1./(double)num_rects;
    MyPi my_block((double *const)&width);
    task_scheduler_init init;
    parallel_reduce(blocked_range<size_t>(0,num_rects), 
                    my_block, auto_partitioner());
    area = my_block.partialHeight * width;
    printf("The value of PI is %f\n",area);
    return 0;
}

To compile this code on the command line, the Intel compiler's bin directory must be on your PATH:

set PATH=%PATH%;.;C:\Program Files\Intel\Compiler\11.1\054\bin\ia32

Alternatively, if you open a command prompt in that directory, you merely have to execute the iclvars_ia32.bat command first. Then compile with the /MD switch: icl.exe /MD ThePi.cpp tbb_debug.lib. A successful run prints: The value of PI is 3.141593.

This next example comes straight from the Intel TBB examples directory, and also uses the parallel_reduce algorithm. You can download examples like these and others from http://www.threadingbuildingblocks.org.

C++
// Example program that computes number of prime numbers up to n,
// where n is a command line argument. The algorithm here is a 
// fairly efficient version of the sieve of Eratosthenes. 
// The parallel version demonstrates how to use parallel_reduce,
// and in particular how to exploit lazy splitting.

#include <cassert>
#include <cstdio>
#include <math.h>
#include <cstdlib>
#include <cstring>
#include <cctype>
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"
using namespace std;
using namespace tbb;

typedef unsigned long Number;

//! If true, then print primes on stdout.
static bool PrintPrimes = false;

//! Grainsize parameter
static Number GrainSize = 1000;

class Multiples {
    inline Number strike( Number start, Number limit, Number stride ) {
        // Hoist "my_is_composite" into register for sake of speed.
        bool* is_composite = my_is_composite;
        assert( stride>=2 );
        for( ;start<limit; start+=stride ) 
            is_composite[start] = true;
        return start;
    }
    //! Window into conceptual sieve 
    bool* my_is_composite;

    //! Indexes into window
    /** my_striker[k] is an index into my_is_composite corresponding to
        an odd multiple of my_factor[k]. */
    Number* my_striker;

    //! Prime numbers less than m.
    Number* my_factor;
public:
    //! Number of factors in my_factor.
    Number n_factor;
    Number m;
    Multiples( Number n ) : 
        is_forked_copy(false) 
    {
        m = Number(sqrt(double(n)));
        // Round up to even
        m += m&1;
        my_is_composite = new bool[m/2];
        my_striker = new Number[m/2];
        my_factor = new Number[m/2];
        n_factor = 0;
        memset( my_is_composite, 0, m/2 );
        for( Number i=3; i<m; i+=2 ) {
            if( !my_is_composite[i/2] ) {
                if( PrintPrimes )
                    printf("%d\n",(int)i);
                my_striker[n_factor] = strike( i/2, m/2, i );
                my_factor[n_factor++] = i;
            }
        }
    }

    //! Find primes in range [start,window_size), advancing my_striker as we go.
    /** Returns number of primes found. */
    Number find_primes_in_window( Number start, Number window_size ) {
        bool* is_composite = my_is_composite;
        memset( is_composite, 0, window_size/2 );
        for( size_t k=0; k<n_factor; ++k )
            my_striker[k] = strike( my_striker[k]-m/2, window_size/2, my_factor[k] );
        Number count = 0;
        for( Number k=0; k<window_size/2; ++k ) {
            if( !is_composite[k] ) {
                if( PrintPrimes )
                    printf("%ld\n",long(start+2*k+1));
                ++count;
            }
        }
        return count;
    }

    ~Multiples() {
        if( !is_forked_copy )
            delete[] my_factor;
        delete[] my_striker;
        delete[] my_is_composite;
    }

    //! True if this instance was forked from another instance.
    const bool is_forked_copy;

    Multiples( const Multiples& f, split ) :
        n_factor(f.n_factor),
        m(f.m),
        my_is_composite(NULL),
        my_striker(NULL),
        my_factor(f.my_factor),
        is_forked_copy(true)
    {}

    bool is_initialized() const {
        return my_is_composite!=NULL;
    }

    void initialize( Number start ) { 
        assert( start>=1 );
        my_is_composite = new bool[m/2];
        my_striker = new Number[m/2];
        for( size_t k=0; k<n_factor; ++k ) {
            Number f = my_factor[k];
            Number p = (start-1)/f*f % m;
            my_striker[k] = (p&1 ? p+2*f : p+f)/2;
            assert( m/2<=my_striker[k] );
        }
    }
};

//! Count number of primes between 0 and n
/** This is the serial version. */
Number SerialCountPrimes( Number n ) {
    // Two is special case
    Number count = n>=2;
    if( n>=3 ) {
        Multiples multiples(n);
        count += multiples.n_factor;
        if( PrintPrimes ) 
            printf("---\n");
        Number window_size = multiples.m;
        for( Number j=multiples.m; j<=n; j+=window_size ) { 
            if( j+window_size>n+1 ) 
                window_size = n+1-j;
            count += multiples.find_primes_in_window( j, window_size );
        }
    }
    return count;
}

//! Range of a sieve window.
class SieveRange {
    //! Width of full-size window into sieve.
    const Number my_stride;

    //! Always multiple of my_stride
    Number my_begin;

    //! One past last number in window.
    Number my_end;

    //! Width above which it is worth forking.
    const Number my_grainsize;

    bool assert_okay() const {
        assert( my_begin%my_stride==0 );
        assert( my_begin<=my_end );
        assert( my_stride<=my_grainsize );
        return true;
    } 
public:
    // Methods required by the Range concept used with parallel_reduce:
    bool is_divisible() const {return my_end-my_begin>my_grainsize;}
    bool empty() const {return my_end<=my_begin;}
    SieveRange( SieveRange& r, split ) : 
        my_stride(r.my_stride), 
        my_grainsize(r.my_grainsize),
        my_end(r.my_end)
    {
        assert( r.is_divisible() );
        assert( r.assert_okay() );
        Number middle = r.my_begin + (r.my_end-r.my_begin+r.my_stride-1)/2;
        middle = middle/my_stride*my_stride;
        my_begin = middle;
        r.my_end = middle;
        assert( assert_okay() );
        assert( r.assert_okay() );
    }

    Number begin() const {return my_begin;}
    Number end() const {return my_end;}
    SieveRange( Number begin, Number end, Number stride, Number grainsize ) :
        my_begin(begin),
        my_end(end),
        my_stride(stride),      
        my_grainsize(grainsize<stride?stride:grainsize)
    {
        assert( assert_okay() );
    }
};

//! Loop body for parallel_reduce.
/** parallel_reduce splits the sieve into subsieves.
    Each subsieve handles a subrange of [0..n]. */
class Sieve {
public:
    //! Prime multiples to consider, and working storage for this subsieve.
    Multiples multiples;

    //! Number of primes found so far by this subsieve.
    Number count;

    //! Construct Sieve for counting primes in [0..n].
    Sieve( Number n ) : 
        multiples(n),
        count(0)
    {}

    
    void operator()( const SieveRange& r ) {
        Number m = multiples.m;
        if( multiples.is_initialized() ) { 
            // Simply reuse "multiples" structure from previous window
            // This works because parallel_reduce always applies
            // *this from left to right.
        } else {
            // Need to initialize "multiples" because *this is a forked copy
            // that needs to be set up to start at r.begin().
            multiples.initialize( r.begin() );
        }
        Number window_size = m;
        for( Number j=r.begin(); j<r.end(); j+=window_size ) { 
            assert( j%multiples.m==0 );
            if( j+window_size>r.end() ) 
                window_size = r.end()-j;
            count += multiples.find_primes_in_window( j, window_size );
        }
    }
    void join( Sieve& other ) {
        count += other.count;
    }
    Sieve( Sieve& other, split ) : 
        multiples(other.multiples,split()),
        count(0)
    {}
    // End of signatures required by parallel_reduce
    
};

//! Count number of primes between 0 and n
/** This is the parallel version. */
Number ParallelCountPrimes( Number n ) {
    // Two is special case
    Number count = n>=2;
    if( n>=3 ) {
        Sieve s(n);
        count += s.multiples.n_factor;
        if( PrintPrimes ) 
            printf("---\n");
        // Explicit grain size and simple_partitioner()
        // used here instead of automatic grainsize 
        // determination because we want SieveRange
        // to be decomposed down to GrainSize or smaller.
        // Doing so improves odds that the working
        // set fits in cache when evaluating Sieve::operator().
        parallel_reduce( SieveRange( s.multiples.m, n, s.multiples.m, GrainSize ), 
                         s, simple_partitioner() );
        count += s.count;
    }
    return count;
}

//! A closed range of Number.
struct NumberRange {
    Number low;
    Number high;
    void set_from_string( const char* s );
    NumberRange( Number low_, Number high_ ) : low(low_), high(high_) {}
};

void NumberRange::set_from_string( const char* s ) {
    char* end;
    high = low = strtol(s,&end,0);
    switch( *end ) {
    case ':': 
        high = strtol(end+1,0,0); 
        break;
    case '\0':
        break;
    default:
        printf("unexpected character = %c\n",*end);
    }
}

//! Number of threads to use.
static NumberRange NThread(0,4);

//! If true, then at end wait for user to hit return
static bool PauseFlag = false;

//! Parse the command line.
static Number ParseCommandLine( int argc, char* argv[] ) {
    Number n = 100000000;
    int i = 1;
    if( i<argc && strcmp( argv[i], "pause" )==0 ) {
        PauseFlag = true;
        ++i;
    }
    if( i<argc && !isdigit(argv[i][0]) ) { 
        // Command line is garbled.
        fprintf(stderr,"Usage: %s [['pause'] n [nthread [grainsize]]]\n", argv[0]);
        fprintf(stderr,"where n is a positive integer [%lu]\n",n);
        fprintf(stderr,"      nthread is a non-negative integer, " 
                       "or range of the form low:high [%ld:%lu]\n", 
                       NThread.low,NThread.high);
        fprintf(stderr,"      grainsize is an optional " 
                "postive integer [%lu]\n",GrainSize);
        exit(1);
    }
    if( i<argc )
        n = strtol(argv[i++],0,0);
    if( i<argc )
        NThread.set_from_string(argv[i++]);
    if( i<argc )
        GrainSize = strtol(argv[i++],0,0);
    return n;
}

static void WaitForUser() {
    char c;
    printf("Press return to continue\n");
    do {
        c = getchar();
    } while( c!='\n' );
}

int main( int argc, char* argv[] ) {
    Number n = ParseCommandLine(argc,argv);

    // Try different numbers of threads
    for( Number p=NThread.low; p<=NThread.high; ++p ) {
        task_scheduler_init init(task_scheduler_init::deferred);
        // If p!=0, we are doing a parallel run
        if( p ) 
            init.initialize(p);

        Number count;
        tick_count t0 = tick_count::now();
        if( p==0 ) {
            count = SerialCountPrimes(n);
        } else {
            count = ParallelCountPrimes(n);
        }
        tick_count t1 = tick_count::now();

        printf("#primes from [2..%lu] = %lu (%.2f sec with ",
            (unsigned long)n, (unsigned long)count, (t1-t0).seconds());
        if( p ) 
            printf("%lu-way parallelism)\n", p );
        else 
            printf("serial code)\n");
    }
    if( PauseFlag ) {
        WaitForUser();
    }
    return 0;
}

The output for this code is interesting, as it shows how the timings improve with loop parallelism. Go to the C:\Program Files\Intel\Compiler\11.1\054\bin\ia32 directory, run iclvars_ia32.bat, and then compile with the Intel C++ compiler.

#primes from [2..100000000] = 5761455 (0.29 sec with serial code)
#primes from [2..100000000] = 5761455 (0.30 sec with 1-way parallelism)
#primes from [2..100000000] = 5761455 (0.16 sec with 2-way parallelism)
#primes from [2..100000000] = 5761455 (0.16 sec with 3-way parallelism)
#primes from [2..100000000] = 5761455 (0.16 sec with 4-way parallelism)

With two or more threads, the run time drops from 0.29 seconds to 0.16 seconds, an improvement of a little less than 50%.

Look in the C:\Program Files\Intel\TBB\v2.2\examples\Getting Started\sub_string_finder\ directory:

There are three sub_string_finder examples that use the parallel_for algorithm template: sub_string_finder.cpp, sub_string_finder_pretty.cpp, and sub_string_finder_extended.cpp. Let's look at sub_string_finder_extended.cpp:

C++
#include <iostream>
#include <string>
#include <algorithm>

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"

using namespace tbb;
using namespace std;
static const size_t N = 22;

void SerialSubStringFinder ( const string &str, 
                             size_t *max_array, size_t *pos_array) {
    for ( size_t i = 0; i < str.size(); ++i ) {
        size_t max_size = 0, max_pos = 0;
        for (size_t j = 0; j < str.size(); ++j)
            if (j != i) {
                size_t limit = str.size()-( i > j ? i : j );
                for (size_t k = 0; k < limit; ++k) {
                    if (str[i + k] != str[j + k]) break;
                    if (k > max_size) {
                        max_size = k;
                        max_pos = j;
                    }
                }
            }
        max_array[i] = max_size;
        pos_array[i] = max_pos;
    }
}

class SubStringFinder {
    const string str;
    size_t *max_array;
    size_t *pos_array;
public:
    void operator() ( const blocked_range<size_t>& r ) const {
        for ( size_t i = r.begin(); i != r.end(); ++i ) {
            size_t max_size = 0, max_pos = 0;
            for (size_t j = 0; j < str.size(); ++j)
                if (j != i) {
                    size_t limit = str.size()-( i > j ? i : j );
                    for (size_t k = 0; k < limit; ++k) {
                        if (str[i + k] != str[j + k]) break;
                        if (k > max_size) {
                            max_size = k;
                            max_pos = j;
                        }
                    }
                }
            max_array[i] = max_size;
            pos_array[i] = max_pos;
        }
    }
    SubStringFinder(string &s, size_t *m, size_t *p) : 
        str(s), max_array(m), pos_array(p) { }
};

int main(int argc, char *argv[]) {

    string str[N] = { string("a"), string("b") };
    for (size_t i = 2; i < N; ++i) str[i] = str[i-1]+str[i-2];
    string &to_scan = str[N-1]; 

    size_t *max = new size_t[to_scan.size()];
    size_t *max2 = new size_t[to_scan.size()];
    size_t *pos = new size_t[to_scan.size()];
    size_t *pos2 = new size_t[to_scan.size()];
    cout << " Done building string." << endl;

    tick_count serial_t0 = tick_count::now();
    SerialSubStringFinder(to_scan, max2, pos2);
    tick_count serial_t1 = tick_count::now();
    cout << " Done with serial version." << endl;

    tick_count parallel_t0 = tick_count::now();
    parallel_for(blocked_range<size_t>(0, to_scan.size(), 100),
                 SubStringFinder( to_scan, max, pos ) );
    tick_count parallel_t1 = tick_count::now();
    cout << " Done with parallel version." << endl;

    for (size_t i = 0; i < to_scan.size(); ++i) {
        if (max[i] != max2[i] || pos[i] != pos2[i]) {
            cout << "ERROR: Serial and Parallel Results are Different!" << endl;
        }
    }
    cout << " Done validating results." << endl;

    cout << "Serial version ran in " 
         << (serial_t1 - serial_t0).seconds() 
         << " seconds" << endl
         << "Parallel version ran in " 
         << (parallel_t1 - parallel_t0).seconds() 
         << " seconds" << endl
         << "Resulting in a speedup of " 
         << (serial_t1 - serial_t0).seconds() / (parallel_t1 - parallel_t0).seconds() 
         << endl;
    delete[] max;
    delete[] pos;
    delete[] max2;
    delete[] pos2;
    return 0;
}

When compiled and executed on an Intel Core Duo processor, this is the result:

Done building string.
Done with serial version.
Done with parallel version.
Done validating results.
Serial version ran in 20.9056 seconds
Parallel version ran in 12.2254 seconds
Resulting in a speedup of 1.71001

So by parallelizing the loop, we get a substantial performance improvement. The first parameter of the call to the parallel_for template function is a blocked_range object that describes the iteration space. Recall that blocked_range is a template class provided by the Intel TBB. Its constructor takes three parameters: the lower bound of the range, the upper bound of the range, and an optional grain size (100 in this call). The second parameter to the parallel_for function is the function object to be applied to each subrange. We implement the body of the parallel_for loop, so at runtime the parallel_for template automatically divides the range into subranges and invokes the SubStringFinder function object on each one. The definition of the SubStringFinder class populates the max and pos array elements for the positions within the given subrange. The call to r.begin() returns the start of the subrange, and the r.end() method returns the end of the subrange.

The discussion above covered a few of the for loop algorithms provided by the TBB; they serve as an example of implicit threading. Now we will briefly look at a second library, OpenMP, which takes a different approach to achieving concurrency. OpenMP implements concurrency through special pragmas and directives inserted into your source code to indicate segments that are to be executed concurrently; these pragmas are recognized and processed by the compiler. Intel TBB, by contrast, uses defined parallel algorithms to execute methods of user-written classes that encapsulate the concurrent operations.

OpenMP directives demarcate code that needs to be executed in parallel (called parallel regions) and control how code is assigned to the threads. For C/C++, OpenMP uses pragmas as directives. All OpenMP pragmas have the same prefix of #pragma omp. This is followed by an OpenMP construct and one or more optional clauses to modify the construct. To define a parallel region within an application, use the parallel construct:

C++
#pragma omp parallel

This pragma will be followed by a single statement or block of code enclosed with curly braces. When the application encounters this statement during execution, it will fork a team of threads, execute all of the statements within the parallel region on each thread, and join the threads after the last statement in the region. In many applications, a large number of independent operations are found in loops. Using the loop work sharing construct in OpenMP, you can split up these loop iterations and assign them to threads for concurrent execution. The parallel for construct will initiate a new parallel region around the single for loop following the pragma and divide the loop iterations among the threads of the team. Upon completion of the assigned iterations, threads sit at the implicit barrier at the end of the parallel region, waiting to join with the other threads.
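
A minimal sketch of that fork/join behavior follows (the printed message is illustrative only; omp.h is included for the thread-number queries):

C++
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        // Each thread in the forked team executes this block once.
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   // implicit barrier here: the team joins before execution continues
    return 0;
}

With the Intel compiler used earlier, add the /Qopenmp switch so that the pragmas are honored rather than ignored.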

It is possible to split up the combined parallel for construct into two pragmas: a parallel construct and the for construct, which must be lexically contained within a parallel region. Use this separation when there is parallel work for the thread team other than the iterations of the loop. You can also attach a schedule clause to the loop work-sharing construct to control how iterations are assigned to threads. The static schedule will divide iterations into blocks and distribute the blocks among threads before the loop iterations begin execution; round robin scheduling is used if there are more blocks than threads. The dynamic schedule will assign one block of iterations per thread in the team; as threads finish the previous set of iterations, a new block is assigned until all blocks have been distributed. There is an optional chunk argument for both static and dynamic scheduling that controls the number of iterations per block.
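
For instance, here is a hedged sketch (the work() function is a hypothetical placeholder for independent per-iteration work) of splitting the combined construct and attaching a dynamic schedule with a chunk of 4 iterations:

C++
#include <stdio.h>

// Hypothetical placeholder for independent per-iteration work.
static void work(int i) { printf("iteration %d\n", i); }

int main()
{
    const int n = 16;
    #pragma omp parallel
    {
        // Other work for the thread team could run here, outside the loop.
        #pragma omp for schedule(dynamic, 4)
        for (int i = 0; i < n; ++i)
            work(i);     // blocks of 4 iterations are handed out as threads finish
    }                    // implicit barrier, then the team joins
    return 0;
}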

Here is another example, using OpenMP, for computing the value of PI:

C++
#include <stdio.h>
static long num_rects = 1000000;
int main(int argc, char* argv[])
{
    double mid, height, width, sum=0.0;
    int i;
    double area;
    width = 1./(double)num_rects;
#pragma omp parallel for private(mid, height) reduction(+:sum)
    for (i=0; i < num_rects; i++) {
        mid = (i + 0.5)*width;
        height = 4.0/(1.+ mid*mid);
        sum += height;
    }
    area = width * sum;
    printf("The value of PI is %f\n",area);
    return 0;
}

The output is The value of PI is 3.141593.

Using Data Parallelism with .NET

When using a .NET language like C#, a for loop's iteration space is typically a range of integers, while foreach loops iterate over the individual elements of a collection. So how do we divvy these up between the cores of the processor; that is, how do we define loop parallelism? For example, if we had to parallelize the loop shown below, how would we decide how many threads to use, how best to schedule them, how to assign iteration ranges to threads, and so on?

C#
void For(int low, int high, Action<int> body)
{
    for (int i = low; i < high; i++)
    {
        body(i);
    }
}

Parallel.For and Parallel.ForEach are static methods of the Parallel class. Here is a program that also computes the value of PI, but uses several constructs to compute it several ways. A timer is added to reveal how much less time the Parallel.For and Parallel.ForEach versions take. Examine the code, which was compiled against .NET 4.0:

C#
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    const int num_steps = 100000000;

    /// <summary>Main method to time various
    /// implementations of computing PI.</summary>
    static void Main(string[] args)
    {
        while (true)
        {
            Time(() => SerialLinqPi());
            Time(() => ParallelLinqPi());
            Time(() => SerialPi());
            Time(() => ParallelPi());
            Time(() => ParallelPartitionerPi());

            Console.WriteLine("----");
            Console.ReadLine();
        }
    }

    /// <summary>Times the execution of a function and outputs
    /// both the elapsed time and the function's result.</summary>
    static void Time<T>(Func<T> work)
    {
        var sw = Stopwatch.StartNew();
        var result = work();
        Console.WriteLine(sw.Elapsed + ": " + result);
    }

    /// <summary>Estimates the value of PI using a LINQ-based implementation.</summary>
    static double SerialLinqPi()
    {
        double step = 1.0 / (double)num_steps;
        return (from i in Enumerable.Range(0, num_steps)
                let x = (i + 0.5) * step
                select 4.0 / (1.0 + x * x)).Sum() * step;
    }

    /// <summary>Estimates the value of PI using a PLINQ-based implementation.</summary>
    static double ParallelLinqPi()
    {
        double step = 1.0 / (double)num_steps;
        return (from i in ParallelEnumerable.Range(0, num_steps)
                let x = (i + 0.5) * step
                select 4.0 / (1.0 + x * x)).Sum() * step;
    }

    /// <summary>Estimates the value of PI using a for loop.</summary>
    static double SerialPi()
    {
        double sum = 0.0;
        double step = 1.0 / (double)num_steps;
        for (int i = 0; i < num_steps; i++)
        {
            double x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        return step * sum;
    }

    /// <summary>Estimates the value of PI using a Parallel.For.</summary>
    static double ParallelPi()
    {
        double sum = 0.0;
        double step = 1.0 / (double)num_steps;
        object monitor = new object();
        Parallel.For(0, num_steps, () => 0.0, (i, state, local) =>
        {
            double x = (i + 0.5) * step;
            return local + 4.0 / (1.0 + x * x);
        }, local => { lock (monitor) sum += local; });
        return step * sum;
    }

    /// <summary>Estimates the value of PI using
    /// a Parallel.ForEach and a range partitioner.</summary>
    static double ParallelPartitionerPi()
    {
        double sum = 0.0;
        double step = 1.0 / (double)num_steps;
        object monitor = new object();
        Parallel.ForEach(Partitioner.Create(0, 
                 num_steps), () => 0.0, (range, state, local) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                double x = (i + 0.5) * step;
                local += 4.0 / (1.0 + x * x);
            }
            return local;
        }, local => { lock (monitor) sum += local; });
        return step * sum;
    }
}

The output:

00:00:06.4096024: 3.14159265359043
00:00:04.0270848: 3.14159265358991
00:00:01.5588931: 3.14159265359043
00:00:01.3326232: 3.14159265358968
00:00:00.8320172: 3.14159265358958

Notice the time differences when using loop parallelism. The serial (sequential) LINQ-based computation runs first and takes over 6 seconds, while the Parallel.For and Parallel.ForEach (range partitioner) versions run toward the end of the program and take around 1 second or less.

Admittedly, this article has only touched the surface of how powerful OpenMP and the Intel Threading Building Blocks are. The interested student of TBB should strive to understand the importance of recursion in those loop parallelism algorithms, and also gain a firm understanding of when to use fine-grained versus coarse-grained locks. This, of course, points to synchronization, mutual exclusion, and other multithreading techniques that ensure that when threads are executing and perhaps accessing shared resources, they do so in an orderly manner. Remember, the first prerequisite for loop parallelism is that the loop iterations are thread-safe.

References

  • MSDN Library, for definitions and proof-of-concept examples
  • C# 4.0 in a Nutshell, by Joseph Albahari and Ben Albahari
  • Concurrent Programming on Windows, by Joe Duffy

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


