Hello everyone,

I have a computer vision algorithm in which the convolution step must run in parallel on multicore processors using OpenMP. The code has the form shown below.

C++
void convolve(Bitmap &src, Kernel &kernel, Bitmap &dst)
{
   int width = src.width, height = src.height;
   #pragma omp parallel for
   for(int y = 0; y < height; ++y){
      for(int x = 0; x < width; ++x){
          kernel.response(src, x, y, dst);
      }
   }
}


Kernel is an abstract interface:

C++
class Kernel {
public:
    virtual ~Kernel() {}
    virtual void response(Bitmap &src, int x, int y, Bitmap &dst) = 0;
};


The problem is that the concrete Kernel implementations are not known to the compiler at this point, and they can be complex. Can this code still be parallelised with OpenMP?

If I run the code it actually runs slower than the serial version, and it visibly stalls when running real-time image recognition. NOTE: this is not the case when OpenMP is disabled. I'm using Visual Studio Express 2013 with "/openmp" enabled.

EDIT:

The example below shows a Kernel implementation for converting a bitmap to grayscale. The convolution operator is an inherently parallel problem, as are most computer vision algorithms, yet parallelising such algorithms with shared memory is hard, especially because of race conditions. Maybe I should consider using GPUs for hardware acceleration of my algorithms. NOTE: the code runs in real time even on a single core, but I wanted to speed things up for mobile platforms, as the code is to run on mobile devices for Augmented Reality apps, so multicore is the way to go.

Mostly, if I use "omp parallel for shared(dst)" the code runs fast with minimal stalling, but I feel the race conditions are still there. Is there a hardware mechanism for avoiding false sharing and race conditions without using the expensive "critical"? I tried "atomic", but it only supports primitive update operations such as addition. And why is the assignment operator not supported by "atomic"?
I only recently came across OpenMP and was excited about it until these issues came up :-(

C++
// Kernel for changing a bitmap to grayscale
class GrayScale : public Kernel {
public:
    void response(Bitmap &src, int x, int y, Bitmap &dst)
    {
        Pixel &sp = src.pData[x + y*src.width];
        Pixel &dp = dst.pData[x + y*dst.width];

        unsigned char gray = static_cast<unsigned char>(0.3*sp.red + 0.5*sp.green + 0.2*sp.blue);

        #pragma omp critical
        {
            dp.red = dp.green = dp.blue = gray;
        }
    }
};


where

C++
// Pixel data for ARGB_8888 bitmap format
struct Pixel {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
    unsigned char alpha;
};
Posted
Updated 19-Dec-14 11:21am
v4

1 solution

Your for loop is probably too fine-grained.

Try breaking the work up into chunks.

C++
void convolve_inner(Bitmap &src, Kernel &kernel, Bitmap &dst, int low, int high)
{
   // process one stripe or band of image data
   for(int y = low; y < high; ++y)
   {
      for(int x = 0; x < width; ++x)
      {
         kernel.response(src, x, y, dst);
      }
   }
}

void convolve(Bitmap &src, Kernel &kernel, Bitmap &dst)
{
   int chunk = (height + 7) / 8;   // split the rows into 8 bands
   #pragma omp parallel for
   for(int low = 0; low < height; low += chunk)
   {
      int high = low + chunk;
      if (high > height) high = height;
      convolve_inner(src, kernel, dst, low, high);
   }
}
 
Comments
BupeChombaDerrick 19-Dec-14 14:40pm    
Actually the bitmap data is just a 1D array of pixels, so the data is one big strip. How about the race conditions on the bitmaps? The "dst" bitmap in particular needs to be synchronized. I tried using "critical" within the Kernel implementations when writing to the bitmap, but it yielded no benefit. For example:

// Kernel for changing a bitmap to grayscale
class GrayScale : public Kernel {
public:
    void response(Bitmap &src, int x, int y, Bitmap &dst)
    {
        Pixel &sp = src.pData[x + y*src.width];
        Pixel &dp = dst.pData[x + y*dst.width];

        unsigned char gray = static_cast<unsigned char>(0.3*sp.red + 0.5*sp.green + 0.2*sp.blue);

        #pragma omp critical
        {
            dp.red = dp.green = dp.blue = gray;
        }
    }
};

where

// Pixel data for ARGB_8888 bitmap format
struct Pixel {
    unsigned char red;
    unsigned char green;
    unsigned char blue;
    unsigned char alpha;
};
[no name] 19-Dec-14 16:09pm    
You may want to update your question.

Is there a risk of the multiple response function threads reading or writing the same pixel? I did not see any in the sample response code above.

What are typical values of width and height?

You describe the data as one big strip. If the strips are relatively small (a few thousand pixels), there may not be much value in adding MP support. You'll spend more cycles on context switches than actual work.

If the images are relatively large - like a million pixels, working on chunks of 64 k should work well.
BupeChombaDerrick 19-Dec-14 16:55pm    
Typical values are width = 640, height = 480, so about 300K pixels. The Kernel interface takes two bitmaps as arguments: the one designated "dst" is mostly write-only, and "src" is read-only. Bitmap pixels in "dst" are expected to be modified by the threads, so shared arrays of data like this are hard to parallelise, and yet the problem is inherently parallel.
[no name] 22-Dec-14 12:45pm    
Did you try the "chunked" approach above?

If you assign each pixel row to a thread, you'll have 480 context switches for a 640x480 image.

If you operate on chunks (say 48 or 96 rows, for example), you'll only need 5 or 10 context switches. Unless you have a 50- or 100-core CPU, a chunked approach will be better.
BupeChombaDerrick 22-Dec-14 16:26pm    
I'm currently looking at http://staff.city.ac.uk/~sbbh653/publications/OpenMP_SPM.pdf; they seem to use the full image without a "chunked" approach. Only adding "#pragma omp parallel for" should have been sufficient. After adding "shared(src,dst)" it seems to work, but I don't know why.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
