Is there a way to efficiently parallelize nested for loops like:

C++
for (int i = 0; i < m; i++)
  for (int j = i + 1; j < n; j++)
    doSomethingWithIandJ(i, j);


in a way that each processing thread does approximately the same number of operations?
Posted
Updated 2-Dec-13 10:51am
v3
Comments
Sergey Alexandrovich Kryukov 2-Dec-13 16:22pm    
Please tag: language, platform. I can guess, but this place is not a guessing game... Yes, you are right about the possibility to use parallel calculations...
—SA
Manfred Rudolf Bihy 2-Dec-13 17:55pm    
Language and platform really don't matter that much here, as long as both support parallel processing. The question is whether the algorithm itself can be processed in a parallel fashion. If f(n, m) depended on f(n+x, m+y) with unknown limits on the offsets, possibly also non-linear, it might even be impossible to calculate said for loop in parallel. Long story short: to know if it can be done, we need the details of the algorithm you intend to implement.
Sergey Alexandrovich Kryukov 2-Dec-13 18:27pm    
Using or not using, say, the Parallel library makes a big difference, especially if you answer the question. As to the possibility of parallel calculation: all the offsets are shown, nothing non-linear. The only problem is a possible side effect of doSomethingWithIandJ. If there is no side effect (say, no modification of the original matrix indexed with i and j), parallel calculation is quite possible. It is not possible if such modifications take place, as they create a race condition.
—SA
Ali AslRousta 2-Dec-13 18:33pm    
Well, I need to state that the doSomethingWithIandJ function has no side effects, and yes, it can be done in parallel safely. To clarify the question, there are points I need to mention:
- The question is not about the feasibility of parallelism, but about the best way parallelism can be done.
- As can be seen, the function doSomethingWithIandJ is called for each pair (i, j) where j > i, i < m and j < n. We can assume m = n without loss of generality and conclude that doSomethingWithIandJ is called N = n(n-1)/2 times by these two nested for loops.
What I want to know is: if I use, say, 2 threads to do the same job, is there a way to split this code so that each thread calls doSomethingWithIandJ approximately (or exactly) N/2 times?

1 solution

Various parallel frameworks such as OpenMP offer ways to spread the workload automatically, without you having to control it yourself. Check your preferred framework's documentation if you're using one.
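For instance, with OpenMP a dynamic schedule on the outer loop spreads the unequal row lengths across threads: a thread that drew short rows (large i) simply picks up more of them. A minimal sketch, with a counting placeholder standing in for the real doSomethingWithIandJ:

```cpp
#include <atomic>

// Placeholder for the real work; here it only counts invocations
// so the loop's coverage can be checked.
std::atomic<long> g_calls{0};
void doSomethingWithIandJ(int i, int j) { g_calls.fetch_add(1); }

void processAll(int m, int n) {
    // schedule(dynamic) hands out outer-loop iterations on demand
    // instead of pre-assigning equal-sized blocks of i, which would
    // give the thread owning small i far more inner iterations.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < n; j++)
            doSomethingWithIandJ(i, j);
}
```

(Compile with -fopenmp or your compiler's equivalent; without it the pragma is ignored and the loop simply runs sequentially.)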

Without a framework, a good pattern to use is the master/slave or worker pattern: one "master" thread is responsible for creating "work items", and "slave" or "worker" threads pick them up and process them. You can have the master delegate the work items to specific slaves, but it's often more efficient to use autonomous slaves that pick up the next work item from a pool whenever they're idle.

In your case, each work item would consist of just one pair of ints, and possibly a reference to some data structure containing the results of the processing. The master would just pump out (i, j) work items (put them into the work pool), and the slaves would pick them up as fast as they can process each one.

Of course you need to manage the threads and implement some way to notify the master when all work items have been processed.
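A minimal sketch of this pattern in standard C++, assuming doSomethingWithIandJ has no side effects (here replaced by a counting placeholder): the "pool" is just an atomic index into the flattened sequence of pairs, each worker claims the next item for itself, and join() serves as the "all done" notification.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Placeholder work function: counts calls so the pattern can be verified.
std::atomic<long> g_done{0};
void doSomethingWithIandJ(int i, int j) { g_done.fetch_add(1); }

// Workers pull (i, j) pairs from a shared pool until it is exhausted.
void processWithWorkers(int n, int numWorkers) {
    const long total = (long)n * (n - 1) / 2;  // number of pairs with j > i
    std::atomic<long> next{0};                 // the "work pool"
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                long k = next.fetch_add(1);    // claim the next work item
                if (k >= total) return;        // pool empty: worker exits
                // Decode linear index k into the k-th pair (i, j), j > i:
                // row i contains n - 1 - i pairs.
                int i = 0;
                long row = n - 1;
                while (k >= row) { k -= row; ++i; --row; }
                doSomethingWithIandJ(i, (int)(i + 1 + k));
            }
        });
    }
    for (auto& w : workers) w.join();  // join doubles as "all items processed"
}
```

Here the master does not even pump items explicitly; because the work items are fully determined by their index, publishing an atomic counter is enough.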

Please note that parallelization creates overhead: if the doSomething() function is trivial, it may be prudent not to parallelize at all, or to parallelize only the outer loop. Be sure to run performance tests and verify that you actually gain performance!

You can search for more info with the keywords "master/slave pattern" or "worker threads" or "work pool".
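For the exact split asked about in the comments (each thread making the same number of calls, give or take one), you don't need a pool at all: flatten the triangle of N = n(n-1)/2 pairs into a single index range and hand each thread a contiguous slice. A sketch, again with a counting placeholder for the real work function, assuming it has no side effects:

```cpp
#include <atomic>
#include <thread>
#include <utility>
#include <vector>

std::atomic<long> g_calls{0};
void doSomethingWithIandJ(int i, int j) { g_calls.fetch_add(1); }

// The k-th pair (i, j) with j > i, counting row by row:
// row i holds n - 1 - i pairs.
std::pair<int, int> pairFromIndex(long k, int n) {
    int i = 0;
    long row = n - 1;
    while (k >= row) { k -= row; ++i; --row; }
    return {i, (int)(i + 1 + k)};
}

// Give each thread a contiguous slice of [0, n(n-1)/2); the slice
// sizes differ by at most one, so call counts are as equal as possible.
void processEvenSplit(int n, int numThreads) {
    const long total = (long)n * (n - 1) / 2;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        long begin = total * t / numThreads;
        long end = total * (t + 1) / numThreads;
        threads.emplace_back([=] {
            if (begin >= end) return;          // empty slice
            auto [i, j] = pairFromIndex(begin, n);
            for (long k = begin; k < end; ++k) {
                doSomethingWithIandJ(i, j);
                if (++j == n) { ++i; j = i + 1; }  // step to the next pair
            }
        });
    }
    for (auto& th : threads) th.join();
}
```

This static split is ideal when each pair costs roughly the same; when per-pair cost varies, the self-service pool above balances better.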


P.S.:
One possible improvement on the master/slave pattern is to use any context information you have to calculate the "optimal" amount of work a slave should pick up, and then have each slave pick up p > 1 work items rather than one at a time. You'll probably need a number of performance tests before you can gauge what is optimal, or whether you can come up with a good formula dependent on n and m alone; but if the time to process doSomething for one pair (i, j) is roughly constant, this should be possible.

If you can't determine a good work amount a priori, you can also monitor performance at runtime and adapt the number of work items per worker dynamically. This can be much more scalable.

Better yet, you can use either of the above methods to increase the workload of each work item from just one (i, j) pair to multiple pairs. Doing so greatly decreases the amount of communication and synchronization required to manage the threads.
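A chunked variant of the self-service pool can be sketched as follows (counting placeholder again standing in for the real work; the chunk size is a tuning knob, not a prescribed value): each claim grabs `chunk` pairs at once, cutting contention on the shared counter by that factor.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> g_calls{0};
void doSomethingWithIandJ(int i, int j) { g_calls.fetch_add(1); }

// Self-service pool where each claim covers `chunk` pairs, not one.
void processChunked(int n, int numWorkers, long chunk) {
    const long total = (long)n * (n - 1) / 2;
    std::atomic<long> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                long k0 = next.fetch_add(chunk);        // claim a whole chunk
                if (k0 >= total) return;
                long k1 = std::min(k0 + chunk, total);  // last chunk may be short
                for (long k = k0; k < k1; ++k) {
                    // Decode linear index k into the pair (i, j), j > i.
                    int i = 0;
                    long row = n - 1, r = k;
                    while (r >= row) { r -= row; ++i; --row; }
                    doSomethingWithIandJ(i, (int)(i + 1 + r));
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Larger chunks mean less synchronization but coarser load balancing at the tail end of the run; performance tests, as noted above, are the way to pick the value.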

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
