Is there a way to efficiently parallelize nested for loops like:

C++
for (int i = 0; i < m; i++)
  for (int j = i + 1; j < n; j++)
    doSomethingWithIandJ(i, j);


in a way that each processing thread does approximately the same number of operations?
Posted
Updated 2-Dec-13 10:51am
v3
Comments
Sergey Alexandrovich Kryukov 2-Dec-13 16:22pm    
Please tag: language, platform. I can guess, but this place is not a guessing game... Yes, you are right about the possibility to use parallel calculations...
—SA
Manfred Rudolf Bihy 2-Dec-13 17:55pm    
Language and platform really don't matter that much here, as long as both support parallel processing. The question is whether the algorithm itself can be processed in a parallel fashion. If f(n, m) depended on f(n+x, m+y) with unknown limits on the offsets, possibly also non-linear, it might even be impossible to calculate said for loop in parallel. Long story short: to know if it can be done, we need the details of the algorithm you intend to implement.
Sergey Alexandrovich Kryukov 2-Dec-13 18:27pm    
Using or not using, say, the Parallel library makes a big difference, especially if you answer the question. As to the possibility of parallel calculation: all the offsets are shown, nothing non-linear. The only problem is a possible side effect of doSomethingWithIandJ. If there is no side effect (say, no modification of the original matrix indexed with i and j), parallel calculation is quite possible. It is not possible if such modifications take place, as they create a race condition.
—SA
Ali AslRousta 2-Dec-13 18:33pm    
Well, I need to state that the doSomethingWithIandJ function has no side effects, and yes, it can be done in parallel safely. To clarify the question, there are points I need to mention:
- The question is not about the feasibility of parallelism, but about the best way parallelism can be done.
- As can be seen, the function doSomethingWithIandJ is called for each pair (i, j) where j > i, i < m and j < n. We can assume m = n without loss of generality and conclude that doSomethingWithIandJ is called N = n(n-1)/2 times by these two nested for loops.
What I want to know is: if I use, say, 2 threads to do the same job, is there a way to split this code so that each thread calls doSomethingWithIandJ approximately (or exactly) N/2 times?

1 solution

Various parallel frameworks such as OpenMP offer ways to spread the workload automatically, without you having to control it yourself. Check your preferred framework's documentation if you're using one.
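For instance, with OpenMP a dynamic schedule on the outer loop spreads the unequal row lengths across threads: a thread that drew short rows (large i) simply picks up more of them. A minimal sketch, with a counting placeholder standing in for the real doSomethingWithIandJ:

```cpp
#include <atomic>

// Placeholder for the real work; here it only counts invocations
// so the loop's coverage can be checked.
std::atomic<long> g_calls{0};
void doSomethingWithIandJ(int i, int j) { g_calls.fetch_add(1); }

void processAll(int m, int n) {
    // schedule(dynamic) hands out outer-loop iterations on demand
    // instead of pre-assigning equal-sized blocks of i, which would
    // give the thread owning small i far more inner iterations.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < n; j++)
            doSomethingWithIandJ(i, j);
}
```

(Compile with -fopenmp or your compiler's equivalent; without it the pragma is ignored and the loop simply runs sequentially.)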

Without a framework, a good pattern to use is the master/slave or worker pattern: one "master" thread is responsible for creating "work items", and "slave" or "worker" threads pick them up and process them. You can have the master delegate the work items to specific slaves, but it's often more efficient to use autonomous slaves that pick up the next work item from a pool whenever they're idle.

In your case, each work item would consist of just one pair of ints, and possibly a reference to some data structure containing the results of the processing. The master would just pump out (i, j) work items (put them into the work pool), and the slaves would pick them up as fast as they can process each one.

Of course you need to manage the threads and implement some way to notify the master when all work items have been processed.
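A minimal sketch of this pattern in standard C++, assuming doSomethingWithIandJ has no side effects (here replaced by a counting placeholder): the "pool" is just an atomic index into the flattened sequence of pairs, each worker claims the next item for itself, and join() serves as the "all done" notification.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Placeholder work function: counts calls so the pattern can be verified.
std::atomic<long> g_done{0};
void doSomethingWithIandJ(int i, int j) { g_done.fetch_add(1); }

// Workers pull (i, j) pairs from a shared pool until it is exhausted.
void processWithWorkers(int n, int numWorkers) {
    const long total = (long)n * (n - 1) / 2;  // number of pairs with j > i
    std::atomic<long> next{0};                 // the "work pool"
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                long k = next.fetch_add(1);    // claim the next work item
                if (k >= total) return;        // pool empty: worker exits
                // Decode linear index k into the k-th pair (i, j), j > i:
                // row i contains n - 1 - i pairs.
                int i = 0;
                long row = n - 1;
                while (k >= row) { k -= row; ++i; --row; }
                doSomethingWithIandJ(i, (int)(i + 1 + k));
            }
        });
    }
    for (auto& w : workers) w.join();  // join doubles as "all items processed"
}
```

Here the master does not even pump items explicitly; because the work items are fully determined by their index, publishing an atomic counter is enough.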

Please note that parallelization creates overhead: if the doSomething() function is trivial, it may be prudent not to parallelize at all, or to parallelize only the outer loop. Be sure to run performance tests and verify that you actually gain performance!

You can search for more info with the keywords "master/slave pattern" or "worker threads" or "work pool".
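For the exact split asked about in the comments (each thread making the same number of calls, give or take one), you don't need a pool at all: flatten the triangle of N = n(n-1)/2 pairs into a single index range and hand each thread a contiguous slice. A sketch, again with a counting placeholder for the real work function, assuming it has no side effects:

```cpp
#include <atomic>
#include <thread>
#include <utility>
#include <vector>

std::atomic<long> g_calls{0};
void doSomethingWithIandJ(int i, int j) { g_calls.fetch_add(1); }

// The k-th pair (i, j) with j > i, counting row by row:
// row i holds n - 1 - i pairs.
std::pair<int, int> pairFromIndex(long k, int n) {
    int i = 0;
    long row = n - 1;
    while (k >= row) { k -= row; ++i; --row; }
    return {i, (int)(i + 1 + k)};
}

// Give each thread a contiguous slice of [0, n(n-1)/2); the slice
// sizes differ by at most one, so call counts are as equal as possible.
void processEvenSplit(int n, int numThreads) {
    const long total = (long)n * (n - 1) / 2;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        long begin = total * t / numThreads;
        long end = total * (t + 1) / numThreads;
        threads.emplace_back([=] {
            if (begin >= end) return;          // empty slice
            auto [i, j] = pairFromIndex(begin, n);
            for (long k = begin; k < end; ++k) {
                doSomethingWithIandJ(i, j);
                if (++j == n) { ++i; j = i + 1; }  // step to the next pair
            }
        });
    }
    for (auto& th : threads) th.join();
}
```

This static split is ideal when each pair costs roughly the same; when per-pair cost varies, the self-service pool above balances better.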


P.S.:
One possible improvement on the master/slave pattern is to use any context information you have to calculate the "optimal" amount of work a slave should pick up, and then have each slave pick up p > 1 work items rather than one at a time. You'll probably need a number of performance tests before you can gauge what is optimal, or whether you can come up with a good formula dependent on n and m alone; but if the time to process doSomething for one pair (i, j) is roughly constant, this should be possible.

If you can't determine a good work amount a priori, you can also monitor performance at runtime and adapt the number of work items per worker dynamically. This can be much more scalable.

Better yet, you can use either of the above methods to increase the workload of each work item from just one (i, j) pair to multiple pairs. Doing so greatly decreases the amount of communication and synchronization required to manage the threads.
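A chunked variant of the self-service pool can be sketched as follows (counting placeholder again standing in for the real work; the chunk size is a tuning knob, not a prescribed value): each claim grabs `chunk` pairs at once, cutting contention on the shared counter by that factor.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> g_calls{0};
void doSomethingWithIandJ(int i, int j) { g_calls.fetch_add(1); }

// Self-service pool where each claim covers `chunk` pairs, not one.
void processChunked(int n, int numWorkers, long chunk) {
    const long total = (long)n * (n - 1) / 2;
    std::atomic<long> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                long k0 = next.fetch_add(chunk);        // claim a whole chunk
                if (k0 >= total) return;
                long k1 = std::min(k0 + chunk, total);  // last chunk may be short
                for (long k = k0; k < k1; ++k) {
                    // Decode linear index k into the pair (i, j), j > i.
                    int i = 0;
                    long row = n - 1, r = k;
                    while (r >= row) { r -= row; ++i; --row; }
                    doSomethingWithIandJ(i, (int)(i + 1 + r));
                }
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Larger chunks mean less synchronization but coarser load balancing at the tail end of the run; performance tests, as noted above, are the way to pick the value.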

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
