Parallel.Invoke and Tasks slow. What am I doing wrong.

Question

5.00/5 (1 vote)

See more:

Okay, i am totally new to threading, I spent the last days going through online tutorials and decided that "parallel.invoke" or Tasks are the way to go for my problem.

The project I am working one is still an implementation of constructive solid geometry, and I thought that I could have massive performance gains by adding parallelism.
My algorithm has essentially these parts:
The A and B in the following text indicate from which 3D-Model the polygon came as CSG seeks to allow boolean operations on 3D Objects
- Determine lines of intersection from polygonA to polygonB and the other way round.
This creates a set of segments (for each polygon) in two list.

Here is some code of the sequential solution:

int facesB = modelB.Triangle.Count;
int facesA = modelA.Triangle.Count;

for (int i = 0; i < facesB; i++)
{
    for (int j = 0; j < facesA; j++)
    {
        DetermineIntersection(j, i);
    }
}

When intersecting two boxes this always takes around 30ms. I want it faster, so my thought was to split up the first for loop and using parallel.invoke.

int facesB = modelB.Triangle.Count;
int facesA = modelA.Triangle.Count;
int prlfB = (int)facesB / 2;
Parallel.invoke(()=>
{

  for (int i = 0; i < prlfB; i++)
  {
    for (int j = 0; j < facesA; j++)
    {
        DetermineIntersection(j, i);
    }
  }
}, ()=>
{
  for (int i = prlfB; i < facesB; i++)
  {
    for (int j = 0; j < facesA; j++)
    {
        DetermineIntersection(j, i);
    }
  }
}
);

I know that there is a certain overhead when working with threads, and when I split two boxes (a total of 24 polygons) I think I don't need to parallelize things.
The parallel.invoke was only slightly slower, so I thought if i raised the poly count (using Spheres with each 500 triangles instead of boxes (12 each)) I could really gain some speed. Nada. The sequential solution is nearly two times faster when working with higher polycounts.

Here is the (pseudo) code of DetermineIntersection

public void DetermineIntersection(int mApoly, int mBpoly)
  TriA = modelA.Triangles[mApoly];
  TriB = modelB.Triangles[mBpoly];
  // Determine intersection of the sides of Triangle B
  // against the plane of triangle A, and check if a valid intersectionSegment
  // can be returned
  for (int i=0; i<3; i++)
  {
    IntersectRayTrianglePlane(TriB.side(i), TriA, out intersectionSegment)
    if (success==true)
    {
       lock(listOfCutsAgainstA)
       {
         listOfCutsAgainstA.Add(intersectionSegment)
       }
    }
  }            
  // And now the same code where it checks if sides of TriangleA intersect with the plane of triangleB

- Next step is: Do some Triangulation to make sure that the previously found lines of intersection do not intersect any triangle anymore

Here again: A perfekt task for parallelizing, I do not even need to lock anything because they never write to the same resources.

But let's ignore those steps and focus on the one I mentioned.
It was somehow successful to use ParameterizedThreadStart, which was indeed faster, after a certain number of polygons was involved (as I expected it).
The tons of tutorials said that the new .Net 4.0 TPL is faster and cleaner and more flexible, so why do I get such strange results?

Same problem occurs with tasks.

When I use the Jordan Curve Theorem in a recursive algorithm to determine if a point is inside a 3D-Model, it takes not less than 10.000 triangles before there is any benefit to using Tasks instead of a normal, non threaded solution.

What am I doing wrong?

Any help appreciated. I hope I made everything as clear as possible.
Thanks in advance :rose:

Posted 17-Feb-11 1:16am

Jan Wolf

Updated 17-Feb-11 1:51am

v2

Add a Solution

1 solution

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Espen Harlinn · Accepted Answer · 2011-02-17T01:37:00

Solution 1

As I see it you the expected results - 30ms isn't all that long :)

This may partly explain why you are not seeing the expected perfomance gains:
http://www.profimatics.de/products/intime/_manuals/windows.nt.thread.latency.pdf[^] - while the article is quite old, this is still something we have to live with.

So parallel execution of really small tasks does not make sense ...

Update
It seems like implmenting your own class based on System.Threading.Tasks.TaskScheduler[^] can improve execution times, but the work required doesn't seem to justify the effort, compared to using ParameterizedThreadStart, unless you can reuse the implmentation.

Regards
Espen Harlinn

Posted 17-Feb-11 1:37am

Espen Harlinn

Updated 19-Feb-11 4:09am

v2

Comments

Jan Wolf 17-Feb-11 7:43am

Yes, you are right.
But raising the polycount makes the operation significantly slower.
The 30ms is only for two boxes, 24 triangles. That is not all that much.

When i use the parameterized threads and an increased polycount the threaded solution is indeed faster. Significantly if there are a few hundred triangles involved.

My first question is not how to speed this up, because I have a solution with the parameterized threads, but why parallel.invoke and tasks are so slow compared to ParameterizedThreadStart.
The second is: The parallel.invoke is slower, even when thousands of triangles are involved. The sequential solution takes 600ms, parallel.invoke a little more than 1000ms

But the tutorials I read suggested using the TPL, because it does not involve as much overhead, and replaces the old Threadpool thing, and was easier to use (I agree) while being faster.

Btw: 30ms is too long for this kind of operation, but the problem can be solved by optimizing some of my math routines. But thats not the question.
I will have a look at the paper you send me.
I am not expecting a doubled performance because I split this thing in two threads, but I suspect threading, no matter if "normal" Threads or the TPL thing, to be faster. But they seem to have a MASSIVE overhead, larger than "normal" Threads, which is contradictory to what I have read.

Espen Harlinn 17-Feb-11 8:40am

If you look at the API, there seems to be a number of provisions for various circumstances regarding the execution of tasks. That may explain why execution using parameterized threads works significantly better than parallel.invoke. There is also an overhead from the initialization of the threadpool. Try repeated execution of your code in a loop – you will, hopefully, see some improvement, but I’m not really sure. I would guess that parallel.invoke requires tasks to take a significant amount of time before the overhead becomes negligible. Remember, we are living in a world where a certain marketing department determines the truth – no matter what your test actually shows :) Play around with queries testing EF4 vs SQL against MS SQL Server, and you’ll bound to get a good laugh – of cause the MS SQL team is better at doing queries than the EF4 linq team …

Jan Wolf 18-Feb-11 2:01am

You might be right,
what confuses me especially is that the parallel.invoke solution only takes around 15ms longer when I am dealing with two boxes which is okay I guess, since like you said 30ms is not a long time. It makes no sense whatsoever, that when the only thing I increase is triangle count (the code in parallel.invoke stays the same, as posted above), this solution still takes longer, and the difference seems to increase almost linear.

I am going to accept your answer for now, still hoping someone knows how I could solve this, because parallel.invoke / tasks are more readable and elegant (visually). Having to have an own object just to pass parameters can get quite annoying, and also (especially in the "point in 3D object" algorithm) I have the need of return values, which is almost too simple when using tasks.

Thanks Espen

Sergey Alexandrovich Kryukov 17-Feb-11 19:22pm

This is very true and non-trivial, I like the answer, my 5.
--SA

Nish Nishant 19-Feb-11 10:10am

Voted 5!