Multithreading slowdown, CPU bound app, probably hardware related

Question

0.00/5 (No votes)

See more:

The exact details of my multi-theaded application probably don't matter. It's completely CPU bound (runs for hours or days at a time), does no I/O, there is ample memory for all the threads, and the threads barely communicate with each other (for all practical purposes they don't communicate with each other). There are many sub-problems to be solved by the computation, each of which runs for several hours, and the only purpose of multi-threading is to keep all the processor cores busy all the time. The threads do share some large blocks of memory, but the shared memory is only read and is never written into. Except for the sharing of the large block of memory which is only read, the application could just as easily be a separate "job" or "task" for each processor core.

The problem is that the CPU time to solve each sub-problem goes up with the multi-threading. For example, suppose there is a small test problem that requires 250 CPU seconds to run as a single thread (and also 250 wall clock seconds). Running two such sub-problems at the same time as two threads might take 270 CPU seconds for each thread for a total 540 CPU seconds and 270 wall clock seconds. In a perfect world where everything "just works", the test of two sub-problems at the same time would require 250 CPU seconds for each thread for a total of 500 CPU seconds and 250 wall clock seconds.

Increasing the number of simultaneous sub-problems and threads up to the number of processor cores exacerbates the apparent slowdown. It's always faster overall to add threads up to the number of processor cores, but the use of n processor cores overall do not produce an n times speedup. For example, with a 16 processor core example, using 16 threads only produces about 8 or 9 times as much computation.

There seems obviously to be some sort of hardware cause such as contention on the memory bus or contention between the multiple processor cores on the same processor chip, etc. Is this a known problem? Is there anything I can do about it?

What I have tried:

One possible problem is that I run some of the speed tests on a high end laptop with a single quad-core processor. Running all the processor cores at 100% generates heat which causes laptop power management to intervene and slow down the processor clock speeds because slower clock speeds generate less heat. I have tweaked power management settings to no avail. I also have an ancient server with 64GB memory and four quad-core processors, for a total of 16 cores. Running speed tests on this machine produce results that are essentially equivalent to the results on the laptop. So laptop heat and power management may be an issue, but if so it does not seem to be the main issue. Depending on the exact size of the speed test that I run, it can be slightly faster to run 16 separate jobs with one thread each or it can be slightly faster to run 1 big job with sixteen threads. The contention therefore does not seem to be in my application. Rather, it seems to be in the hardware. I run under Windows 10 on the laptop and under Wine under Linux on the server.

Posted 15-May-18 8:49am

Member 8464407

Updated 16-May-18 22:20pm

Add a Solution

Comments

Kornfeld Eliyahu Peter 15-May-18 14:59pm

1. Thread management
2. Wait time in the queue
3. Lock between thread because wrong lock of memory
4. Initial memory size id too small so need reallocation
5. Exception handling block
6. You try to handle threads on your own without pool

Just thinking aloud...
Are we talking C++?

Peter_in_2780 15-May-18 19:10pm

It's almost certainly OS not hardware. Even though *you* have no more threads than cores, there is other background activity that pops up and invokes the OS thread scheduler. What you should look at is Processor Affinity (start with Google), where you can pin threads to cores and save the OS having to think about it. (I spent years tuning multiprocessor systems in the dim distant past.)

Dave Kreskowiak 15-May-18 19:45pm

You do realize that your threads are sharing the processor with an O/S that also has it's own threads? A typical Windows installation has about 800+ threads running, 200 of which are in the kernel itself. My main machine has about 1300 threads running after boot. A single Chrome browser tab is 18 threads.

I know most threads are sitting idle, but the ones that do run, however briefly, add up quick.

And you're wondering why you're not getting a linear boost in speed for your 16 threads or whatever you're using?

Patrice T 15-May-18 21:01pm

What kind of problem are you solving with those threads ?

2 solutions

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Member 8464407 · Answer 1 · 2018-05-16T17:51:00

Much thanks for the responses. I think I need to add a little extra information.

I am perhaps naive, but I really did hope for nearly a 16x speedup on my ancient server (Dell R900) with 16 processor cores. It's a Linux machine and my Windows program is running under Wine. The "top" command usually says that my program is getting 1580% to 1590% of the CPU out of a total of 1600%. There are certainly other processes running but they are taking very few cpu cycles. My program does no I/O, my threads do not interact (no queues, no locking, no waiting, etc.), and there is ample memory. (When I run on Windows, there is more "Windows junk" running in the background, but I can still usually get 350% percent or so of the machine for my program out of 400% possible with four processor cores.)

I really do not believe my program has a multi-threading design problem (but of course it's certainly possible it does). Therefore, I decided to try to create a complete dummy program which illustrates the problem and which is small enough to post. Here it is.

C++

// timing_test.cpp : dummy program to test the timing of multiple CPU bound threads.
//

#include "stdafx.h"
#include <stdio.h>
#include <vector>
#include <algorithm>
#include <boost/timer/timer.hpp>
#include <boost/thread.hpp>

	std::vector<int> numbers;	// vector to contain 10,000,000 numbers to be sorted as dummy work.
	void timing_test_2(void);	// function prototype for thread worker

		class Comp_modulus	// functor for comparing intergers mod k
	{
		public:

		int k;	// k value to calculate modulus

		bool operator()(const std::vector<int>::iterator &x, const std::vector<int>::iterator &y)
		{
			return ( ( (*x) % k ) < ( (*y) % k )  );	// compare the integers by their modulus rather than by their
														// .... values to sort in strange ways. It is iterators that
														// .... are being sorted, not the integers themselves.
		}
	};

int main(void)
{
	int N = 16;					// number of threads to create
	numbers.resize(10000000);	// storage for 10,000,000 numbers

	int i = 0;
	for (std::vector<int>::iterator it = numbers.begin(); it != numbers.end(); it++)
		(*it) = i++; // fill in the numbers 0 to 9,999,999 which will be sorted in a variety of strange ways.

	std::cout << "beginning timing test, number of threads = " << std::setw(2) << (int)N << std::endl;	

	boost::timer::auto_cpu_timer auto_t;	// Quick and dirty overall timer. It prints automatically
											// .... when it is destructed at the end of the program.

	boost::thread_group tg;									// Define a thread group in preparation ....
															// .... for creating the threads	

	for (int i =  0; i < N; i++)
	{
		tg.create_thread( boost::bind( timing_test_2) );	// Create the N threads
	}

	tg.join_all();											// Wait until all the threads are done.

	std::cout << "completed timing test" << std::endl;

	return 0;
}

void timing_test_2(void)		// this is the thread that creates the artificial load to simulate work.
{
	std::vector<std::vector<int>::iterator> numbers_p;	// storage for iterators (pointers) which point to the array of integers
	numbers_p.resize( numbers.size() );					// allocate the correct amount of storage, same number of iterators
														// ... as integers to which they point.

	std::vector<int>::iterator it = numbers.begin();	// point to the beginning of the integer array.

	for (std::vector<std::vector<int>::iterator>::iterator it_it = numbers_p.begin(); it_it != numbers_p.end(); it_it++)
		(*it_it) = it++;	// populate the iterator vector with iterators pointing the the integers

	Comp_modulus comp_modulus;	// instantiate the functor which does strange compares

	// create the dummy work

	comp_modulus.k =  2; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k =  3; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k =  5; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k =  7; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 11; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 13; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 17; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 19; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 23; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 29; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );

}

As you can see, it's C++. I compile on Windows with Visual Studio 2010. Hence, I use boost for some timing and multi-threading functions. If you don't have boost installed and if you have a new enough C++ compiler, it would be straightforward to accomplish the timing and multi-threading with C++ standard library functions.

What my test program does looks pretty strange, but it actually mirrors quite well what my real program does. The test program first creates an array (std::vector) containing 10,000,000 integers in the range of 0 to 9,999,999. It then creates an artificial workload for the purposes of timing by sorting the integers by various prime moduli. For example, sorting by a modulus 2 sorts all the even numbers in front of all the odd numbers. Using prime numbers for the moduli makes sure the numbers get scrambled pretty thoroughly and that std::sort has plenty of work to do.

The numbers being sorted are stored only a single time and conceptually are sorted simultaneously by each thread. However, there is no locking between threads and the numbers aren't really moved around by the sort at all. Rather, this magic is achieved by each thread creating it's own private set of pointers to the integers (iterators in C++ standard library terms). Then it's the pointers that are sorted rather than the integers.

In the timing test program, the integers are 32 bit and the pointers are 64 bit, so each thread could just as well have its own copy of the integers and sort them directly instead of having pointers and sorting the program. In my real program, the integers become 32 byte permutations and it's much more memory efficient for the threads to work with 64 bit pointers (8 bytes) than to work with 32 byte permutations. It's also much faster to sort items that are 8 bytes long than to sort items that are 32 bytes long.

The application is computational group theory and the problem is to attempt to enumerate an optimal solution for every single one of the possible positions of Rubik's Cube. Finding an optimal solution for a position is much more computationally intense than is simply finding a solution for a position. There are about 4.3 x 10^19 Rubik's cube positions, so it's an extremely non-trivial problem. It's ultimate solution will undoubtedly require many hundred if not thousands or hundreds of thousands of machines working on the problem in a piece-wise fashion.

Here are the results of running the dummy timing program on my ancient server. The test program is so simple minded that I had to run it 16 different times, changing the number of threads each time. You can see how the wall clock and cpu times do not scale up as you might hope. For example, with one thread the wall clock time and the cpu time are both about 24 seconds. So one would hope would be that with two threads the wall clock time would be about 24 seconds and that the cpu time would be about 48 seconds. But the real figures are about 26.5 seconds and about 53 seconds, respectively. The problem only gets worse with more threads. The same thing happens on my Windows machine. And the same thing happens if you get rid of the threads and just run multiple instances of a one thread program.

Jerry Bryan

beginning timing test, number of threads =  1
completed timing test
 24.109956s wall, 23.990000s user + 0.080000s system = 24.070000s CPU (99.8%)

beginning timing test, number of threads =  2
completed timing test
 26.558628s wall, 52.820000s user + 0.210000s system = 53.030000s CPU (199.7%)

beginning timing test, number of threads =  3
completed timing test
 27.632885s wall, 82.410000s user + 0.310000s system = 82.720000s CPU (299.4%)

beginning timing test, number of threads =  4
completed timing test
 33.766357s wall, 120.150000s user + 0.390000s system = 120.540000s CPU (357.0%)

beginning timing test, number of threads =  5
completed timing test
 34.820234s wall, 152.480000s user + 0.510000s system = 152.990000s CPU (439.4%)

beginning timing test, number of threads =  6
completed timing test
 33.279489s wall, 183.450000s user + 0.680000s system = 184.130000s CPU (553.3%)

beginning timing test, number of threads =  7
completed timing test
 35.896250s wall, 235.820000s user + 0.890000s system = 236.710000s CPU (659.4%)

beginning timing test, number of threads =  8
completed timing test
 36.156149s wall, 275.370000s user + 1.170000s system = 276.540000s CPU (764.8%)

beginning timing test, number of threads =  9
completed timing test
 38.529383s wall, 319.610000s user + 1.540000s system = 321.150000s CPU (833.5%)

beginning timing test, number of threads = 10
completed timing test
 40.542099s wall, 367.840000s user + 1.970000s system = 369.810000s CPU (912.2%)

beginning timing test, number of threads = 11
completed timing test
 45.129656s wall, 421.800000s user + 2.340000s system = 424.140000s CPU (939.8%)

beginning timing test, number of threads = 12
completed timing test
 44.265782s wall, 485.820000s user + 2.650000s system = 488.470000s CPU (1103.5%)

beginning timing test, number of threads = 13
completed timing test
 45.339959s wall, 545.310000s user + 3.190000s system = 548.500000s CPU (1209.7%)

beginning timing test, number of threads = 14
completed timing test
 45.510356s wall, 587.020000s user + 3.660000s system = 590.680000s CPU (1297.9%)

beginning timing test, number of threads = 15
completed timing test
 47.840413s wall, 651.430000s user + 4.310000s system = 655.740000s CPU (1370.7%)

beginning timing test, number of threads = 16
completed timing test
 49.448533s wall, 731.640000s user + 4.400000s system = 736.040000s CPU (1488.5%)

Patrice T · Answer 2 · 2018-05-16T22:21:00

Solution 2

Quote:
I am perhaps naive, but I really did hope for nearly a 16x speedup on my ancient server (Dell R900) with 16 processor cores.

the speedup can only apply when the task is split between the threads, each doing only a part of the job in parallel.
As far as I can see, in your code, each thread is doing the full job.

Quote:
I really do not believe my program has a multi-threading design problem (but of course it's certainly possible it does). Therefore, I decided to try to create a complete dummy program which illustrates the problem and which is small enough to post. Here it is.

In your program, each thread is sorting a 10,000,000 array 10 times. This kind of activity is trashing the cpu cache. So the runtime is doe to continuous cache miss, thus preventing the cpu from taking advantage of cores.

By the way, the thread do not share memory, each thread works on its own array of 10,000,000, and is sorting it 10 times.

Posted 16-May-18 22:21pm

Patrice T

Updated 17-May-18 2:22am

v2

Comments

Member 8464407 17-May-18 23:57pm

My apologies for "answering" my question instead of "improving my question". I've been writing assembly language and multithreading code since the 1960's, but I'm new to this particular forum. (I write in C++ now instead of assembly.)

I think I'll have to accept the cpu cache being trashed as the most likely culprit for lack of a better theory. I can certainly see that my app is making such random accesses to memory that that the cpu cache is getting pretty badly trashed. It's still hard to see why the extra threads seem to aggravate the problem so much, especially on the 16 core machine if I'm only testing with 3 or 4 threads and if the other cores are completely idle. So that' the real question - if cache trashing is the problem, why do the extra threads make the cache trashing so much worse?

My real app really is quite parallel. Each thread is solving a completely different sub-problem. You don't have to have queues and locks and that sort of thing to have true parallel processing. Indeed, my early versions of this program did have lots of queues and locks and signals and other such inter-thread communication. It was really slowing the program down so I redesigned it to get rid of all that stuff and it really speeded up. All I had to do was tell each thread what sub-problem to work on, wait a few hours while the thread worked on it's designated sub-problem (this is a really large problem), and then get the results back.

I could have done different work in each thread of the dummy program for doing the timing test instead of making each thread do the same work. But for the timing test it didn't really matter what the threads did as long as they chewed up cycles and as long as they wildly accessed all over a large array like my real app does.

Both my real app and the timing test do have a bunch of shared memory, and each thread of the real app and each thread of the timing test hits the shared memory pretty hard. Each thread does have its own local set of pointers to the shared memory and it's the pointers that get sorted. But if you look at the compare functor that's used with std::sort, it dereferences the pointers (iterators) and does its compare against the real data in the shared memory. The shared memory is read only so it's thread safe, but it does get used and it does get used very, very heavily. That has been a concern for me with this design. The trouble is that with the real problem, the shared memory is far and away the biggest consumer of memory in the whole program and I therefore really need it to be shared.

On the idea of affinity, I did try that suggestion. I know how with Windows to set affinity for separate instances of my program, but I can't figure out how with Windows to set affinity for individual threads. So I tested with setting affinity for multiple instances of my program where each instance was single threaded, and it seemed to make no difference. Extra instances increase the CPU time for each instance. I suspect that even with affinity, Windows could still be dispatching other threads from other Windows processes to "my" designated cores since those other threads from other Windows processes do not have affinity set to any particular core. It seems like what I really need is a way to tell Windows not to use "my" cores for anybody but me.

Patrice T 18-May-18 2:36am

My apologies for "answering" my question instead of "improving my question".
No problem for me, everyone was a newbe once.
"if cache trashing is the problem, why do the extra threads make the cache trashing so much worse?"
more threads is like reducing the size of cache of each thread.

For testing purpose, replace the sorting with just filling the array with a value and see if you gain from multicores.