Much thanks for the responses. I think I need to add a little extra information.
I am perhaps naive, but I really did hope for nearly a 16x speedup on my ancient server (Dell R900) with 16 processor cores. It's a Linux machine and my Windows program is running under Wine. The "top" command usually says that my program is getting 1580% to 1590% of the CPU out of a total of 1600%. There are certainly other processes running but they are taking very few cpu cycles. My program does no I/O, my threads do not interact (no queues, no locking, no waiting, etc.), and there is ample memory. (When I run on Windows, there is more "Windows junk" running in the background, but I can still usually get 350% percent or so of the machine for my program out of 400% possible with four processor cores.)
I really do not believe my program has a multi-threading design problem (but of course it's certainly possible it does). Therefore, I decided to try to create a complete dummy program which illustrates the problem and which is small enough to post. Here it is.
#include "stdafx.h"
#include <stdio.h>
#include <vector>
#include <algorithm>
#include <boost/timer/timer.hpp>
#include <boost/thread.hpp>
std::vector<int> numbers; void timing_test_2(void);
class Comp_modulus {
public:
int k;
bool operator()(const std::vector<int>::iterator &x, const std::vector<int>::iterator &y)
{
return ( ( (*x) % k ) < ( (*y) % k ) ); }
};
int main(void)
{
int N = 16; numbers.resize(10000000);
int i = 0;
for (std::vector<int>::iterator it = numbers.begin(); it != numbers.end(); it++)
(*it) = i++;
std::cout << "beginning timing test, number of threads = " << std::setw(2) << (int)N << std::endl;
boost::timer::auto_cpu_timer auto_t;
boost::thread_group tg;
for (int i = 0; i < N; i++)
{
tg.create_thread( boost::bind( timing_test_2) ); }
tg.join_all();
std::cout << "completed timing test" << std::endl;
return 0;
}
void timing_test_2(void) {
std::vector<std::vector<int>::iterator> numbers_p; numbers_p.resize( numbers.size() );
std::vector<int>::iterator it = numbers.begin();
for (std::vector<std::vector<int>::iterator>::iterator it_it = numbers_p.begin(); it_it != numbers_p.end(); it_it++)
(*it_it) = it++;
Comp_modulus comp_modulus;
comp_modulus.k = 2; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 3; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 5; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 7; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 11; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 13; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 17; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 19; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 23; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
comp_modulus.k = 29; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
}
As you can see, it's C++. I compile on Windows with Visual Studio 2010. Hence, I use boost for some timing and multi-threading functions. If you don't have boost installed and if you have a new enough C++ compiler, it would be straightforward to accomplish the timing and multi-threading with C++ standard library functions.
What my test program does looks pretty strange, but it actually mirrors quite well what my real program does. The test program first creates an array (std::vector) containing 10,000,000 integers in the range of 0 to 9,999,999. It then creates an artificial workload for the purposes of timing by sorting the integers by various prime moduli. For example, sorting by a modulus 2 sorts all the even numbers in front of all the odd numbers. Using prime numbers for the moduli makes sure the numbers get scrambled pretty thoroughly and that std::sort has plenty of work to do.
The numbers being sorted are stored only a single time and conceptually are sorted simultaneously by each thread. However, there is no locking between threads and the numbers aren't really moved around by the sort at all. Rather, this magic is achieved by each thread creating it's own private set of pointers to the integers (iterators in C++ standard library terms). Then it's the pointers that are sorted rather than the integers.
In the timing test program, the integers are 32 bit and the pointers are 64 bit, so each thread could just as well have its own copy of the integers and sort them directly instead of having pointers and sorting the program. In my real program, the integers become 32 byte permutations and it's much more memory efficient for the threads to work with 64 bit pointers (8 bytes) than to work with 32 byte permutations. It's also much faster to sort items that are 8 bytes long than to sort items that are 32 bytes long.
The application is computational group theory and the problem is to attempt to enumerate an optimal solution for every single one of the possible positions of Rubik's Cube. Finding an optimal solution for a position is much more computationally intense than is simply finding a solution for a position. There are about 4.3 x 10^19 Rubik's cube positions, so it's an extremely non-trivial problem. It's ultimate solution will undoubtedly require many hundred if not thousands or hundreds of thousands of machines working on the problem in a piece-wise fashion.
Here are the results of running the dummy timing program on my ancient server. The test program is so simple minded that I had to run it 16 different times, changing the number of threads each time. You can see how the wall clock and cpu times do not scale up as you might hope. For example, with one thread the wall clock time and the cpu time are both about 24 seconds. So one would hope would be that with two threads the wall clock time would be about 24 seconds and that the cpu time would be about 48 seconds. But the real figures are about 26.5 seconds and about 53 seconds, respectively. The problem only gets worse with more threads. The same thing happens on my Windows machine. And the same thing happens if you get rid of the threads and just run multiple instances of a one thread program.
Jerry Bryan
beginning timing test, number of threads = 1
completed timing test
24.109956s wall, 23.990000s user + 0.080000s system = 24.070000s CPU (99.8%)
beginning timing test, number of threads = 2
completed timing test
26.558628s wall, 52.820000s user + 0.210000s system = 53.030000s CPU (199.7%)
beginning timing test, number of threads = 3
completed timing test
27.632885s wall, 82.410000s user + 0.310000s system = 82.720000s CPU (299.4%)
beginning timing test, number of threads = 4
completed timing test
33.766357s wall, 120.150000s user + 0.390000s system = 120.540000s CPU (357.0%)
beginning timing test, number of threads = 5
completed timing test
34.820234s wall, 152.480000s user + 0.510000s system = 152.990000s CPU (439.4%)
beginning timing test, number of threads = 6
completed timing test
33.279489s wall, 183.450000s user + 0.680000s system = 184.130000s CPU (553.3%)
beginning timing test, number of threads = 7
completed timing test
35.896250s wall, 235.820000s user + 0.890000s system = 236.710000s CPU (659.4%)
beginning timing test, number of threads = 8
completed timing test
36.156149s wall, 275.370000s user + 1.170000s system = 276.540000s CPU (764.8%)
beginning timing test, number of threads = 9
completed timing test
38.529383s wall, 319.610000s user + 1.540000s system = 321.150000s CPU (833.5%)
beginning timing test, number of threads = 10
completed timing test
40.542099s wall, 367.840000s user + 1.970000s system = 369.810000s CPU (912.2%)
beginning timing test, number of threads = 11
completed timing test
45.129656s wall, 421.800000s user + 2.340000s system = 424.140000s CPU (939.8%)
beginning timing test, number of threads = 12
completed timing test
44.265782s wall, 485.820000s user + 2.650000s system = 488.470000s CPU (1103.5%)
beginning timing test, number of threads = 13
completed timing test
45.339959s wall, 545.310000s user + 3.190000s system = 548.500000s CPU (1209.7%)
beginning timing test, number of threads = 14
completed timing test
45.510356s wall, 587.020000s user + 3.660000s system = 590.680000s CPU (1297.9%)
beginning timing test, number of threads = 15
completed timing test
47.840413s wall, 651.430000s user + 4.310000s system = 655.740000s CPU (1370.7%)
beginning timing test, number of threads = 16
completed timing test
49.448533s wall, 731.640000s user + 4.400000s system = 736.040000s CPU (1488.5%)