Comments by Member 8464407

Member 8464407 17-May-18 23:57pm
My apologies for "answering" my question instead of "improving my question". I've been writing assembly language and multithreaded code since the 1960s, but I'm new to this particular forum. (I write in C++ now instead of assembly.)

I think I'll have to accept the CPU cache being thrashed as the most likely culprit, for lack of a better theory. I can certainly see that my app makes such random accesses to memory that the CPU cache is getting pretty badly thrashed. It's still hard to see why the extra threads aggravate the problem so much, especially on the 16-core machine when I'm only testing with 3 or 4 threads and the other cores are completely idle. So that's the real question: if cache thrashing is the problem, why do the extra threads make the thrashing so much worse?

My real app really is quite parallel. Each thread is solving a completely different sub-problem. You don't have to have queues and locks and that sort of thing to have true parallel processing. Indeed, my early versions of this program did have lots of queues, locks, signals, and other such inter-thread communication. It was really slowing the program down, so I redesigned it to get rid of all that, and it sped up considerably. All I had to do was tell each thread what sub-problem to work on, wait a few hours while the thread worked on its designated sub-problem (this is a really large problem), and then get the results back.
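
To make that concrete, here's roughly the shape of the design (just a sketch; Result and solve_subproblem are stand-ins for my real types and code):

#include <thread>
#include <vector>

struct Result { long value = 0; };            // stand-in for my real per-sub-problem output

Result solve_subproblem(int id)               // stand-in: the real one runs for hours
{
    return Result{ id };                      // no locks, no queues, no shared writes
}

int main()
{
    const int kThreads = 4;
    std::vector<Result> results(kThreads);    // one slot per thread, no contention
    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; ++i)
        workers.emplace_back([i, &results] { results[i] = solve_subproblem(i); });
    for (auto& t : workers)
        t.join();                             // the only synchronization point
}

Each thread only ever writes its own results slot, so the join at the end is the only synchronization in the whole run.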

For the dummy program I use for the timing test, I could have given each thread different work instead of making every thread do the same work. But for a timing test it doesn't really matter what the threads do, as long as they chew up cycles and access wildly all over a large array the way my real app does.
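
In case it helps to see it, the dummy test is essentially this (a sketch, not my exact code; the array size, thread count, and iteration count are made up):

#include <cstdint>
#include <thread>
#include <vector>

int main()
{
    const std::size_t kSize = std::size_t(1) << 26;   // big enough to dwarf the caches
    std::vector<std::uint32_t> big(kSize, 1);
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&big] {
            std::uint64_t sum = 0, idx = 12345;
            for (std::uint64_t i = 0; i < 100000000; ++i) {
                idx = idx * 6364136223846793005ULL + 1442695040888963407ULL; // cheap LCG
                sum += big[idx % big.size()];         // effectively random access, no locality
            }
            volatile std::uint64_t sink = sum;        // keep the loop from being optimized away
            (void)sink;
        });
    for (auto& t : workers)
        t.join();
}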

Both my real app and the timing test do have a bunch of shared memory, and each thread in both programs hits that shared memory pretty hard. Each thread has its own local set of pointers into the shared memory, and it's the pointers that get sorted. But if you look at the compare functor used with std::sort, it dereferences the pointers (iterators) and does its comparison against the real data in the shared memory. The shared memory is read-only, so it's thread safe, but it is used very, very heavily. That has been a concern for me with this design. The trouble is that in the real problem, the shared memory is far and away the biggest consumer of memory in the whole program, so I really do need it to be shared.
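
Here's the pattern in miniature (a sketch; Record is a stand-in for my real data type):

#include <algorithm>
#include <vector>

struct Record { int key; /* ... large payload ... */ };

int main()
{
    std::vector<Record> shared = { {3}, {1}, {2} };   // read-only after setup
    std::vector<const Record*> view = { &shared[0], &shared[1], &shared[2] };

    // Each thread would sort its own "view"; the compare functor
    // dereferences into the shared table on every comparison.
    std::sort(view.begin(), view.end(),
              [](const Record* a, const Record* b) { return a->key < b->key; });
}

Only the pointers move; the Records themselves never do, which is what keeps it thread safe. But it also means every comparison is a potentially cold memory access into that big shared table.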

On the idea of affinity, I did try that suggestion. I know how to set affinity in Windows for separate instances of my program, but I can't figure out how to set affinity for individual threads. So I tested by setting affinity for multiple instances of my program, each instance single-threaded, and it seemed to make no difference: extra instances still increase the CPU time of each instance. I suspect that even with affinity, Windows can still dispatch threads from other Windows processes onto "my" designated cores, since those threads don't have affinity set to any particular core. It seems like what I really need is a way to tell Windows not to use "my" cores for anybody but me.
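
If SetThreadAffinityMask is the right call for per-thread affinity, a sketch of what I'd try looks like this (untested by me for this problem):

#include <windows.h>
#include <thread>
#include <vector>

void worker(int core)
{
    DWORD_PTR mask = DWORD_PTR(1) << core;             // one bit per logical processor
    SetThreadAffinityMask(GetCurrentThread(), mask);   // pin just this one thread
    // ... the thread's real work would go here ...
}

int main()
{
    std::vector<std::thread> threads;
    for (int core = 0; core < 4; ++core)
        threads.emplace_back(worker, core);
    for (auto& t : threads)
        t.join();
}

Even so, that only constrains my own threads; it doesn't stop Windows from scheduling other processes' threads onto the same cores, which is the part I still haven't found an answer for.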