Getting There from Here
In the process of developing a specialized app to download, manually parse, and display some four hundred elements of a web page, I discovered a good deal about multithreading. Some of it was straightforward; some not so much.
Every effort to speed up an app’s execution increases the complexity of the code driving that execution. When it comes to exceptionally large or complicated projects, my personal preference is to start at the most primitive level – the slowest-executing but simplest code – and modify it for speed in controllable increments, so that a stable base of operations is always an arm’s length away.
I began with a brute-force, primitive approach to every step: download the JSON data from the web server; when that completes, delimit and parse the elements that need it; and finally, display the data.
As expected, it rapidly became apparent that with such a primitive approach, processing time was going to be less than optimal. When the initial coding was complete, I was seeing a 7- to 10-second wait, whereas the browser was displaying the same data in about one second.
What’s wrong with this picture? I’m coding in pure assembly language across the board. I should be running rings around this lumbering browser.
The first, and most obvious, bottleneck was the download itself. The JSON data I was retrieving was typically over 5 megabytes, and my main thread sat idle most of the time, waiting for the server to deliver the next piece of information. Nothing else in my app moved until the entire 5+ megs had been downloaded. That was a problem.
I proceeded to log the size of the JSON data for each of my 400 elements, to see what the average per-element size would be. The average came out to just over 13k of source data per element. Being a sucker for nice, round numbers, I stepped my per-read download size down from one meg to 16,384 bytes. Instead of waiting on 5+ megs to download before being able to do any other work, I was now only waiting on individual 16k blocks.
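The chunked-read loop can be sketched in C. This is a minimal, hypothetical sketch, not the original code: `mem_recv` stands in for whatever the real socket read would be, and all names here are my own invention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 16384  /* 16k per read, per the article */

/* A stand-in byte source: in the real app this would be the socket. */
typedef struct { const char *data; size_t len, pos; } mem_src;

static size_t mem_recv(mem_src *s, char *buf, size_t want)
{
    size_t left = s->len - s->pos;
    size_t n = left < want ? left : want;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return n;
}

/* Consume the payload in 16k chunks, handing each chunk onward as soon
 * as it arrives instead of waiting for the whole 5+ MB download.
 * Returns the number of chunks seen; total bytes go in *total_out. */
static size_t download_in_chunks(mem_src *s, size_t *total_out)
{
    char buf[CHUNK_SIZE];
    size_t n, chunks = 0, total = 0;
    while ((n = mem_recv(s, buf, sizeof buf)) > 0) {
        /* ...delimit and dispatch elements found in buf here... */
        chunks++;
        total += n;
    }
    *total_out = total;
    return chunks;
}
```

The point of the structure is that the per-chunk work (delimiting, dispatching) happens inside the read loop, so the app is never idle for the full duration of the download.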
I implemented a process for delimiting elements within the JSON data, keeping in mind that a single element could be split between two 16k blocks, and a single 16k block could conceivably contain more than one element. As soon as elements became fully present in memory, new worker threads were created to process them.
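The article doesn’t show how the elements were delimited, so the following is an assumed approach: a streaming scanner that tracks brace depth to find complete top-level `{...}` objects, carrying partial elements over from one chunk to the next. String-escape handling is omitted for brevity; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Streaming delimiter sketch: an element may arrive split across two
 * 16k blocks, or several elements may land in one block. A carry-over
 * buffer accumulates the current element across chunk boundaries. */
typedef struct {
    char   buf[65536];  /* carry-over for a partial element */
    size_t len;
    int    depth;       /* current brace-nesting depth */
    size_t found;       /* complete elements seen so far */
} delim_state;

static void feed_chunk(delim_state *st, const char *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        char c = chunk[i];
        if (st->depth > 0 || c == '{') {
            if (st->len < sizeof st->buf)
                st->buf[st->len++] = c;
        }
        if (c == '{') {
            st->depth++;
        } else if (c == '}') {
            if (--st->depth == 0) {
                /* st->buf[0..st->len) holds one full element; this is
                 * where a worker thread would be launched for it. */
                st->found++;
                st->len = 0;
            }
        }
    }
}
```

Feeding the two chunk fragments `[{"a":1},{"b":` and `2}]` yields two complete elements, the second of which was split across the chunk boundary.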
If there’s a way to avoid Windows-based synchronization between threads, I’m going to use it. Using a mutex, for example, is messy. A mutex has to be created and destroyed, and the efficiency of the wait functions that drive it is an unknown. What is the OS actually doing when it’s “waiting for a single object?” How much of that wait is unnecessary?
I was able to use a much simpler method that bypassed formal synchronization. All the data downloaded was stored in one global memory block. I know, I know; those things were deprecated a century ago and I should be using the heap. But I don’t use the heap because it’s notoriously messy, limited in size, and far more complex than I feel there is any call for. It’s a personal thing. Using global memory is my own little taboo, and I engage in it with reckless abandon. The operative point is that all the downloaded data resided in a single global memory block, and each element needed to be extracted from that block while the block’s location in memory could, and did, change frequently. I define my memory blocks as moveable, meaning that every time I resize them, they can take up residence in an entirely different location. So how are such dynamic pointers passed to a worker thread?
Elementary, my dear Watson. I can pass one parameter to a thread function, so I pass a pointer to a pair of values: a pointer to the target element within the download memory block, and its size. The main thread launches the worker, then cycles through a wait loop until the size value it passed is reset to zero, indicating the worker thread has finished making a local copy of the element assigned to it. This relieves the worker thread of having to chase down the constantly moving data for the element it’s concerned with. Within the main thread, that movement is frozen only long enough for each worker thread to make its local copy – a thread’s very first order of business.
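A portable sketch of that handoff, with some substitutions: pthreads stand in for the Win32 thread API, and C11 atomics stand in for the article’s plain memory flag so the spin-wait is well-defined. The join at the end is only to keep the sketch deterministic; in the real app the main thread moves on to launch the next worker.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

/* The single thread parameter: a pointer/size pair. The worker copies
 * the element locally, then zeroes `size` as its "done copying" signal. */
typedef struct {
    const char    *ptr;   /* element inside the (movable) download block */
    atomic_size_t  size;  /* reset to 0 once the worker's copy is made */
} handoff;

static char worker_copy[64];  /* worker's local copy (sized for the demo) */

static void *worker(void *arg)
{
    handoff *h = (handoff *)arg;
    size_t n = atomic_load(&h->size);
    memcpy(worker_copy, h->ptr, n);  /* first order of business */
    atomic_store(&h->size, 0);       /* main thread may move memory now */
    /* ...parse, download thumbnail, render, etc. ... */
    return NULL;
}

static void launch_and_wait(handoff *h)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, h);
    while (atomic_load(&h->size) != 0)
        ;                            /* frozen only during the copy */
    /* safe to resize/move the download block again from here on */
    pthread_join(t, NULL);           /* demo only; real app keeps going */
}
```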
When the final thread has been launched, the original memory that the JSON file was read into can be released. It has seen its last access.
From there, each worker thread could handle every remaining step: parsing the JSON data, downloading the thumbnail image attached to it, loading WIC to process that image (which could be in JPG, PNG, GIF, or even BMP format), formatting the text to be displayed, and finally displaying the completed element.
My discovery of the “maximum thread count” issue was completely accidental. With so many concurrent threads executing, I was hitting the Windows-imposed limit of 10,000 GDI handles per process. While 98% of these handles were quite temporary, they nevertheless accumulated to the point where GDI calls were failing because too many handles were open. There was no other option but to limit the number of threads that could execute concurrently.
I started with a maximum of 256 concurrent threads, incrementally lowering that maximum until the GDI handle problem went away. In the process of stepping down, down, down the max number of concurrent threads allowed, I noticed the overall execution time of the entire process was fluctuating wildly. That’s when it became apparent that there is an optimal maximum for the number of threads executing concurrently. But how could I know what that number is?
Calibrate. Test the waters. There is no other way. There are so many factors impacting the optimal maximum that attempting to isolate them all could never be done reliably. Instead, I created a high-speed calibration process that used the assembly instruction RDTSC (read time stamp counter) to time the execution of 256 concurrent test threads. Then half that number was tested, and half again, repeating until the time went up instead of down. Once that occurred, a count halfway between the last two thread counts was tested. On my particular test system, that sequence ran 256 threads, then 128, 64, 32, 16, and 8. The time went up from 16 to 8, so I tested the point halfway between the two. On my system, the optimal number turned out to be 16 – the number of logical cores on my i9-9900K. But this number cannot automatically be assumed to be optimal. I’ve seen the target number change based solely on what a thread is doing. When feasible, it’s better to test reality than to rely on theory.
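The halving search itself can be expressed compactly. In this sketch, `measure` stands in for the RDTSC-timed run of test threads; `demo_cost` is a purely synthetic cost curve (minimum at 16) used only to exercise the search, not a real measurement.

```c
#include <stddef.h>

/* measure(n) returns the observed time for n concurrent test threads;
 * in the real app this is an RDTSC-timed batch of worker threads. */
typedef unsigned long long (*measure_fn)(int thread_count);

/* Halve the thread count from 256 while the measured time keeps
 * dropping; when it rises, probe the midpoint between the last two
 * counts, and return the best count found. */
static int calibrate_max_threads(measure_fn measure)
{
    int best_n = 256;
    unsigned long long best_t = measure(best_n);
    int n = best_n / 2;
    while (n >= 1) {
        unsigned long long t = measure(n);
        if (t >= best_t) {
            /* time went up: probe halfway between n and the last,
             * better, count before settling on an answer */
            int mid = (n + best_n) / 2;
            if (mid != n && mid != best_n) {
                unsigned long long tm = measure(mid);
                if (tm < best_t) { best_t = tm; best_n = mid; }
            }
            break;
        }
        best_t = t;
        best_n = n;
        n /= 2;
    }
    return best_n;
}

/* Synthetic cost curve with its minimum at 16, for demonstration only. */
static unsigned long long demo_cost(int n)
{
    long d = n - 16;
    if (d < 0) d = -d;
    return 100 + (unsigned long long)d * 10;
}
```

Against `demo_cost`, the search visits 256, 128, 64, 32, 16, 8, sees the time rise at 8, probes the midpoint 12, and settles on 16 – the same sequence the article describes.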
Once the optimal maximum was established via my calibration process, the download could begin. Each time a full Json element was read and delimited, it went into a queue. If there were fewer than the max number of threads executing, a new thread was launched with the oldest queue entry and that entry was nullified.
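The queue-and-launch logic might look like the following sketch. Thread creation is simulated here by a counter so the dispatch rules stay visible; in the real app, the launch step is the thread creation call and `lq_thread_done` would be driven by worker completion. All names are hypothetical.

```c
#include <stddef.h>

#define QUEUE_CAP 512  /* comfortably above the ~400 elements involved */

/* Delimited elements go into a FIFO queue; whenever fewer than
 * max_threads workers are running, the oldest entry is dispatched
 * and its slot nullified. */
typedef struct {
    const char *slots[QUEUE_CAP];
    size_t head, tail;        /* head = oldest entry, tail = next free */
    int active, max_threads;  /* running workers vs. calibrated maximum */
    int launched;             /* total dispatched, for the demo */
} launch_queue;

static void lq_init(launch_queue *q, int max_threads)
{
    q->head = q->tail = 0;
    q->active = q->launched = 0;
    q->max_threads = max_threads;
}

static void lq_pump(launch_queue *q)
{
    while (q->active < q->max_threads && q->head != q->tail) {
        const char *elem = q->slots[q->head % QUEUE_CAP];
        q->slots[q->head % QUEUE_CAP] = NULL;  /* nullify the entry */
        q->head++;
        q->active++;        /* real app: create a worker thread here */
        q->launched++;
        (void)elem;
    }
}

static void lq_enqueue(launch_queue *q, const char *elem)
{
    q->slots[q->tail % QUEUE_CAP] = elem;
    q->tail++;
    lq_pump(q);             /* launch immediately if a slot is free */
}

static void lq_thread_done(launch_queue *q)
{
    q->active--;
    lq_pump(q);             /* a finishing worker frees a slot */
}
```

With a maximum of 2, enqueueing three elements launches two immediately and holds the third until a worker finishes.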
The overall process ended up executing about 3x faster than the browser, which was more in line with what I was expecting.
In today’s operating systems, with today’s hardware, so many things are going on concurrently that it’s never safe to assume anything. One could argue that my determination of the optimal maximum number of concurrently executing threads, together with the queue system I implemented to respect that maximum, is overkill. But considering a large file download with around 400 separate elements, each needing to be processed by a separate thread, the net time savings made the coding effort well worth the trouble. The end result executed in the neighborhood of 10x faster than what I had originally coded. Threads have to be managed, and management means overhead. More is not necessarily merrier.
Look at when, where, and why your app might be waiting in a do-nothing loop for some event to occur; you can’t go wrong looking for ways to keep the CPU busy doing other things while it waits. The more you can think “parallel,” the faster your app is likely to be. How to do this is not always obvious, but web downloads, like hard drive access, are notoriously slow and are always good places to start. In my particular case, at the time of writing this article (February, 2019), my system is top of the line: an i9-9900K, 64 gigs of DDR4, and a very high-speed SSD instead of a hard drive. (Obviously, a year or so down the road, this system will be a dinosaur.) So the optimal max thread count I come up with won’t apply to the next system. This is why the calibration function runs.
Step back from the code and look at what your app is doing from a conceptual vantage point. Quantify, summarize, and observe. Identify idle points and bottlenecks and get creative about giving the CPU something constructive to do in those times. A lot of good can come from doing it.
I work as a contract developer, specializing in assembly language driver software for specialized hardware. I also focus on identifying and removing performance bottlenecks in any software, using assembly language modifications, as well as locating hard-to-find bugs. I’m a specialist; I prefer to stay away from everyday, mundane development. I can fix problems in high level languages but as a rule I don’t like working with them for general development. There is no shortage of experienced developers in any language besides assembly. I am very much a niche programmer, a troubleshooter; the closer to the hardware I work, the happier I am. I typically take on the tasks others can’t or don’t want to do.