
Unified Concurrency III - Cross-benchmarking

The goal of Unified Concurrency is to unify access to different synchronization primitives in an object-oriented fashion, with one pattern and two interfaces for general and async/await methods.
This article covers benchmarks and cross-benchmarks of the synchronization primitives implemented in the GreenSuperGreen library of the Unified Concurrency framework for two distinct scenarios, Heavy Contention and Bad Neighbor.



Introduction

The first two articles set the stage for benchmarking all of the synchronization primitives implemented under Unified Concurrency, and for cross-benchmarking them to understand the advantages, disadvantages, costs, and behavior of the synchronization primitives relative to each other, depending on the scenario and circumstances.

The first group of benchmarks studied here covers each synchronization primitive individually, in both scenarios, Heavy Contention and Bad Neighbor:

  • Monitor (C# lock) - from the previous article
  • LockUC
  • SpinLockUC
  • TicketSpinLockUC
  • AsyncLockUC

The AsyncSpinLockUC and the AsyncTicketSpinLockUC are not covered, because their behavior is exactly the same as that of their general threading counterparts, the SpinLockUC and the TicketSpinLockUC. Both async variants always return an already-completed awaiter implemented as a struct, not a class, so memory allocation is avoided and no actual awaiting takes place; the behavior is therefore the same as for the non-async counterparts, which is supported by the measured data.
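
The mechanism can be illustrated with a minimal sketch of an always-completed struct awaitable/awaiter pair (the type names below are illustrative only, not the actual GreenSuperGreen types): because IsCompleted always returns true, the await keyword takes its synchronous fast path, no continuation is scheduled and nothing is allocated on the heap.

```csharp
using System;
using System.Runtime.CompilerServices;

// Illustrative sketch: a value-type awaitable whose awaiter is always completed.
// Because IsCompleted returns true, 'await' takes the synchronous fast path:
// no continuation is scheduled and no heap allocation occurs.
public readonly struct CompletedEntryAwaitable
{
    private readonly bool _hasEntry;
    public CompletedEntryAwaitable(bool hasEntry) => _hasEntry = hasEntry;
    public CompletedEntryAwaiter GetAwaiter() => new CompletedEntryAwaiter(_hasEntry);
}

public readonly struct CompletedEntryAwaiter : ICriticalNotifyCompletion
{
    private readonly bool _hasEntry;
    public CompletedEntryAwaiter(bool hasEntry) => _hasEntry = hasEntry;

    public bool IsCompleted => true;      // always completed => no actual awaiting
    public bool GetResult() => _hasEntry; // reports whether the lock entry was acquired

    // Never invoked in practice, because IsCompleted is always true.
    public void OnCompleted(Action continuation) => continuation();
    public void UnsafeOnCompleted(Action continuation) => continuation();
}

// Usage: bool entered = await new CompletedEntryAwaitable(true);
```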

The second group covers cross-benchmarking, again for both scenarios, Heavy Contention and Bad Neighbor:

  • Monitor (C# lock), SpinLockUC and TicketSpinLockUC
  • Monitor (C# lock), LockUC
  • Monitor (C# lock), LockUC and AsyncLockUC
  • Monitor (C# lock), LockUC and AsyncLockUC, SpinLockUC, TicketSpinLockUC

The Unified Concurrency framework is implemented in the open-source GreenSuperGreen library, available on GitHub and NuGet (3 packages: lib, unit tests, benchmark).

Benchmark: Monitor (C# Lock) - Heavy Contention

The throughput period of the Monitor, C# lock, behaves very well, monotonically, similarly to the sequential throughput period and without surprises. But the amount of CPU resources spent is troubling. The shorter the throughput period gets, the more CPU resources are wasted in the heavy contention scenario! That is significant, because it runs contrary to common sense: in general we expect shorter throughput periods to mean lower CPU load, an expectation shaped by the thread suspend/resume technique, yet in these cases the Monitor, C# lock, is driven by its hybrid nature. If contention is high enough, or there is a possibility that peak load will create high contention, then the Monitor, C# lock, becomes a serious bottleneck. There are two possible cases:

  1. The high contention occurs normally, the code base is built for a high throughput period, and the problem is therefore recognized and handled soon.
  2. The high contention occurs only occasionally during peaks that are difficult to re-create with a complex code base; the performance issue can then easily be overlooked and stay hidden for a long time.

The explanation is very simple. As stated, the Monitor, C# lock, is a hybrid lock. During an attempt to gain access, the thread is not suspended immediately when another thread already owns the lock; it first spin-waits, attempting to gain access sooner in the hope of avoiding the thread suspend/resume cost, and only if that does not work is the thread suspended. This hybrid approach was introduced in the era of the first multi-core boxes and it made the throughput period faster back then, but now, with 24 cores or more, it comes with heavy costs, and with each extra core it gets worse.
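
For illustration, here is a simplified sketch of the heavy contention pattern discussed here (not the actual benchmark harness behind the charts): many threads repeatedly entering a single Monitor-protected section that performs only a very short piece of work, which is exactly the regime where Monitor's spin-waiting burns CPU.

```csharp
using System;
using System.Threading.Tasks;

// Simplified sketch of a heavy-contention workload on one Monitor (C# lock).
// Many threads compete for a single lock; the protected work is deliberately short.
public static class MonitorContentionSketch
{
    private static readonly object Gate = new object();
    private static long _counter;

    public static void Run(int threads = 24, int iterationsPerThread = 1_000_000)
    {
        var workers = new Task[threads];
        for (int i = 0; i < threads; i++)
        {
            workers[i] = Task.Factory.StartNew(() =>
            {
                for (int j = 0; j < iterationsPerThread; j++)
                {
                    lock (Gate)     // hybrid: spin-waits first, suspends the thread later
                    {
                        _counter++; // very short protected work
                    }
                }
            }, TaskCreationOptions.LongRunning);
        }
        Task.WaitAll(workers);
        Console.WriteLine($"Total entries: {_counter}");
    }
}
```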

Scenarios with only short work inside the Monitor, C# lock, but with many requests, lose a lot of CPU resources! Clearly, we can no longer neglect legacy code bases and aging software architectures lagging behind hardware architectures.

Based on the chart, throughput periods below 64 ms should be avoided on the given box, because over 60% of the CPU resources will be dissipated into heat! The number can get lower with fewer threads. Avoiding high levels of concurrency is always a good check: are we throwing too much parallelism at a sequential problem? But the throughput period at which 60% or more of the CPU resources are wasted will move higher with more CPU cores, and that is where the current trend is heading.


Chart 1: Monitor (C# lock) - Heavy Contention Benchmark

Benchmark: Monitor (C# Lock) - Bad Neighbor

Please remember that the Bad Neighbor scenario is about the cumulative effects of multiple unrelated thread pairs, each with the minimal contention of only two threads, a pair. As mentioned in the Heavy Contention report for the Monitor, C# lock, above, with low contention the throughput period achievable through the Monitor, C# lock, can be significantly lower without great CPU wastage. But even here we can see that the Monitor, C# lock, plays a role in CPU wastage, roughly twice what is expected for throughput periods below 16 ms. The Monitor, C# lock, is thus truly a Bad Neighbor: a few unrelated services/threads with low contention can cumulatively dissipate over 50% of the CPU resources into heat, with only 48% of useful work being done. This statement is based on the ideal sequential trend lines in the chart.
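
As a rough illustration of the Bad Neighbor setup (again a sketch, not the benchmark code behind the charts): several unrelated pairs of threads, each pair sharing only its own private lock, so contention within any single lock is minimal, yet the per-pair spin-waiting cost accumulates across the whole box.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative Bad Neighbor setup: N independent locks, each contended by exactly
// two threads. No pair shares state with another pair, yet the costs add up.
public static class BadNeighborSketch
{
    public static void Run(int pairs = 12, int iterationsPerThread = 500_000)
    {
        var tasks = new Task[pairs * 2];
        for (int p = 0; p < pairs; p++)
        {
            object gate = new object(); // one private lock per pair
            long counter = 0;           // one private counter per pair

            for (int t = 0; t < 2; t++)
            {
                tasks[p * 2 + t] = Task.Factory.StartNew(() =>
                {
                    for (int j = 0; j < iterationsPerThread; j++)
                    {
                        lock (gate) { counter++; } // short work, low contention
                    }
                }, TaskCreationOptions.LongRunning);
            }
        }
        Task.WaitAll(tasks);
        Console.WriteLine("All pairs finished.");
    }
}
```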


Chart 2: Monitor (C# lock) - Bad Neighbor Benchmark

Benchmark: LockUC - Heavy Contention

The LockUC is a new synchronization primitive based on TaskCompletionSource<object>. In usage it is very similar to the Monitor, C# lock, but it avoids spin-waiting, the hybrid part of the Monitor, C# lock. Access is fair. LockUC was designed to outperform the Monitor, C# lock, by avoiding its hybrid nature in the area where the thread suspend/resume technique is still useful, and to serve legacy code that does not require a super-short throughput period, across a wide range of throughput periods, without needlessly dissipating CPU resources into heat. A thread unable to gain lock access is suspended immediately! This should be the encouraged choice for all cases where we do not expect high contention, or do not even care about it. What we should care about is the impact of an overlooked synchronization primitive in edge cases that are difficult to re-create in other parts of the service; we should reach for the easiest tools with the lowest impact that can still guarantee their behavior. The LockUC is a good choice for classic threading without the async/await option. LockUC clearly outperforms the Monitor, C# lock, wasting fewer CPU resources, and its throughput period is even better down to a 1.5 ms throughput period on the given box. Below a 1.5 ms throughput period the Monitor, C# lock, gains on LockUC, but at that point both are already burning almost all available CPU resources, effectively spending more CPU on synchronization than on useful work. This suggests that threading alone with the thread suspend/resume approach will not work well beyond that line and other approaches must be considered.
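
To make the idea concrete, here is a minimal sketch of a fair, TaskCompletionSource-based lock in the spirit of LockUC (not the actual LockUC implementation; the internal gate below protects only a few instructions of bookkeeping): a thread that cannot take the lock parks itself on a task and is suspended immediately, with no spin-waiting, and release hands ownership to the oldest waiter.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Minimal sketch of a fair lock built on TaskCompletionSource (FIFO hand-off).
public sealed class TcsLockSketch
{
    private readonly object _gate = new object(); // guards only short bookkeeping
    private readonly Queue<TaskCompletionSource<object>> _waiters
        = new Queue<TaskCompletionSource<object>>();
    private bool _held;

    public void Enter()
    {
        TaskCompletionSource<object> waiter;
        lock (_gate)
        {
            if (!_held) { _held = true; return; } // fast path: the lock was free
            waiter = new TaskCompletionSource<object>(
                TaskCreationOptions.RunContinuationsAsynchronously);
            _waiters.Enqueue(waiter);             // FIFO queue => fair access
        }
        waiter.Task.Wait();                       // thread suspends immediately, no spinning
    }

    public void Exit()
    {
        TaskCompletionSource<object> next = null;
        lock (_gate)
        {
            if (_waiters.Count > 0) next = _waiters.Dequeue();
            else _held = false;
        }
        next?.SetResult(null);                    // wakes exactly one waiter, in arrival order
    }
}
```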


Chart 3: LockUC - Heavy Contention Benchmark

Benchmark: LockUC - Bad Neighbor

Please remember that the Bad Neighbor scenario is about the cumulative effects of multiple unrelated thread pairs, each with minimal contention. The LockUC loses a little throughput period to the Monitor, C# lock, but in exchange the CPU consumption is significantly improved. Where the throughput period would be an issue, another synchronization primitive should be used, like SpinLockUC.


Chart 4: LockUC - Bad Neighbor Benchmark

Benchmark: SpinLockUC/AsyncSpinLockUC - Heavy Contention

The SpinLockUC is a thin wrapper around the .NET SpinLock struct and is designed to offer the best throughput period for cases where the accessing threads have a low probability of meeting each other in the protected section and giving up fairness helps improve the throughput period. If the operation in the protected section is very short, it is hard, even with 24 threads, to create a situation where multiple threads spin for a long time to gain access; more probably, by the time a new access request arrives, the previous one has already been processed. That is the ideal situation, but the chart above shows that 24 threads on the given box reach a critical point once the throughput period gets worse than 324 µs. Past that point, under high contention, some threads effectively spend their time spinning to gain access, even for seconds, when more threads attempt to gain access at the same time, because the access is not fair. This situation is a great problem on NUMA boxes, because some threads can access certain memory more easily than others, depending on the NUMA node distance between the particular memory and the CPU core, which amplifies the lack of fairness. The relative cost of cross-NUMA-node access can be measured with the CoreInfo tool from Mark Russinovich, Microsoft: https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo. The design requirement to seek the best possible throughput period for a SpinLock comes with this price. Please understand that these limitations are not errors left in the SpinLock implementation! They are intrinsic properties of synchronization primitives based on atomic instructions where access fairness is not guaranteed. The future seems to be heading toward more CPU cores and more NUMA nodes in the same box, which promises that future boxes will push the critical point of the throughput period well below 324 µs. We should be prepared for this and count on it; these issues are the reason why Red Hat Enterprise Linux already implements a TicketSpinLock. It has a slightly worse throughput period, but in exchange the TicketSpinLock can guarantee fair access, which is important for better load balancing and pushes the system closer to real-time operating system qualities.

Unfair access can play a significant role in unexplained momentary degradations of the throughput period, and we should keep in mind how to avoid it! We must make the execution inside the SpinLock entry very fast, ideally only memory update instructions, avoiding slow computations, strictly avoiding thread suspension from I/O or kernel calls, and if possible also avoiding memory allocation.
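
As a hedged illustration of those rules, here is a thin wrapper around the .NET SpinLock struct (a sketch only, not the actual SpinLockUC code), together with the kind of entry that belongs inside it: a few memory updates, nothing more.

```csharp
using System;
using System.Threading;

// Sketch of a thin System.Threading.SpinLock wrapper. Keep the protected work
// to a few memory updates: no I/O, no kernel calls, ideally no allocation.
public sealed class SpinLockSketch
{
    // Owner tracking disabled: the cheapest configuration, no reentrancy detection.
    private SpinLock _spinLock = new SpinLock(enableThreadOwnerTracking: false);

    public void Execute(Action criticalSection)
    {
        bool lockTaken = false;
        try
        {
            _spinLock.Enter(ref lockTaken); // spin-waits instead of blocking on a wait handle
            criticalSection();              // keep this extremely short
        }
        finally
        {
            if (lockTaken) _spinLock.Exit(useMemoryBarrier: true);
        }
    }
}

// Usage: only trivial shared-state updates inside the entry, e.g.
//   var gate = new SpinLockSketch();
//   long counter = 0;
//   gate.Execute(() => counter++);
```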

The unfairness of the SpinLockUC is clearly visible in Chart 5: it causes peaks in the throughput period statistics, because different threads fare starkly differently. For throughput periods above 324 µs, some threads get many entries while a few get no entries at all for ten seconds.

Chart 6 shows only the throughput period, specifically the median and maximum throughput period, where we can see the behavior of two distinct groups of threads: one that almost always gets the entry, staying close to the ideal throughput period, and another closer to the maximum throughput period of 10 seconds, threads that do not get many entries, or any entry at all! It has to be said that attempting to execute long-running code inside a SpinLock is simply a bad idea, and Charts 5 and 6 show how pointless it is.


Chart 5: SpinLockUC/AsyncSpinLockUC - Heavy Contention Benchmark


Chart 6: SpinLockUC/AsyncSpinLockUC - Heavy Contention Benchmark - Unfairness - Median vs. Max Throughput Period

Benchmark: SpinLockUC/AsyncSpinLockUC - Bad Neighbor

The SpinLockUC, based on the .NET SpinLock struct, performs quite well in the Bad Neighbor scenario, because the contention is minimal yet still concurrent with a pair of threads. SpinLock is designed for situations where contention is infrequent and the execution inside the locked entry is kept extremely short. The CPU wastage can be lowered further by balancing the code so that the execution path spends more time outside the locked entry than inside it.


Chart 7: SpinLockUC/AsyncSpinLockUC - Bad Neighbor Benchmark

Benchmark: TicketSpinLockUC/AsyncTicketSpinLockUC - Heavy Contention

The TicketSpinLockUC is a general TicketSpinLock implementation: it keeps contending the atomic operation until it gains access based on the ticket order of arrival, so the access is fair. There is no spin-waiting with yielding or sleeping as in the .NET SpinLock, and for that it pays the heavy cost of atomic operation contention, consuming a lot of CPU resources; because of the fairness it also loses on throughput periods below 200 µs on the given box, but that is not important! It is meant for extremely short execution paths in the locked entry, ideally just a few instructions or updates of a few variables, with the certainty that fair access, FIFO order, is ensured. This is a synchronization primitive useful for load balancing algorithms that need guaranteed thread starvation protection.
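
The principle can be sketched in a few lines (illustrative only, not the actual TicketSpinLockUC code): each caller atomically draws a ticket, and access is granted strictly in ticket order, so the lock is fair at the price of contended atomic operations and pure busy-waiting.

```csharp
using System.Threading;

// Sketch of a classic ticket spin lock: fair (FIFO) access, pure busy-waiting.
public sealed class TicketSpinLockSketch
{
    private int _nextTicket; // next ticket to hand out
    private int _nowServing; // ticket currently allowed to enter

    public void Enter()
    {
        // Interlocked.Increment returns the new value; subtracting 1 gives this caller's ticket.
        int myTicket = Interlocked.Increment(ref _nextTicket) - 1;

        // Busy-spin (no yielding, no sleeping) until it is this ticket's turn.
        while (Volatile.Read(ref _nowServing) != myTicket)
            Thread.SpinWait(1); // CPU pause hint; the time slice is not given up
    }

    public void Exit()
    {
        Interlocked.Increment(ref _nowServing); // admit the next ticket in FIFO order
    }
}
```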


Chart 8: TicketSpinLockUC/AsyncTicketSpinLockUC - Heavy Contention Benchmark

Benchmark: TicketSpinLockUC/AsyncTicketSpinLockUC - Bad Neighbor

The TicketSpinLockUC implementation shows an improvement in throughput periods below 200 µs on the given box as the number of threads per synchronization primitive is lowered to just two in the Bad Neighbor scenario. This is a great example that throwing too much parallelism at a sequential problem is a bad idea, and it gives us an incentive to strike the right balance between the number of threads and the ratio of execution time inside the locked entry to the time outside it, where the latter should be the longer one.


Chart 9: TicketSpinLockUC/AsyncTicketSpinLockUC - Bad Neighbor Benchmark

Benchmark: AsyncLockUC - Heavy Contention

The AsyncLockUC is the only real async/await synchronization primitive here, one that actually returns an incomplete awaiter; timing and scheduling are then based on the ThreadPool. Obviously, if the ThreadPool threads are depleted, the synchronization will lose performance, but let's face it, that would also imply poorly designed software, and suspend/resume synchronization techniques would lose performance in such a case as well, due to many active threads competing for CPU time slots.

The AsyncLockUC shows interesting performance, and specifically the CPU wastage here is very low. Access is fair. The locking of its internal data structures is lock-free, and as fast and as short as possible, to avoid any contention. This is the only synchronization primitive in this report that does not cause CPU gridlock, the moment when the CPU is burning cycles while the amount of work actually done is very small, because the cost of synchronization is taking almost all the CPU resources.
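
A minimal sketch of the idea behind such an async lock follows (illustrative only, not the actual AsyncLockUC code; the real implementation keeps its internals lock-free, while the sketch uses a tiny Monitor gate to stay short): a caller that cannot enter receives an incomplete task, and on release the next waiter's task is completed so its continuation runs on the ThreadPool, with no thread ever blocked or spinning.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch of a fair async lock: FIFO hand-off through TaskCompletionSource.
public sealed class AsyncLockSketch
{
    private readonly object _gate = new object(); // guards only short bookkeeping
    private readonly Queue<TaskCompletionSource<IDisposable>> _waiters
        = new Queue<TaskCompletionSource<IDisposable>>();
    private bool _held;

    public Task<IDisposable> EnterAsync()
    {
        lock (_gate)
        {
            if (!_held)
            {
                _held = true;
                return Task.FromResult<IDisposable>(new Releaser(this)); // completed => no suspension
            }
            var waiter = new TaskCompletionSource<IDisposable>(
                TaskCreationOptions.RunContinuationsAsynchronously); // continuations are not run inline
            _waiters.Enqueue(waiter);                                // FIFO queue => fair access
            return waiter.Task;                                      // incomplete task => caller awaits
        }
    }

    private void Exit()
    {
        TaskCompletionSource<IDisposable> next = null;
        lock (_gate)
        {
            if (_waiters.Count > 0) next = _waiters.Dequeue();
            else _held = false;
        }
        next?.SetResult(new Releaser(this)); // ownership passes to the oldest waiter
    }

    private sealed class Releaser : IDisposable
    {
        private readonly AsyncLockSketch _owner;
        public Releaser(AsyncLockSketch owner) => _owner = owner;
        public void Dispose() => _owner.Exit();
    }
}

// Usage:
//   using (await asyncLock.EnterAsync()) { /* protected work, async-friendly */ }
```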

The AsyncLockUC demonstrates that it can handle any workload, including peak Heavy Contention; it does not dissipate CPU resources needlessly, and other services on the box still have resources available to carry on with their duties. The CPU usage here stays below 42%!

The AsyncLockUC is as close to an Ideal Synchronization Primitive as it can get, because for throughput periods longer than 200 µs its behavior rapidly approaches the Ideal Synchronization Primitive's performance!


Chart 10: AsyncLockUC - Heavy Contention Benchmark

Benchmark: AsyncLockUC - Bad Neighbor

The AsyncLockUC seems to behave reasonably in the Bad Neighbor scenario as well.


Chart 11: AsyncLockUC - Bad Neighbor Benchmark

Cross-Benchmark: Monitor (C# Lock) & SpinLockUC & TicketSpinLockUC - Heavy Contention

Finally, we get to the first Cross-Benchmark! The Monitor, C# lock, has a hybrid nature, so it is fitting to cross-examine its behavior against the SpinLockUC and the TicketSpinLockUC.

The Monitor, C# lock, is a thread suspend/resume synchronization primitive enhanced with a hybrid heuristic that avoids thread suspension through spin-waiting, but that is not free of charge! We can see that below a 1 ms throughput period the SpinLockUC and the Monitor, C# lock, behave very much the same in both CPU usage and throughput period. The TicketSpinLockUC, with its fair access, loses to them on the throughput period.

I believe the chart nicely shows the relations, and especially where to use what:

  1. Obviously, the SpinLockUC should be avoided for any throughput period above 324 µs, because thread starvation becomes highly probable.
  2. The TicketSpinLockUC should be used only where we need guaranteed fairness, such as load balancing algorithms, and only for very short operations in the locked area; I would suggest throughput periods below 100 µs, because otherwise the CPU waste is not reasonable.
  3. This leaves an open question: what would be a replacement for the Monitor, C# lock, between 324 µs and 32 ms throughput periods?


Chart 12: Monitor (C# lock) & SpinLockUC & TicketSpinLockUC - Heavy Contention Cross-Benchmark

Cross-Benchmark: Monitor (C# lock) & SpinLockUC & TicketSpinLockUC - Bad Neighbor

The Cross-Benchmark of the Monitor, C# lock, in the Bad Neighbor scenario clearly shows where the SpinLockUC will find its best use: in algorithms where the throughput period is short and contention is minimal. There the SpinLockUC can save considerable CPU resources, because the hybrid nature of the Monitor, C# lock, in these cases clearly wastes a lot of CPU.


Chart 13: Monitor (C# lock) & SpinLockUC & TicketSpinLockUC - Bad Neighbor Cross-Benchmark

Cross-Benchmark: Monitor (C# lock) & LockUC - Heavy Contention

The Cross-Benchmark of the Monitor, C# lock, and its reasonable replacement, the LockUC, in the Heavy Contention scenario answers which synchronization primitive to use between 324 µs and 32 ms throughput periods. The LockUC clearly serves well as a replacement, because its CPU wastage is lower and it degrades much later than the Monitor, C# lock.


Chart 14: Monitor (C# lock) & LockUC - Heavy Contention Cross-Benchmark

Cross-Benchmark: Monitor (C# lock) & LockUC - Bad Neighbor

The Bad Neighbor scenario for the Cross-Benchmark of the Monitor, C# lock, and the LockUC with minimal contention shows the performance issues of the Monitor, C# lock: even for extremely short execution paths, where the LockUC's CPU waste is dropping, the Monitor, C# lock, is still burning the CPU!


Chart 15: Monitor (C# lock) & LockUC - Bad Neighbor Cross-Benchmark

Cross-Benchmark: Monitor (C# lock) & LockUC & AsyncLockUC - Heavy Contention

Finally, we can examine the Cross-Benchmark of the Monitor (C# lock), the LockUC, and the objectively best synchronization primitive, the AsyncLockUC, in the Heavy Contention scenario.

If we have to stay in the general threading environment, it is best to replace the Monitor (C# lock) with the LockUC, because it wastes fewer CPU resources and in some configurations even does more work in the same time.

If we can rewrite the code for the async/await environment, we can choose the best option there is under GreenSuperGreen/Unified Concurrency, the AsyncLockUC, which is as close to the Ideal Synchronization Primitive as it can get, while also avoiding the CPU gridlock of the other synchronization primitives!

There are cases previously carried by the Monitor, C# lock, where a short throughput period is necessary regardless of CPU wastage. I hope the benchmarks have sufficiently shown that the CPU wastage of using the Monitor, C# lock, in such cases is insanely high. Such throughput periods are achievable with the SpinLockUC/AsyncSpinLockUC and the TicketSpinLockUC/AsyncTicketSpinLockUC, and all the benchmarking suggests that these parts of the code require different thinking and design, otherwise they become a bottleneck.


Chart 16: Monitor (C# lock) & LockUC & AsyncLockUC - Heavy Contention Cross-Benchmark

Cross-Benchmark: Monitor (C# lock) & LockUC & AsyncLockUC - Bad Neighbor

The Bad Neighbor scenario should not bring any surprise; the suggestions are the very same as for the Heavy Contention scenario: replace the Monitor (C# lock) with the LockUC when staying in the general threading environment, because it wastes fewer CPU resources and in some configurations even does more work in the same time, and choose the AsyncLockUC when the code can be rewritten for async/await, as it comes closest to the Ideal Synchronization Primitive while avoiding the CPU gridlock of the other synchronization primitives.


Chart 17: Monitor (C# lock) & LockUC & AsyncLockUC - Bad Neighbor Cross-Benchmark

Cross-Benchmark: Monitor (C# lock) & LockUC & AsyncLockUC & SpinLockUC & TicketSpinLockUC - Heavy Contention

Finally, we can examine the full Cross-Benchmark of the Monitor (C# lock), the LockUC, the AsyncLockUC, the SpinLockUC, and the TicketSpinLockUC in the Heavy Contention scenario. The conclusions of the previous cross-benchmarks hold here as well: in the general threading environment, the LockUC is the best replacement for the Monitor (C# lock), wasting fewer CPU resources and in some configurations even doing more work in the same time; where the code can be rewritten for async/await, the AsyncLockUC is the best option under GreenSuperGreen/Unified Concurrency, as close to the Ideal Synchronization Primitive as it can get while avoiding the CPU gridlock of the other synchronization primitives.

Where a short throughput period is necessary regardless of CPU wastage, only the SpinLockUC/AsyncSpinLockUC and the TicketSpinLockUC/AsyncTicketSpinLockUC can deliver it, and all the benchmarking suggests that these parts of the code require different thinking and design, otherwise they become a bottleneck.


Chart 18: Monitor (C# lock) & LockUC & AsyncLockUC & SpinLockUC & TicketSpinLockUC - Heavy Contention Cross-Benchmark

Cross-Benchmark: Monitor (C# lock) & LockUC & AsyncLockUC & SpinLockUC & TicketSpinLockUC - Bad Neighbor

The Bad Neighbor scenario again brings no surprise; the suggestions are the very same as for the Heavy Contention scenario: prefer the LockUC over the Monitor (C# lock) when staying in the general threading environment, and the AsyncLockUC wherever the code can be rewritten for async/await.


Chart 19: Monitor (C# lock) & LockUC & AsyncLockUC & SpinLockUC & TicketSpinLockUC - Bad Neighbor Cross-Benchmark

Summary

This article covered the benchmarks and cross-benchmarks of the synchronization primitives implemented in the GreenSuperGreen library of the Unified Concurrency framework for two distinct scenarios, Heavy Contention and Bad Neighbor.

We discussed the advantages, disadvantages, costs, weaknesses, strengths, and behavior of the synchronization primitives, individually and in comparison to each other, based on the scenarios and circumstances.

Revision History

  • 28th June, 2018: Complete Cross-benchmark: Monitor (C# lock), LockUC and AsyncLockUC, SpinLockUC, TicketSpinLockUC

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


