Introduction
I’ve recently wanted to demonstrate the relevance of the .NET ReaderWriterLock[Slim] synchronization primitives.
It’s good to hear from the vendor that they’re better, faster, stronger, but when you can it’s always worth evaluating them yourself. Not that I don’t trust vendors, but I like to have hard numbers, particularly when I assert something that can be critical for my participants’ developments.
So I’ve built a small, simple and, I hope, relevant benchmark to measure the performance impact of ReaderWriterLock[Slim] compared to the naive and uniform use of a Monitor via the C# lock statement.
I wanted to check two things:
- that the RW locks behave as advertised,
- what the profile of the gain function looks like.
In this article I’ll explain the rationales behind the benchmark, how I’ve implemented it and finally present the results.
Benchmark presentation
Workload
The work of each thread is simulated by pausing it with Thread.Sleep for 1 ms, a middle value between heavy operations that would last seconds and light ones that would last 1 µs.
The proportion/probability of write operations, p, varies linearly in a given range, typically from 0% (only reads) to 100% (only writes).
If you have specific workload profiles you can adapt this part:
- by using a sub-range, e.g. if you know the read and write operations are balanced,
- by overweighting either the read or write part duration if you know one is longer,
- or even by randomizing them according to the probability distribution of your workload.
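For instance, overweighting writes and randomizing durations could look like the following sketch; the helper name, base durations and jitter values are illustrative assumptions, not part of the benchmark:

```csharp
using System;
using System.Threading;

static class WorkloadSketch
{
    static readonly Random Rand = new Random();

    // Writes are assumed 5x longer than reads here (overweighting the write part),
    // and a small random component is added to mimic a non-uniform workload.
    public static void SimulateOperation(bool isWrite)
    {
        int baseMs = isWrite ? 5 : 1;   // assumed base durations
        int jitterMs = Rand.Next(0, 3); // random part of the workload
        Thread.Sleep(baseMs + jitterMs);
    }
}
```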
General workflow
The benchmark scans the range of levels of concurrency.
For each level of concurrency, c, it scans the range of writes proportions, p.
For each pair (c, p) it runs c threads that will each execute a loop of n operations.
For each iteration, the operation to simulate (read or write) is chosen randomly according to the proportion of writes p.
To randomly pick the operation type, a random integer is generated with Random.Next(), then divided by int.MaxValue to obtain a number uniformly distributed in the range [0, 1).
If this number falls into the [0, p] range then we simulate a write, otherwise we simulate a read.
Threads synchronization
To ensure that for a given combination (c, p) all the threads start at the same time, and that none can start, run, or even end before the others have been given a chance to run, they are synchronized using a Barrier, another synchronization primitive, which ensures that a set of threads all reach the same point of execution before continuing.
When all the threads are ready, the main thread gives the go-ahead and starts to measure the total time it takes for all the threads to execute all their operations using a Stopwatch instance.
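As a minimal standalone illustration of this pattern (a sketch separate from the benchmark code, with an arbitrary worker count and sleep duration), the main thread participates in the Barrier and starts the Stopwatch only once every worker has signaled it is ready:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class BarrierSketch
{
    static void Main()
    {
        const int workerCount = 4;
        // workerCount worker participants + 1 for the main thread.
        var barrier = new Barrier(workerCount + 1);
        var threads = new Thread[workerCount];

        for (int i = 0; i < workerCount; ++i)
        {
            threads[i] = new Thread(() =>
            {
                barrier.SignalAndWait(); // block until everyone, including main, is ready
                Thread.Sleep(10);        // simulated work
            }) { IsBackground = true };
            threads[i].Start();
        }

        var stopwatch = Stopwatch.StartNew();
        barrier.SignalAndWait();         // releases all the workers at the same time
        foreach (var thread in threads)
        {
            thread.Join();               // wait for every worker to finish
        }
        stopwatch.Stop();
        Console.WriteLine("All workers done in {0} ms.", stopwatch.ElapsedMilliseconds);
    }
}
```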
Benchmark implementation
The benchmark component’s code
The benchmark itself has been reified as a class to allow for easy configuration: the range of p, the range of c, and n are passed to the benchmark’s single constructor.
The benchmark class produces the results but does not display them, which is the responsibility of the caller.
Here is the benchmark class itself:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

namespace ReaderWriterLockBenchmark
{
    public class ReaderWriterLockBenchmark
    {
        private readonly int n;
        private readonly decimal pMin, pMax, pStep;
        private readonly int threadCountMin, threadCountMax, threadCountStep;

        private decimal p;
        private Barrier barrier;
        private readonly object @lock = new object();
        private readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();

        // Reference workload: every operation takes the same uniform Monitor lock.
        void SimpleLocking()
        {
            barrier.SignalAndWait();
            for (int i = 1; i <= n; ++i)
            {
                lock (@lock)
                {
                    Thread.Sleep(1);
                }
            }
        }

        // RW workload: each iteration is randomly a write (probability p) or a read.
        void RWLocking()
        {
            barrier.SignalAndWait();
            // Note: on .NET Framework, threads started within the same tick may get the same seed.
            Random rand = new Random();
            for (int i = 1; i <= n; ++i)
            {
                if (1.0m * rand.Next() / int.MaxValue <= p)
                {
                    // Enter outside the try block: if Enter throws,
                    // the lock is not held and Exit must not be called.
                    rwLock.EnterWriteLock();
                    try
                    {
                        Thread.Sleep(1);
                    }
                    finally
                    {
                        rwLock.ExitWriteLock();
                    }
                }
                else
                {
                    rwLock.EnterReadLock();
                    try
                    {
                        Thread.Sleep(1);
                    }
                    finally
                    {
                        rwLock.ExitReadLock();
                    }
                }
            }
        }

        public ReaderWriterLockBenchmark(int n, decimal pMin, decimal pMax, decimal pStep, int threadCountMin, int threadCountMax, int threadCountStep)
        {
            this.n = n;
            this.pMin = pMin;
            this.pMax = pMax;
            this.pStep = pStep;
            this.threadCountMin = threadCountMin;
            this.threadCountMax = threadCountMax;
            this.threadCountStep = threadCountStep;
        }

        private IDictionary<int, TimeSpan> MeasureReferenceTimes()
        {
            IDictionary<int, TimeSpan> simpleLockingTimes = new Dictionary<int, TimeSpan>();
            for (int threadCount = threadCountMin; threadCount <= threadCountMax; threadCount += threadCountStep)
            {
                Console.WriteLine("Now with {0} threads.", threadCount);
                Thread[] threads = new Thread[threadCount];
                barrier = new Barrier(threadCount + 1);
                for (int i = 0; i < threadCount; ++i)
                {
                    Thread thread = new Thread(SimpleLocking) { IsBackground = true };
                    threads[i] = thread;
                    thread.Start();
                }
                Stopwatch stopwatch = Stopwatch.StartNew();
                barrier.SignalAndWait();
                for (int i = 0; i < threadCount; ++i)
                {
                    threads[i].Join();
                }
                stopwatch.Stop();
                simpleLockingTimes[threadCount] = stopwatch.Elapsed;
            }
            return simpleLockingTimes;
        }

        private IDictionary<int, IDictionary<decimal, double>> Benchmark(IDictionary<int, TimeSpan> simpleLockingTimes)
        {
            IDictionary<int, IDictionary<decimal, double>> results = new Dictionary<int, IDictionary<decimal, double>>();
            for (int threadCount = threadCountMin; threadCount <= threadCountMax; threadCount += threadCountStep)
            {
                Console.WriteLine("Now with {0} threads.", threadCount);
                results[threadCount] = new Dictionary<decimal, double>();
                Thread[] threads = new Thread[threadCount];
                Stopwatch stopwatch = new Stopwatch();
                for (p = pMin; p <= pMax; p += pStep)
                {
                    barrier = new Barrier(threadCount + 1);
                    for (int i = 0; i < threadCount; ++i)
                    {
                        Thread thread = new Thread(RWLocking) { IsBackground = true };
                        threads[i] = thread;
                        thread.Start();
                    }
                    stopwatch.Restart();
                    barrier.SignalAndWait();
                    for (int i = 0; i < threadCount; ++i)
                    {
                        threads[i].Join();
                    }
                    stopwatch.Stop();
                    TimeSpan rwLockingTime = stopwatch.Elapsed;
                    double ratio = rwLockingTime.TotalMilliseconds / simpleLockingTimes[threadCount].TotalMilliseconds;
                    results[threadCount][p] = ratio;
                }
            }
            return results;
        }

        public IDictionary<int, IDictionary<decimal, double>> Run()
        {
            IDictionary<int, TimeSpan> simpleLockingTimes = MeasureReferenceTimes();
            return Benchmark(simpleLockingTimes);
        }
    }
}
The benchmark application’s code
And here is the code that uses the benchmark and renders the results as a tab-separated table in the console:
using System;
using System.Collections.Generic;

namespace ReaderWriterLockBenchmark
{
    class Program
    {
        static void Main(string[] args)
        {
            int n = 200;
            decimal pMin = 0m;
            decimal pMax = 1m;
            decimal pStep = 0.1m;
            int threadCountMin = 1;
            int threadCountMax = 8;
            int threadCountStep = 1;

            ReaderWriterLockBenchmark benchmark = new ReaderWriterLockBenchmark(n, pMin, pMax, pStep, threadCountMin, threadCountMax, threadCountStep);
            IDictionary<int, IDictionary<decimal, double>> results = benchmark.Run();

            // Header row: one column per thread count.
            for (int threadCount = threadCountMin; threadCount <= threadCountMax; threadCount += threadCountStep)
            {
                Console.Write("\t{0}", threadCount);
            }
            Console.WriteLine();

            // One row per value of p, one ratio per thread count.
            for (decimal p = pMin; p <= pMax; p += pStep)
            {
                Console.Write(p);
                for (int threadCount = threadCountMin; threadCount <= threadCountMax; threadCount += threadCountStep)
                {
                    Console.Write("\t{0:N3}", results[threadCount][p]);
                }
                Console.WriteLine();
            }
        }
    }
}
From there the output can be copy-pasted into your favorite analysis tool, typically Excel.
Note that with this design you could go a step further: instead of updating the parameters in the Main method, you could pass them as arguments of the executable, parse them in Main, and forward them to the benchmark instance.
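Such a parsing step could look like this sketch; the positional argument order and the lack of validation are assumptions for illustration only:

```csharp
using System;
using System.Globalization;

class ProgramSketch
{
    // Assumed usage: ReaderWriterLockBenchmark <n> <pMin> <pMax> <pStep> <cMin> <cMax> <cStep>
    static void Main(string[] args)
    {
        int n = int.Parse(args[0]);
        // InvariantCulture so "0.1" parses the same way regardless of the user's locale.
        decimal pMin = decimal.Parse(args[1], CultureInfo.InvariantCulture);
        decimal pMax = decimal.Parse(args[2], CultureInfo.InvariantCulture);
        decimal pStep = decimal.Parse(args[3], CultureInfo.InvariantCulture);
        int threadCountMin = int.Parse(args[4]);
        int threadCountMax = int.Parse(args[5]);
        int threadCountStep = int.Parse(args[6]);

        var benchmark = new ReaderWriterLockBenchmark(n, pMin, pMax, pStep, threadCountMin, threadCountMax, threadCountStep);
        // ... run the benchmark and render the results as before ...
    }
}
```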
Results
I’ve run the benchmark with 1 to 8 threads as well as 50 threads, with p varying from 0% to 100% in steps of 10%.
Here are the raw results:
| p\c | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 50 |
|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0.0 | 1.003 | 0.502 | 0.335 | 0.249 | 0.200 | 0.167 | 0.145 | 0.128 | 0.020 |
| 0.1 | 0.997 | 0.658 | 0.500 | 0.404 | 0.347 | 0.299 | 0.287 | 0.197 | 0.111 |
| 0.2 | 0.995 | 0.741 | 0.606 | 0.540 | 0.424 | 0.410 | 0.400 | 0.360 | 0.185 |
| 0.3 | 0.999 | 0.838 | 0.684 | 0.550 | 0.573 | 0.504 | 0.522 | 0.462 | 0.320 |
| 0.4 | 0.998 | 0.840 | 0.755 | 0.698 | 0.602 | 0.597 | 0.581 | 0.539 | 0.379 |
| 0.5 | 0.999 | 0.898 | 0.811 | 0.761 | 0.750 | 0.688 | 0.650 | 0.655 | 0.500 |
| 0.6 | 0.997 | 0.923 | 0.880 | 0.823 | 0.775 | 0.699 | 0.765 | 0.669 | 0.573 |
| 0.7 | 0.998 | 0.973 | 0.910 | 0.860 | 0.804 | 0.788 | 0.801 | 0.787 | 0.723 |
| 0.8 | 0.995 | 0.960 | 0.928 | 0.894 | 0.890 | 0.882 | 0.853 | 0.854 | 0.779 |
| 0.9 | 0.995 | 0.981 | 0.965 | 0.935 | 0.948 | 0.938 | 0.929 | 0.912 | 0.938 |
| 1.0 | 0.999 | 1.001 | 1.000 | 0.996 | 1.001 | 1.001 | 1.002 | 1.003 | 1.000 |
And plotted:
A basic analysis shows that the main characteristics are as expected:
- when we only perform reads we’re as fast as possible: with c threads it’s c times faster,
- when we only perform writes there is no gain.
What is more interesting, and what I personally didn’t expect, is that this function is concave.
It’s linear only asymptotically, as c tends towards infinity (and even with 50 threads it’s already quite linear).
I honestly don’t have the skills in queuing theory to fully explain the results, but I guess this can be modeled this way:

T(RW) = n × (1 + E[W]) × d, with d the duration of one operation (1 ms here)

And with normalized time, dividing by the simple-locking time T(lock) = n × c × d:

ratio = (1 + E[W]) / c

With W a random variable representing the time, in units of d, a thread will wait in the ReaderWriterLockSlim waiting queue.
If p = 0% then W will be 0 and the ratio will always be 1/c.
If p = 100% then W will be c − 1, because each time a thread enters the queue it has to wait for all the other threads to finish, and the ratio will always be (1 + (c − 1))/c = 1.
Hopefully this is consistent with our empirical results.
And finally, according to our results, E[W] should tend to p × (c − 1) with increasing c, or something similar, so that the final ratio tends to p.
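Under a simple waiting-time model (my own assumption, not an established result), the normalized time for large c would be (1 + p × (c − 1)) / c, which tends to p as c grows. A quick sketch computing this prediction for the 50-thread case:

```csharp
using System;

class ModelCheck
{
    static void Main()
    {
        const int c = 50;
        // Tentative model: ratio ≈ (1 + p * (c - 1)) / c, tending to p as c grows.
        for (int i = 0; i <= 10; ++i)
        {
            double p = i / 10.0;
            double predicted = (1.0 + p * (c - 1)) / c;
            Console.WriteLine("p = {0:N1} -> predicted ratio {1:N3}", p, predicted);
        }
    }
}
```

For p = 0 this predicts 0.020, matching the measured 0.020, and for p = 0.5 it predicts 0.510 against a measured 0.500, so the model at least seems plausible at high concurrency.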
Conclusion
Concerning efficiency, we’ve confirmed that using a finer locking policy, distinguishing between read and write operations, allows for a significant gain in performance.
Again this is not an absolute result, as it could greatly vary depending on your use case, but it’s consistent with what we expect from such a synchronization primitive.
As for the gain function, its profile was less expected: it’s not linear, but tends to be with increasing concurrency.
So with infinite concurrency the theoretical normalized time tends to p, whatever p.
If you have an idea of what the random variable W looks like, please share it by leaving a comment; it would be interesting to compute the theoretical results and match them against the empirical results presented above.
I hope there is no flaw in the benchmark that would explain the strange profile, but I’d like to be sure.
If you catch any typo or mistake, or have additional questions or remarks, feel free to leave a comment.
To make it short, I’m an IT trainer specialized in the .NET ecosystem (framework, C#, WPF, Excel add-ins...).
(I’m available in France and bordering countries, and I only teach in French.)
I like to learn new things, particularly to understand what happens under the hood, and I do my best to share my humble knowledge with others by direct teaching, by posting articles on my blog (pragmateek.com), or by answering questions on forums.