Fast critical sections with timeout

valdok

Rate me:

4.85/5 (49 votes)

26 May 2007CPOL20 min read

235.1K

A faster and lighter ciritcal section implementation.

Download source - 2.7 KB

Introduction

This article observes critical sections: what are they and what are they for, and how they're implemented in Windows. Next, we'll see an implementation of critical sections with some performance advantages and extra functionality (timeout).

Readers who are not very experienced in multi-threaded applications will find this article educational, since it covers the synchronization problems in all the details. Those who are experienced in multithreaded, high-performance applications may find my optimization useful in some cases, though it's not a revolutionary performance breakthrough (critical sections are already fast enough, in most of the cases). Those can skip right to the final section.

What is synchronization?

In operating systems such as Windows and Linux, the decision to grant a processor time to a particular thread is up to the system. An application is not allowed to interfere the threads scheduling mechanism. Even assigning different priorities to threads does not guarantee the order of execution. In particular, two consequent processor instructions are not guaranteed to execute without preemption by another thread.

In multi-threaded applications, there's usually a need to access some resource from within different threads. Depending on the type of the resource/object we access, we may or may not admit simultaneous access to it. For example, there's no problem when multiple threads read some global parameter simultaneously. There's, however, a problem if you want to append a node in a shared linked list, for example. Inserting a node into a linked list involves several steps, hence we can be preempted in the middle, and until we resume, the linked list is left in the inconsistent state. We say that insertion into a linked list is not an atomic operation (atom, in Greek, means non-dividable).

So, we need a solution. As was said already, we can not "ask" the system not to preempt us until we finish the operation. It may seem that we can solve this simply by keeping a boolean variable, which we set to true when we access the shared resource, and set to false when we finish; then, before we access it, we may check this variable. But, this naive method does not solve the problem: querying a variable and then modifying it is not atomic, you can be preempted after querying the variable and just before modifying it.

The correct solution was designed into the processors. Modern processors have some instructions that perform several steps, and still execute as atomic operations. For example, on i386 processors (modern Intel and AMD processors are their successors), there're xadd and cmpxchg instructions. xadd adds a specified value to the operand, and then returns its previous value. cmpxchg modifies the operand to a specified value if the operand was equal to another specified value. But, what makes those instructions so valuable for us is the ability of the processor to execute them atomically. That is, a thread can not be preempted in the middle so that xadd did only a part of the job.

So far, we have atomic operations that perform several steps; this gives us immunity against the threads scheduling mechanism. But, there's also another factor we must take into account: multi-processor systems. In machines where several processors (or other on-board devices) work with the same memory location, extra care must be taken to restrict simultaneous access to the same variable. Because of this fact, those atomic instructions we've discussed earlier should be executed with the lock prefix, which signals all other devices in the system that a specific memory location is being modified. This affects the performance (tens to hundreds of cycles, depending on the processor), but this is the only way to synchronize between processors. It may seem for some that multi-processor systems are something very special intended for 'monster' servers, not seen every day, but this is not true; today's modern home computers are already dual or quad processor, soon there won't be such a thing as a single-processor computer.

Such operations are called interlocked, those that restrict simultaneous access to a variable. And, they are the basis for all the synchronization mechanisms in all operating systems. (In kernel mode, there's a mechanism that guarantees the order of execution, but still the lock prefix is needed in multi-processor machines.)

Synchronization in Win32 API

In Win32, interlocked functions are available through the standard API (exported by kernel32.dll); you don't have to be an assembler guru to work with them. All of them have the InterlockedXXXX prefix. Examples are InterlockedIncrement, InterlockedDecrement, InterlockedExchangeAdd, and etc.

All this is very good, but let's get down from the skies back to our goats. How can all this help us to add a node to a shared linked list? Up until now, we haven't seen something like InterlockedAppendNodeToMyLinkedList. So, let's try to implement it using the functionality we have.

A simple way to do this is the following: initialize some variable, say nLockCount, by some known value, say 0. Next, at the beginning of the function, put the following loop:

while (InterlockedCompareExchange(&nLockCount, 1, 0));

The first caller of InterlockedCompareExchange will set the nLockCount to 1 and receive 0, all others will receive 1 without affecting nLockCount. Even if two threads, and even on different processors, enter this loop simultaneously (nearly), only one of them will receive 1. After the loop is performed, we can do whatever we want: add nodes to a linked list, reallocate buffers, everything. Does it mean that we can't be preempted by another thread? Of course, no. The system runs its scheduling mechanism as usual. But, if we're interrupted and some other thread calls our function - it will just hang on this loop. Until eventually our thread is resumed, finishes its work (whatever it is), and sets the nLockCount back to 0. After this point, one of the "waiting" threads (if there is one) will finish the loop when it is resumed, and so on.

We say that we've implemented a synchronization mechanism, and nLockCount is our shared synchronization object. A situation in which a thread attempts to lock an object that is already locked by another thread is called synchronization collision. Our mechanism detects such collisions, and prevents simultaneous access to the shared resource. Let's now analyze the advantages and drawbacks of what we've just done.

First, suppose, from within a locked section (between the locking loop and setting nLockCount back to 0), we call another function, which, in turns, wishes to lock the same resource. The locking loop will hang because the nLockCount is already 1, and there's no one that's going to unlock it. So, we have a deadlock. There're two things we can do about it: either always be aware about which synchronization objects we've already locked before calling your functions, or we can enhance our synchronization mechanism to remember which thread owns it, hence skipping lock/unlock if it is already owned by the calling thread.

(This doesn't mean we've solved the deadlock problem completely. Suppose a thread owns A and waits for B, whereas another thread owns B and waits for A. Deadlock problems may arise from incorrect general design, and they can't be solved by "enhancements" of the synchronization objects themselves. There're different models of how to design the whole application correctly, we'll skip the discussion about this because it's endless.)

Now, about the performance. Our synchronization object is very light in terms of the memory it consumes: just a single 32 bit variable, or two variables in case we want to remember which thread owns us. Locking/unlocking it is also fast. Well, not faster than the lock prefix can afford, but that's what we have. But, what is the cost of the synchronization collision for our method? On collision, we just wait in the loop until the object is unlocked, this is called spinning. This is OK if we're working on multi-processor systems and usually locking our object for small time durations, so that after several attempts, the thread is likely to acquire the lock. But, in other cases, we have a situation where the OS gives a processor time to a thread, and all this valuable time is just spent in spinning, without reasonable chance (on single-processor machines - without any chance) to succeed. Isn't it better to "tell" the OS somehow that we don't need the processor time right now?

This can be easily fixed. Modify the locking loop so that if you fail to acquire the lock - put Sleep(0), this will tell the OS that we don't need the rest of our processor time portion. Well, this helps, but the reality is even more cruel. Attaching/detaching a thread for execution, the so-called context-switch, is a complex operation for the OS, and it costs the processor time. Although we saved the rest of the processor time from obsolete spinning, still it would be better to let the system know preliminarily that our thread should not execute right now, to omit the context-switch to it. In order to implement this, there should be some agreed mechanism that informs the system about our synchronization pattern.

Win32 synchronization objects

We can make the OS aware of our synchronization intentions by using its synchronization objects. There're different types of these objects, the most common are events, mutexes, semaphores, and critical sections. They're created by CreateEvent, CreateMutex, CreateSemaphore, and InitializeCriticalSection, respectively. I will not get deeply into the details of every one of them, they are well documented in the MSDN. I'd just like to point to some of their key differences:

Only critical sections and mutexes "remember" which thread owns them. (By the way, that's the only difference between a mutex and the auto-reset event.) They are intended mainly for the situation we've described: per-thread locking of a shared resource. Other objects may be used for this purpose too; however, that's not their intended use.
All of them, except critical sections, are kernel objects (hence, you free them by CloseHandle). Below, we'll discuss in depth what it means for us.

To acquire kernel synchronization objects, there're the WaitXXXX (and MsgWaitXXXX) functions in the Win32 API. The most basic is WaitForSingleObject. If the object(s) is available, then it is acquired in-place, and the function returns immediately. Otherwise, the OS suspends the caller thread and gives the rest of its processor time to another one. Furthermore, the OS will not execute this thread unless the object(s) it's waiting for can be acquired.

It is also worth to note that the same kernel object may be accessed from different processes (applications), thus enabling synchronization between applications. (In fact, even non-kernel mode objects can be shared between processes using a shared memory mechanism.)

Sounds good. But unfortunately, we have the drawbacks too. Any operation on a kernel object is performed by the system in the kernel mode, leading to a kernel-mode transaction, which has significant implications. Even if the object is available and acquired in-place by WaitForSingleObject, calling this function too often can lead to a serious performance hit. Interlocked operations cost us from tens to hundreds of processor cycles, whereas every kernel-mode call, including WaitForSingleObject and ReleaseMutex, for example, cost thousands of cycles, so that on multi-processor systems for short-time locking - spinning may be a preferred way to go.

Critical Sections

Minding all the above, we introduce critical sections. It's a hybrid. When you attempt to lock it (call EnterCriticalSection), the idea is to perform the following steps:

Check if this thread already owns the critical section. If it does, the lock/unlock is omitted (skip the rest).
Attempt to lock a dedicated variable via the interlocked instruction (similar to what we've done). If the lock succeeds, return (skip the rest).
Optionally, retry the second step a number of times. This can be enabled by calling InitializeCriticalSectionAndSpinCount instead of InitializeCriticalSection. This step is always skipped on single-processor machines.
After we've tried all of the above, call a kernel-mode WaitXXXX function.

In such a way, we can achieve the highest possible performance: lock as fast as possible if no collision, and also suspend for as much as it takes if the critical section is busy.

Note that implementing this, however, is not as trivial as it may seem. For example: when unlocking, you must somehow know that someone is already waiting, to signal the kernel object for it. Also, depending on how exactly you implement it, the successful return of the WaitForSingleObject function may not mean yet that the critical section is acquired; you may need to repeat all the steps again.

And also, there's a choice here of performance vs. fairness: if there're threads currently suspended, then the critical section is eventually unlocked; but, before the system schedules one of those threads, comes another one. Should it immediately acquire the lock (performance), or just wait like all the others (fairness)?

In Windows API, the answer on this question is the performance priority (comes out from experiments). And, that's logical in my opinion: the whole idea of critical sections is to achieve maximal performance. Also, the WaitXXXX functions do not guarantee absolute fairness.

Also, in Windows, the kernel waitable object is created only upon the first collision, not straight at the beginning.

Sounds good so far. So, what can be optimized then?

Optimizations

Let's now see in-depth what we have already.

First of all, Windows interlocked functions are very fast. Furthermore, the Microsoft VC++ compiler supports intrinsic interlocked functions (see the underscored versions, _InterlockedXXXX in MSDN). Actually, I could not squeeze any more, even at the assembler level.

So, what's wrong with the standard Windows critical sections?

Let's look at the definition of the CRITICAL_SECTION C/C++ structure (defined in winnt.h). That's how it is defined:

typedef struct _RTL_CRITICAL_SECTION {
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;

    //
    //  The following three fields control entering and exiting the critical
    //  section for the resource
    //


    LONG LockCount;
    LONG RecursionCount;
    HANDLE OwningThread;        // from the thread's ClientId->UniqueThread

    HANDLE LockSemaphore;
    ULONG_PTR SpinCount;        // force size on 64-bit systems when packed

} RTL_CRITICAL_SECTION, *PRTL_CRITICAL_SECTION;

Some explanation about the members: LockCount is the parameter to the interlocked operations, OwningThread is the owning thread ID (although declared as HANDLE, it is a DWORD), RecursionCount specifies how many times the owning thread has acquired the critical section, SpinCount is the number of additional attempts that should be performed before going into the suspended state, and the LockSemaphore is the kernel object handle - however, it seems to be an auto-reset event rather than a semaphore.

In addition, there's a DebugInfo member, which is our first problem. In fact, when you call InitializeCriticalSection besides initializing the variables of the structure, the Windows API also allocates an additional structure and stores a pointer to it in the DebugInfo member. In such a way, Windows tracks all your critical sections. The question is, what is this for? It is stated that in such a way, Windows can automatically detect deadlocks in the program. In my opinion - stupid idea. At least, there has to be an option to initialize a critical section without this DebugInfo.

First of all, I've never seen this deadlock detection mechanism working, even when a deadlock really happened (should it be enabled somehow?). Also, tracking all the critical sections alone does not do the deadlock detection, since there may be other synchronization objects too. If so, the deadlock detection should rather be something implemented in the kernel mode. But anyway, there's no ultimate solution for this: an application can also use infinite spin-locks, it's not forbidden. In my opinion, deadlocks should be prevented from the beginning by correctly designing the application. That's something you may need in debug time only.

The CRITICAL_SECTION structure consists of six 32 bit members, but because of the DebugInfo, the real memory consumption is more than 24 bytes. RTL_CRITICAL_SECTION_DEBUG takes another 32 bytes, plus it seems to be allocated on heap, which makes memory consumption even larger. From my experiments, it's about 100 bytes.

Also, creating/destroying such critical sections is not zero time-consuming. Allocating/freeing memory on the heap takes some time, plus it needs additional synchronization.

This impact is insignificant for applications that work with some constant set of critical sections, but in some cases, you need to create/destroy them on-the-fly. That's where we have a performance hit.

Next, I noticed that calling EnterCriticalSection leads to a single interlocked instruction when there's no collision, and this is OK. What was a bad surprise for me was that calling LeaveCriticalSection also involves an interlocked instruction. This is a pity, because it is possible to implement a critical section so that unlock will not use the lock prefix at all.

I believe that this is so because of historical reasons, and up until now, no one has fixed it. On early 386 processors, there was no cmpxchg instruction. Instead, you could only use inc and dec with the lock prefix. You could not check the variable and modify it only if it equals to something in an atomic operation. Also, you could not retrieve its previous value (like xadd offers). Instead, you could only know if upon inc/dec, this variable became negative, positive, or zero, plus its parity (by testing the CPU flags). Note also that InitializeCriticalSection sets the LockCount member to -1. Why? Because, executing lock inc on it will set the ZR CPU flag to 1 for the first caller, others will get 0.

But, a real bad surprise for me was when I noticed that even recursive lock/unlock of the critical section leads to interlocked instructions. What for? You know what, I could sincerely forgive all of the above, but this barbarity crosses all the barriers. The only explanation for this I can ever imagine is that critical sections were designed in such a way, that you can enter it from one thread, and leave in another. But, what kind of a programmer would do such a thing? Critical sections (and mutexes) are especially designed to be a per-thread locking object.

Plus, there's no timeout option designed. You can either specify to wait forever, or try to lock without waiting at all (TryEnterCriticalSection). Why not implement it too, if it's not too complex?

Alternative implementation

Minding all of the above, I've decided to write my own critical section. I'll not get too deep into the code discussion, we've already discussed this plenty. It's possible (though a bit tricky) to figure out how it works. Just to note: implementing a timeout (without performance degradation) was not so trivial as it may seem for many.

Let's just point to some key features of it.

A critical section is encapsulated in a CritSectEx C++ object. On construction, you may set the maximal spin count for it (like InitializeCriticalSectionAndSpinCount). You can also change it later by the SetSpinMax method; even in the middle of the work, it's absolutely safe.

We've been told also that on collision, we create a kernel synchronization object silently, but you can also pre-create it from the beginning if you want, by AllocateKernelSemaphore. (This functionality is available in Windows, hence I decided to provide it too.) Again, it's safe to call it during the use.

You can use this critical section object in a traditional way: call Lock and Unlock on it; however, I wouldn't recommend that. Instead, it is better to use a Scope locking technique, something like this:

// in some place you want to use a critical section declared as myCs
{
    CritSectEx::Scope scope(myCs);
    // access the shared resource
    // ...

}
//Here the critical section has been unlocked

To use it with timeout, you can write something like this:

{
    CritSectEx::Scope scope(myCs, 500); // 500 milliseconds timeout

    if (scope)
    {
        // access the shared resource
        // ...

    }
}

In my opinion, this should the the correct way to work with critical sections. Cases where lock and unlock should be separated are pretty rare. And even there, the Scope gives you flexibility: you may call Lock and Unlock explicitly on the Scope variable. For example: you want to lock a CS in a function, and then unlock it somewhere after it returns. This can be done in the following way:

void SomeFunc(CritSectEx::Scope& scope)
{
    // ...

    // at some point acquire the lock

    scope.Lock(myCs);
    // ...

}
void OtherFunc()
{
    // ...

    // at some point call SomeFunc.

    CritSectEx::Scope scope; // empty constructor doesn't lock anything

    SomeFunc(scope);
    // the CS is still locked

    // ...

}
// The CS is finally unlocked when OtherFunc returns

The idea is to avoid situations where you 'forget' to unlock the CS. Why not make the C++ compiler do this job? It is responsible for tracking the lifetime of C++ objects and calling their destructors, when necessary. Hence, we only need to ensure the unlocking in the destructor, and make the lifetime of the Scope appropriate.

Anyway, you can also use explicit Lock and Unlock. However, it's a bit more complex than in the standard critical sections: Upon return, Lock passes you an extra parameter that you should later pass to Unlock. This is so because, I've omitted one extra parameter in my implementation, something similar to RecursionCount in the standard critical sections. This is justifiable.

Here, I give some numbers for performance comparison of my critical sections with the standard Windows ones. The results are pretty accurate, though every test in a multi-threaded environment depends on dozens of miscellaneous factors. Tests were performed on Intel® Pentium® 4 and Intel® Core™ Duo processors. Besides, we have two operating systems: Windows Server 2003 Standard Edition, and Windows XP Professional.

These are the results for the standard critical sections:

OS	CPU	Init + Uninit (cycles)	Lock + Unlock (cycles)	Recursive Lock + Unlock (cycles)	Memory consumption (bytes)
Server 2003	P4 D	977	250	138	100
Professional	P4 (earlier)	7660	401	388	100
Professional	Duo	6672	85	90	100

First of all, there's a clear difference between the two operating systems. On Windows XP Professional, init + uninit is extremely slow. Well, this has an explanation: on Windows XP, heap functions are implemented in the kernel mode, yet another barbarity. Besides, on Server 2003, recursive lock + unlock is optimized (a single interlocked operation).

We also see a difference between the processors. The Intel® Core™ Duo processor is really impressive. Although with a bit lower clock speed, it is a real monster (in fact - two monsters). Interlocked operations are greatly optimized, which is important for processors cooperation. (This is not an Intel advertisement; in fact, I haven't tested it on AMD processors, maybe they give even better performance.) Kernel mode transactions, however, are still slow.

Now, let's see our critical sections in action:

OS	CPU	Init + Uninit (cycles)	Lock + Unlock (cycles)	Recursive Lock + Unlock (cycles)	Memory consumption (bytes)
Server 2003	P4 D	5	116	16	16
Professional	P4 (earlier)	8	198	15	16
Professional	Duo	5	45	1	16

There's no significant difference between operating systems, which is expected. The only significant operation is the non-recursive lock + unlock, which is twice faster than what the standard critical sections offer. All other operations are negligible, in comparison.

And, again we see a clear advantages of the Intel® Core™ Duo processor. It may seem that with such a processor our optimization becomes less significant, but this is not accurate. Yes, it executes interlocked operations in less cycles, but there're many other instructions that it probably executes faster too, so the relative impact on the performance may be nearly the same.

Conclusion

Some will probably laugh. "What is this for?", "You will never be able to see the difference", "Better to use something standard", "Yet another bicycle inventor". I disagree, and there's no point to argue about this. Take into account that in a single cycle, modern processors execute several instructions, and you'll see that an interlocked operation may cost about a thousand (!!!) regular instructions. When you write a hi-performance application that should squeeze the maximum from the computer, you have to care about how to enable the CPU to give its maximum. In particular, sometimes, it is worth to use a per-thread heap (without serialization), or even consider non-thread-safe reference counters for objects unless they are really referenced in different threads. Some believe that writing a program in a multithreaded way already makes it faster. Not true, my friend, improper design kills the performance totally, it'll work even slower than if it would be single-threaded.

Believe it or not, but once I managed to optimize a server so that it became 30 times faster by just removing/caching heap allocations, omitting kernel-mode transactions, and etc. No one could believe me until I demonstrated the numbers. That's the sad truth: our computers are super fast; unfortunately, we don't usually use them correctly.

Back to our critical sections. Well, not a breakthrough in most of the cases, I admit. But:

There're situations where the impact may be significant.
Nothing to lose anyway, we have no disadvantages compared to standard critical sections.
Timeout ability.

Well, some will probably value the timeout (unless those who are already decided to work with mutexes because the standard critical sections lack timeout, and see no significant benefits in critical sections).

I will appreciate comments, both positive and negative. But, I say it again: please don't try to convince me that there's no significant performance difference.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

valdok

Software Developer (Senior)

Israel

My name is Vladislav Gelfer, I was born in Kiev (former Soviet Union), since 1993 I live in Israel.
In programming I'm interested mostly in low-level, OOP design, DSP and multimedia.
Besides of the programming I like physics, math, digital photography.

This is a good comment, thanks. Whereas I disagree with many of your suggestions - some are good, and one is even critical.

1. I was not familiar with the Hand-off mutex problem. After your comment I took some time for reading about it. As far as I understood this is the scenario: a thread that releases a synchronization object immediately 'passes' it to the already-waiting thread, even if it has lower priority. Than if the thread wants it back - it have to wait for the lower-priority thread to resume, unlock the mutex/cs, and the context-switch back to our thread.

This is not an issue in my implementation. The Unlock does not pass the ownership of the critical section to anyone. It does notify the waiters (if there're any) that the critical section has just been released by setting the event, but that's it. The critical section is not locked by any thread, successful return of WaitForSingleObject doesn't grant access to the critical section: the PerfLock than attempts to lock as usual, as you can see. The first caller of PerLockImmediate wins. In the scenario where your thread releases a critical section and after a short period of time tries to lock it - most likely it will succeed, despite already-waiting threads. It won't even go into kernel mode, which is also a great advantage.

2. About the spinning. You tell to place a call to SwitchToThread between locking attempts. No Way! You forget that this is a kernel-mode function. And if you call one - why not just call WaitForSingleObject and wait in a regular way until the cs is released ? (I've already explained this in another comment)
I think you just misunderstand the whole idea of spinning in critical sections. The idea is to avoid kernel-mode transactions as much as possible, otherwise you don't need the critical section at all, use mutexes instead. Spinning is something that improves performance on MP systems for critical sections that are held for short durations. The spin count should be relatively small compared to kernel-mode transactions, so that even if the spin-lock doesn't succeed - there's no big performance degradation. If it does succeed however - you save the whole kernel mode call, which is a significant gain.

3. pause x86 instruction. Didn't know this, thanks. As far as I understood it helps to improve the pipelining on hyper-threaded processors. Since a single hyper-threaded processor 'runs' several logical threads - it is good to hint that our thread is currently 'waiting' without involving the kernel.

But IMHO - the advantage is minor in our case. I'll explain my point:
In kernel mode there's no such a thing as critical sections. You don't need such a hybrid there to save kernel-mode calls, since there're no kernel-mode transactions. Another difference is that kernel-mode threads that run on a raised IRQL can not go into suspended state. By design they must not allow lower-IRQL threads to preempt them, and higher-IRQL threads will preempt them automatically. Because of this the main synchronization function in kernel is KeAcquireSpinLock, which is infinite spin-lock.

That's where the pause instruction is effective I think, on long-duration spin-lock, where a probability of a single spin to succeed is low. Hence we'd like to simulate suspending of the caller thread to give some priority to the second 'virtual processor'.
For our critical sections however spin-locks are short, I woudldn't put more then 10-30 cycles. And IMHO it is questionable wether it really worth to put pause after unsuccessful lock attempt. Anyway we're going to enter the long-duration anabiosys (WaitForSingleObject) very soon.

4. You insist on modifying PerfLockImmediate to check the m_nLocker before calling _InterlockedCompareExchange since "it is very heavy on modern architectures". BTW another reader has already suggested it. I can agree with this just in a single place: when you call Lock with timeout=0 (TryEnterCriticalSection), to get the negative result faster. In all other cases I wouldn't do this. Oh yes, interlocked operations are heavy (I repeated this in the article many times), but kernel-mode transactions are much, much heavier. And this is the point you seem to miss completely. If extra call to _InterlockedCompareExchange helps to avoid a kernel-mode transaction in 5% of the cases - that's what you do.
Anyway I'm going to fix the article to check the m_nLocker when called with timeout=0.

5. About the memory barriers. Here you seem to be right.
Before reading your comment I was pretty sure that 'memory barriers' is something applicable for compile-time only, hence placing either _ReadWriteBarrier() or declaring the variable volatile is enough. Now I took time reading about this and discovered that I was wrong: a processor is allowed to reorder some memory read/writes under some circumstances. And you found the weak point in my implementation: It is PerfUnlock that should set both m_dwOwner and m_nLocker to 0 in the strict order. If the order is reversed - the CS is unlocked (m_nLocker is 0), eventually another thread acquires it and sets m_dwOwner to its ID, than this value is overwritten to 0 by the first processor. Than if the owning thread will call Lock again - it doesn't recorgnize the recursive lock and we have a deadlock.

Right point, thanks. So, you insist on putting the memory fence here. But I think I have a better solution for this. Putting the memory barrier probably decreases the performance (like lock perfix does). But that was my point: to release critical section quickier then standard windows API does, without lock semantics.

And fortunately I know how to do this: In my implementation there're two variables that demand careful access: m_nLocker and m_dwOnwer. How about merging them ? That is, instead of calling _InterlockedCompareExchange with 1 - call it with the caller thread ID, the thread ID can not be 0, any other value fits our needs. And you don't need another variable. In such a way the problem you've described is solved, the Unlock remains fast, and the critical section object is shrinked: 4 32bit variables instead of 5.
BTW I thought about this before, but only about shrinking the object. Now, after I realize there's a problem with the current implementation - I'll implement the new one ASAP.

6. About m_bWantEvent. You may not reset it in the scenario you describe, because there may be multiple waiters. That is, you reset it, on the Unlock you do nothing, and the already-waiting threads remain in the suspended state. That's why there's only a single point in the code where m_bWantEvent is set to false: exactly before you set the event.

You're right, this implementation does require some extra calls to SetEvent/WaitForSingleObject. But I believe this impact is minor for the overall performance.

Conclusion: All you've stated made me think (which is already good). Most of the accusations are inapplicable IMHO, but the gotcha with the memory barrier is the right point, I'll rewrite the implementation ASAP. I'll also try to solve the extra SetEvent/WaitForSingleObject upon collisions, it is possible (though a bit more complex) to use interlocked waiters counter with a semaphore instead of just a boolean and the auto-reset event.

I'd like to see your comment about the new implementation.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.