Hi,
Following is a simple stress test that crashes after a few seconds. Am I missing something here?
#include "stdafx.h"
#include "LockFreeQ.h"
#include <iostream>
using namespace std;

MSQueue< int > Q;
int nReadThreads = 1;
int nWriteThreads = 1;

DWORD WINAPI ReadThreadFunc(LPVOID p)
{
    while (true)
    {
        int i;
        Q.dequeue(i);
    }
}

DWORD WINAPI WriteThreadFunc(LPVOID p)
{
    int i = 0;
    while (true)
    {
        Q.enqueue(i++);
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    for (int i = 0; i < nWriteThreads; i++)
    {
        CreateThread(NULL, 0, WriteThreadFunc, NULL, 0, NULL);
    }
    for (int i = 0; i < nReadThreads; i++)
    {
        CreateThread(NULL, 0, ReadThreadFunc, NULL, 0, NULL);
    }
    int n;
    std::cin >> n;
    return 0;
}
Hagai.
---
Hi!
IMHO this stresses the memory manager: the crash is in the 'new' call.
WriteThreadFunc needs the following change:
DWORD WINAPI WriteThreadFunc(LPVOID p)
{
    int i = 0;
    while (true)
    {
        Q.enqueue(i++);
        Sleep(1);
    }
}
---
What must I change to make this code compatible with .NET 1.1?
I mean, without generics.
thank you
nidal
---
Regarding the Michael And Scott lock free queue:
Separating the CAS into separate operations is definitely a bug. In Scott's sample implementation, he shows a CAS function written in assembly that uses a single-word compare/swap. In his proof-of-concept code he mocks memory allocation by using a short as his pointer and another short as his counter. His CAS is implemented as an sc (store conditional) MIPS instruction written as inline assembly. This is necessary because the allocator can (and will) recycle memory addresses. If you could assume that the allocator will never, for the life of the entire queue, recycle an address, then the pointer tag would not be necessary. Realistically, that is not practical. So the pointers are tagged to ensure that if one node was dequeued (and possibly re-enqueued at the same memory address), the algorithm will skip that node. With this implementation, the ABA problem can still occur, because it's possible to atomically write the pointer to memory while its tag lags behind by a few instructions. Not to mention that copying the objects from scope to scope can also be racy.
What you need to do is use a double-width compare/swap, which is available on Vista and the Xbox 360 as InterlockedCompareExchange64. On the Xbox 360 it's implemented using locked reads/writes, and on 32-bit Windows it's implemented as cmpxchg8b (yes, the infamous F00F-bug instruction). The downside to using InterlockedCompareExchange64 is that you must cast the pointer/tag pair to a LONG64, which is technically undefined behavior. This can cause a failure if the memory isn't aligned on an 8-byte boundary, which is the natural alignment requirement for a LONG64. I've used a struct pair in a union with a LONG64, and that seems to do the trick. If you don't use a union, you'll need __declspec(align(8)) to ensure that the compiler aligns the pair of values properly. Also, if InterlockedCompareExchange64 isn't available, you've got to roll your own in assembly. If you do that, it's rather simple, but be sure to follow the compiler's ABI.
Note also that alignment matters at run time. On most architectures (such as the PPC-based Xbox 360), the processor sets an exception flag when unaligned memory is accessed and the access causes a bus error; in my case the console just locks up, the game quits, and everybody stops having fun. On Intel it's a little different: the processor sets the exception flag, but then the exception is handled by replacing the single read/write instruction in the pipeline with a series of instructions to correct the erroneous bus transaction. Once that happens, the cmpxchg8b instruction is no longer atomic and we're still left with a race condition.
modified on Monday, April 7, 2008 9:10 PM
---
Thanks for the comment! At some point, I'll take the time to correct my article. I appreciate your reading and correcting me.
Idaho Edokpayi
---
Hello,
I am still not sure that the following code is really safe:
pointer_t(const pointer_t* p) : ptr(NULL), count(0)
{
    if (NULL == p)
        return;
    InterlockedExchange(&count, const_cast< LONG >(p->count));
    InterlockedExchangePointer(ptr, const_cast< node_t* >(p->ptr));
}
Imagine your source pointer being completely modified between the two Interlocked lines. The count you have is then not consistent with the current one.
then when in your code you have:
...
// Read Tail.ptr and Tail.count together
pointer_t tail(Tail);
...
You may be interrupted between the two InterlockedExchange calls of your copy constructor. As Tail is a reference, you may end up with tail (the copy) having the count of Tail at moment T0 and the pointer of Tail at moment T1 (different moments, possibly different pointer_t values...).
If the Tail values have different counts it will be detected, but... with no luck, the Tail values at the various moments may have the same count.
I think after that the queue might be messy.
I have come across an InterlockedExchange64 function that might help solve this issue: concatenate both pointer and counter into a 64-bit value, and access it atomically!
Hopefully InterlockedExchange64 is lock-free (I am not sure).
Finally, I may have not gone deep enough to find out my assumptions are false, and that this implementation is correct.
Thanks for the work and for any answer.
Sincerely,
Didier HARRANG
---
You could be right. But what of a 64-bit implementation?
I think I prefer taking a copy early in the constructor and then only making the exchange if the pointer is unchanged. Thanks for the comment!
Idaho Edokpayi
---
The problem is that what is pointed to can change, but the address does not. Assume Q has a few nodes already; then:
pop invoked (thread A) // gets node pointer X, but suspends before the CAS
pop invoked (B)
pop returns (B) // node pointer X deleted
push entered (B) // node pointer X' allocated (the allocator uses the exact same address as X -- why not?)
push returns (B) // X' now enqueued
B sleeps
A wakes up, sees that X' == X, returns X'
and FIFO is not satisfied.
The tags (counters) and the pointers have to be compared and swapped at the same time. Typically a double-word CAS is used, but there are schemes to pack the tags into the low-order bits of the pointer values too. A 128-bit CAS is not part of the MSVC intrinsics, but that doesn't mean that a) it isn't supported by hardware, so you could write assembly to get at it, or b) that incorrect code must be written.
A small note: if your MSQueue is deleted before all items have been popped, you will have a memory leak. The dtor should have something like T v; while(deque(v)); in it to drain off all the items, if there are any.
The reason for selecting different allocators isn't strictly performance. Allocators can align the nodes on cache-line boundaries, substantially reducing false sharing (and thus contention); they can make room for those tag values; and, since the nodes are all the same size, making a lock-free free-list allocator is very easy. Even a system allocator with fast locks is subject to convoying, priority inversion, and other nasty things.
Lance
---
The article mentions the research in a 1996 paper by Maged M. Michael and Michael L. Scott on non-blocking and blocking concurrent queue algorithms, and states that ...
In fact, their queue implementation is now part of the Java concurrency library.
Where are you getting this fact from? It frankly does not appear defensible. The principal author/architect of the Java concurrency library is Doug Lea at Oswego (see http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html), not Maged Michael or Michael Scott.
It's true that Michael Scott won the 2006 "Edsger W. Dijkstra Prize in Distributed Computing" for work he did in a 1991 paper which describes a mutual exclusion lock based on a queue. The 1991 paper is different from the 1996 paper that you cited. It's also true that his mutual exclusion lock forms the basis for monitor locks used in Java VMs. See http://www.podc.org/dijkstra/2006.html.
But a mutual exclusion lock is very clearly a locking technique, not a lock-free technique. And the paper you cited is not the one that forms the basis for monitor locks in Java.
So, I don't understand the statement in the article, to the effect that the 1996 paper is somehow the basis for the entirety of the Java concurrency library.
Please explain further.
PS: Professor Scott is a remarkable person with many achievements, which you can see from his University of Rochester website: http://www.cs.rochester.edu/~scott/. With my question above, I intend no disrespect for him or any of his accomplishments. I simply seek justification for a fairly sweeping comment.
---
I cannot find my original research which led me to believe that they themselves wrote the queue implementation in the concurrency library. I've been busy since then. However, this is what the concurrency library page said: "This implementation employs an efficient 'wait-free' algorithm based on one described in Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms by Maged M. Michael and Michael L. Scott." Re-reading what I said, I do not understand how you would conclude that I was saying the paper was the basis for the entirety of the Java concurrency library. Your quote of what I said explains pretty clearly what I believed: "...their queue implementation is now part of the Java concurrency library."
Idaho Edokpayi
---
In the ArrayQ::Resize() function (C++ file), what is the purpose of the CRITICAL_SECTION?
It's being created on the stack, inside the local scope of the Resize() function. Thus, every single thread that calls the function will get its own copy of a different critical section. As such, it cannot serve any possible thread-synchronization or data-protection functions.
Can you shed any light on this?
Mike
---
I've since revised the code and removed the ArrayQ class since I believed the queue based on the paper was a stronger and more defensible implementation. I actually don't recommend the ArrayQ class. I am also planning a fix for the performance problem in the C# queue.
Idaho Edokpayi
---
You avoided responding to the question. The question is not about whether ArrayQ is stronger than the other implementation; it is about a conceptual mistake in the implementation of ArrayQ. Why not simply accept the error?
---
Looking back (way back), the ArrayQ wasn't that great an implementation. I noticed the error and decided that the other algorithm was more likely to produce the results I was seeking, so it would have been a waste of my time and the reader's time to fix code that shouldn't have been used at all. I acknowledge that there are still weak points. My intention is to fix them, but I haven't had the time lately.
Idaho Edokpayi
---
It works; however, performance suffers because of the consistency check.
The problem is that the consistency check fails often under high read/write contention, because multiple Interlocked calls cannot synchronize a single state (e.g., the second Interlocked call, attempting to keep the state consistent through the count variable, is causing the problem).
To show the problem, I ran a simple test on a four-core machine using two threads to populate the queue with integers (2 x 1M) and two threads consuming from the queue. I created a second TestQueue class backed by a LinkedList<T> using monitor locking.
MSQueue averages 2.8 seconds
TestQueue averages 0.54 seconds
To run MSQueue, define USEMSQUEUE. To run TestQueue, do not define USEMSQUEUE
///////////////////////////////////////////////////////////////////////
#define USEMSQUEUE
using System;
using System.Collections.Generic;
using System.Threading;

namespace lfq
{
    public class MSQueue<T>
    {
        public class node_t
        {
            public T value;
            public pointer_t next;

            /// <summary>
            /// default constructor
            /// </summary>
            public node_t() { }

            public override string ToString()
            {
                return string.Format("node_t(value={0}):{1}", value.ToString(), next.ToString());
            }
        }

        public struct pointer_t
        {
            public long count;
            public node_t ptr;

            /// <summary>
            /// copy constructor
            /// </summary>
            /// <param name="p"></param>
            public pointer_t(pointer_t p)
            {
                ptr = p.ptr;
                count = p.count;
            }

            /// <summary>
            /// constructor that allows caller to specify ptr and count
            /// </summary>
            /// <param name="node"></param>
            /// <param name="c"></param>
            public pointer_t(node_t node, long c)
            {
                ptr = node;
                count = c;
            }

            public override string ToString()
            {
                return string.Format("pointer_t(count={0}, ptr={1})", count, ptr == null ? "null" : ptr.ToString());
            }
        }

        public pointer_t Head;
        public pointer_t Tail;

        public MSQueue()
        {
            node_t node = new node_t();
            Head.ptr = Tail.ptr = node;
        }

        /// <summary>
        /// CAS
        /// stands for Compare And Swap
        /// Interlocked Compare and Exchange operation
        /// </summary>
        /// <param name="destination"></param>
        /// <param name="compared"></param>
        /// <param name="exchange"></param>
        /// <returns></returns>
        private bool CAS(ref pointer_t destination, pointer_t compared, pointer_t exchange)
        {
            if (compared.ptr == Interlocked.CompareExchange(ref destination.ptr, exchange.ptr, compared.ptr))
            {
                Interlocked.Exchange(ref destination.count, exchange.count);
                return true;
            }
            return false;
        }

        public bool deque(ref T t)
        {
            pointer_t head;

            // Keep trying until deque is done
            bool bDequeNotDone = true;
            while (bDequeNotDone)
            {
                // read head
                head = Head;
                // read tail
                pointer_t tail = Tail;
                // read next
                pointer_t next = head.ptr.next;

                // Are head, tail, and next consistent?
                if (head.count == Head.count && head.ptr == Head.ptr)
                {
                    // is tail falling behind
                    if (head.ptr == tail.ptr)
                    {
                        // is the queue empty?
                        if (null == next.ptr)
                        {
                            // queue is empty, cannot dequeue
                            return false;
                        }
                        // Tail is falling behind. try to advance it
                        CAS(ref Tail, tail, new pointer_t(next.ptr, tail.count + 1));
                    } // endif
                    else // No need to deal with tail
                    {
                        // read value before CAS otherwise another deque might try to free the next node
                        t = next.ptr.value;
                        // try to swing the head to the next node
                        if (CAS(ref Head, head, new pointer_t(next.ptr, head.count + 1)))
                        {
                            bDequeNotDone = false;
                        }
                    }
                } // endif
            } // endloop

            // dispose of head.ptr
            return true;
        }

        public void enqueue(T t)
        {
            // Allocate a new node from the free list
            node_t node = new node_t();
            // copy enqueued value into node
            node.value = t;

            // keep trying until Enqueue is done
            bool bEnqueueNotDone = true;
            while (bEnqueueNotDone)
            {
                // read Tail.ptr and Tail.count together
                pointer_t tail = Tail;
                // read next ptr and next count together
                pointer_t next = tail.ptr.next;

                // are tail and next consistent
                if (tail.count == Tail.count && tail.ptr == Tail.ptr)
                {
                    // was tail pointing to the last node?
                    if (null == next.ptr)
                    {
                        if (CAS(ref tail.ptr.next, next, new pointer_t(node, next.count + 1)))
                        {
                            bEnqueueNotDone = false;
                        } // endif
                    } // endif
                    else // tail was not pointing to last node
                    {
                        // try to swing Tail to the next node
                        CAS(ref Tail, tail, new pointer_t(next.ptr, tail.count + 1));
                    }
                } // endif
            } // endloop
        }
    }

    class TestQueue<T>
    {
        public void enqueue(T item)
        {
            lock (_list)
            {
                _list.AddFirst(item);
            }
        }

        public bool deque(ref T item)
        {
            LinkedListNode<T> last = null;
            lock (_list)
            {
                last = _list.Last;
                if (last != null)
                {
                    item = last.Value;
                    _list.Remove(last);
                }
            }
            return last != null;
        }

        LinkedList<T> _list = new LinkedList<T>();
    }

    class Program
    {
        const int CHECKSIZE = 64;
        const int CONSUMERS = 2;
        const int PRODUCERS = 2;
        const int PRODUCESIZE = 1000000;

#if USEMSQUEUE
        static MSQueue<int> queue = new MSQueue<int>();
#else
        static TestQueue<int> queue = new TestQueue<int>();
#endif
        static bool consumersDone = false;

        static void Main(string[] args)
        {
            Thread[] consumers = new Thread[CONSUMERS];
            Thread[] producers = new Thread[PRODUCERS];

            System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
            sw.Start();
            Start(producers, producer);
            Start(consumers, consumer);
            Join(producers);
            consumersDone = true;
            Join(consumers);
            sw.Stop();
            Console.WriteLine("{0:0.0000} seconds", sw.Elapsed.TotalSeconds);
        }

        static void Join(Thread[] threads)
        {
            foreach (Thread t in threads)
            {
                t.Join();
            }
        }

        static void Start(Thread[] threads, ThreadStart method)
        {
            for (int i = 0; i < threads.Length; i++)
            {
                threads[i] = new Thread(method);
                threads[i].Start();
            }
        }

        static void producer()
        {
            for (int i = 0; i < PRODUCESIZE; i++)
            {
                queue.enqueue(i);
            }
        }

        static void consumer()
        {
            int n = 0;
            for (; ; )
            {
                if (queue.deque(ref n))
                {
                }
                else if (consumersDone)
                {
                    break;
                }
            }
        }
    }
}
---
I'll look into it.
Idaho Edokpayi
---
"Lock free" algorithms are riddled with patents. Everyone considering using them should be aware of this fact.
---
Do you know where to find info on such patents?
Idaho Edokpayi
---
The Wizard of Doze wrote: "Lock free" algorithms are riddled with patents.
Has someone claimed a patent on the basic pattern?
- Latch old pointer to current object
- Create new version of object using information at latched pointer
- Compare-and-swap the current object pointer to the new one if it was equal to the old one
- If the CAS operation failed, repeat
Many lock-free algorithms are formed by taking long-standing existing algorithms and wrapping them in that fashion, so any patent on such an algorithm would be void due to obviousness.
---
You can't patent an algorithm. It's an idea, which isn't patentable.
---
Just like other lock-free data structures found on the web, the memory allocation/de-allocation in enqueue/dequeue will perform locking during the operation. Therefore the MSQueue is not lock-free unless some lock-free memory allocation scheme is applied. An "almost" lock-free free-list memory allocator (the first few allocations will touch the global heap anyway) should not be hard to implement based on the code provided in ArrayQ and MSQueue.
Happy coding.
---
I'll revise the article to note that to achieve complete "non-blocking" status the code must use non-blocking memory allocators. However, I did some research into custom memory allocation, and what I found is that while it is easy to outperform early memory allocators, many newer memory allocators easily outperform all but the absolute cleverest implementations, including the default memory allocator for Windows Vista. See this guy's blog; he was able to remedy it. So, I suggest interested readers substitute his code if they want a truly wait-free algorithm. My problem is that although the algorithm is wait-free and outperforms the default allocator, does it still beat a library such as HeapLayers or Hoard? There are high-performance memory allocation libraries out there that could possibly not be wait-free but faster still. In the end, we all want speed and ease of implementation. I suggest using HeapLayers or Hoard before his library and accepting one or two locks. Your code will be faster and more stable, and no one will care that the architecture is not pure "wait-free".
Idaho Edokpayi
---
I think you are missing a point: Michael and Scott's algorithm, as expressed in pseudocode, is wait-free; however, its many implementations in "real" languages are not wait-free.
The problem here is that new_node() (with an underscore '_') is often incorrectly replaced with new node() (with a space ' ').
Here is what Michael and Scott write in their paper (part 2, second paragraph):
"We use Treiber's simple and efficient non-blocking stack algorithm [21] to implement a non-blocking free list." However, the pseudocode does not mention free lists except in the comment on line E1: # Allocate a new node from the free list. I think this is the key: the algorithm needs a pre-allocated free list, implemented as Treiber's stack, in order to be wait-free.
---
Just like other lock-free data structure found on the web, the memory allocation/de-allocation in enqueue/dequeue will perform locking during the operation.
Does the .NET allocator used when creating a new object use locking rather than spin-locking? If so, does it contain measures to prevent priority inversion and other such nastiness?
---