
Memory Barriers and Thread Synchronization

14 Mar 2019 · MIT License · 2 min read

We will jump straight to the code. This innocent-looking little program has a major issue (when compiled as a release build with optimizations on my Mac using GCC, Apple’s Clang, and LLVM, as well as on Windows using Visual Studio 2017, and run on a multicore machine). Can you spot the problem?

C++
#include <iostream>
#include <thread>
using namespace std;

int main(int argc, char** argv)
{
	bool flag = false;
	
	thread t1([&]() {
		this_thread::sleep_for(100ms);
		cout << "t1 started" << endl;
		flag = true;
		cout << "t1 signals and exits" << endl;
	});
	
	thread t2([&]() {
		cout << "t2 started" << endl;
		while(flag == false) ;
		cout << "t2 got signaled and exits" << endl;
	});
	
	t1.join();
	t2.join();
	
	return 1;
}

That’s right! It will never terminate! It will hang forever! The while loop in t2 will never break. But why? Thread t1 sets flag to true, after all. Yes, but it does so too late (notice the 100ms sleep). By that point, thread t2 has already cached flag, for example in a register or its core’s L1 cache, and will never see the updated value: without any synchronization, the optimizer is allowed to read a plain bool once and assume no other thread changes it. If you think that making flag volatile will help, you’re wrong. It may work on your compiler and machine, but it is no guarantee. Now what?
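To build intuition, here is roughly what the optimizer is allowed to turn t2’s busy loop into. This is only an illustrative sketch, not actual compiler output, and wait_for is a hypothetical stand-in for t2’s lambda body:

C++
// Sketch: flag is a plain bool and nothing inside the loop body can change it,
// so the compiler may hoist the load out of the loop and spin on a stale copy.
void wait_for(bool& flag)       // hypothetical stand-in for t2's lambda
{
	bool cached = flag;         // flag is read once...
	while (cached == false)     // ...and never re-read
		;
}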

This was one of the hardest lessons in C++ and computer science for me. Before continuing to the fix section, I highly recommend you read about the following: memory barriers, the C++ memory model (as well as C++ Memory Model at Modernest C++), and memory ordering. I’ll see you in a couple of days. 😉

The Fix

The simplest fix is to wrap every access to flag in a mutex lock/unlock, or to make flag an atomic<bool> (both of those solutions insert the appropriate memory barriers for you). But that’s not always an option for other data types…
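For reference, here is a minimal sketch of the atomic<bool> variant (the mutex variant would instead guard every read and write of flag with a lock_guard):

C++
#include <iostream>
#include <atomic>
#include <thread>
using namespace std;

int main()
{
	atomic<bool> flag{false};          // sequentially consistent by default

	thread t1([&]() {
		this_thread::sleep_for(100ms);
		flag = true;                   // atomic store, visible to t2
	});

	thread t2([&]() {
		while (flag == false) ;        // atomic load on every iteration
	});

	t1.join();
	t2.join();
	return 0;
}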

We need to make sure that t2 can see the write that t1 performs later in time. For this, we need to force synchronization of flag between the different CPU cores (and keep the compiler from caching it). We can do it in three ways:

  1. By inserting memory barriers in the right places
  2. By inserting loads and stores of an atomic variable using release/acquire semantics
  3. By inserting loads and stores of a dependent atomic variable using release/consume semantics

Below is the corrected version of our example; uncomment one #define at a time to enable each of the fixes:

C++
#include <iostream>
#include <atomic>
#include <thread>
using namespace std;

// Uncomment exactly one of these to choose a fix:
//#define ATOMIC_FENCE
//#define ATOMIC_RELEASE
//#define ATOMIC_CONSUME

#if defined ATOMIC_FENCE
// Fix 1: standalone memory barriers (fences) around the plain flag
#define FENCE_ACQUIRE atomic_thread_fence(memory_order_acquire)
#define FENCE_RELEASE atomic_thread_fence(memory_order_release)
#elif defined ATOMIC_RELEASE
// Fix 2: store/load of an atomic variable with release/acquire semantics
atomic_bool f{false};
#define FENCE_ACQUIRE f.load(memory_order_acquire)
#define FENCE_RELEASE f.store(true, memory_order_release)
#elif defined ATOMIC_CONSUME
// Fix 3: store/load of a dependent atomic variable with release/consume
// semantics (the stored value depends on flag)
atomic_bool f{false};
#define FENCE_ACQUIRE f.load(memory_order_consume)
#define FENCE_RELEASE f.store(flag, memory_order_release)
#else
// No fix selected: both macros expand to nothing (original, broken behavior)
#define FENCE_ACQUIRE
#define FENCE_RELEASE
#endif

int main(int argc, char** argv)
{
	bool flag = false;

	thread t1([&]() {
		this_thread::sleep_for(100ms);
		cout << "t1 started" << endl;
		flag = true;
		FENCE_RELEASE;                        // publish the write to flag
		cout << "t1 signals and exits" << endl;
	});

	thread t2([&]() {
		cout << "t2 started" << endl;
		while(flag == false) FENCE_ACQUIRE;   // re-synchronize flag on every iteration
		cout << "t2 got signaled and exits" << endl;
	});

	t1.join();
	t2.join();

	return 1;
}
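To try a particular fix, uncomment its #define (or pass it on the command line) and rebuild with optimizations, just like the broken version. For example, assuming g++ on Linux and a source file named main.cpp, something like g++ -std=c++14 -O2 -pthread -DATOMIC_RELEASE main.cpp should do; adjust the flags for your own compiler.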

License

This article, along with any associated source code and files, is licensed under The MIT License

