Saturday, May 31, 2014

Volatile, Memory Barriers, and Load/Store Reordering

Introduction

Volatile is a memory access keyword in several languages that is much more complicated and confusing in practice than it appears to be. There are pieces of information about volatile and related concepts everywhere, but it's hard to build a complete understanding from them. Let's first understand the underlying concepts.

Some Memory Concepts

Load/Store

This is the more formal terminology for memory read/write. Load means bringing something from main memory (RAM) into a processor register, and store is the same thing in the other direction.

Load/Store Omission by the Compiler

Not every memory read/write in a program's source code has to result in a processor instruction that does an actual load/store. Compilers can decide to cache values instead of actually loading them from memory when they think it doesn't change the results of the program. This can cause bugs in multithreaded programs and in programs with 'unusual memory'.

'Unusual Memory'

Unusual memory refers to a memory location whose value can change without the compiler having any way to know it. Hardware mapped memory is such a case. If a sensor device is updating a memory location on its own, the compiler can't know about this and may incorrectly omit loads from that memory location, assuming the value will never change.

Load/Store Reordering by the Compiler

Compilers can reorder loads and stores in a program for optimization purposes, as long as the reordering won't change the program's results. This logic can break down in multithreaded programs and cause buggy behavior.

Out of Order Execution and Cache Coherency

Similar to compilers reordering loads/stores, CPUs can do the same via 'out of order execution'. This means that a CPU might execute instructions in a slightly different order than they appear in memory, which includes instructions getting reordered and interleaved execution of two or more instructions. Interleaved execution means instructions aren't strictly ordered in their execution, i.e. one instruction might begin executing before the 'already in progress' instruction has finished.

Also, when a CPU stores a value to a memory location, other CPUs may not see the stores in the same order because of cache coherency issues, i.e. each processor having its own cache of main memory.

These issues create buggy behavior in multithreaded programs. To prevent this, programmers can use memory barriers, which are special instructions telling the CPU not to do out of order execution across the barrier instruction. Memory barriers also ensure cache consistency, which means we won't get hurt by cache coherency problems.

Atomic Load/Store

Loading OR storing a primitive value from main memory may or may not be atomic. Don't confuse this with 'read, modify, write' atomicity, which can only be achieved either by proper locking or by special 'interlocked' instructions. On certain hardware platforms loading an integer may not be atomic, i.e. some bytes of the integer may be modified while we're reading it, which means we can end up with a value the integer never held at any point in time. Naturally aligned, word-sized (32-bit) load/store operations on the x86 architecture are atomic.

Volatile in C/C++ vs. Java and C#

Volatile has different meanings in different programming languages. Let's see these differences.

C/C++

Volatile provides:
  • When a volatile variable is loaded/stored, it will actually be loaded/stored; the load/store won't be cached or omitted. Code reading a volatile variable will always see the latest value.
  • Loads/stores of volatile variables won't be reordered relative to each other, i.e. they'll happen in the same order they appear in the source code.
Volatile doesn't provide:
  • Atomic load/store of the volatile variable. For example, if the volatile variable is bigger than the word size, or it's not properly aligned on x86 hardware, load/store operations on it may not be atomic. Again, this is different from read, modify, write atomicity, which is an altogether different problem.
  • A guarantee that non volatile load/store operations won't be reordered around the volatile loads/stores. The compiler can place a non volatile store after a volatile store even if the non volatile store comes first in the source code. This kind of ordering guarantee is called the 'happens before' relationship, and the volatile keyword doesn't provide it in C/C++.
  • Memory barriers around volatile loads/stores. The CPU can still reorder instructions around loads/stores of volatile variables, and there can still be CPU cache inconsistencies.
Because of these issues, volatile is only recommended for memory mapped I/O (i.e. unusual memory) in C/C++. If our only concern is seeing the latest value set by a hardware device or some other process we're sharing memory with, volatile does the job.

Java/C#

Volatile provides:
  • Guarantee that load/store operations won't be cached/omitted. Code reading a volatile variable will always see the latest value.
  • Guarantee that load/store operations of volatile variables won't be reordered, including the non volatile loads/stores around them. Java only gained this guarantee starting with Java 5. We'll see why this is important in the case study.
  • Atomic load/store of the volatile variable, i.e. loads/stores of a volatile variable won't be interleaved, so we'll never observe a torn value.
  • Memory barriers. The CPU won't do out of order execution of volatile loads/stores (or of the loads/stores around them), and it won't cause cache inconsistencies.
Volatile doesn't provide:
  • A solution to all your multithreading problems. There are countless scenarios where you'd need proper locking via mutexes etc., or interlocked APIs that operate on data atomically.

A Case Study

Let's look at a small piece of pseudocode which will provide us with the basis for discussion.

MyType my_global;
volatile bool f_my_global_ready = false;

void thread1()
{
  my_global = setup_my_object();
  f_my_global_ready = true;
}

void thread2()
{
  while (!f_my_global_ready);
  use_my_global_object();
}

The code is fairly simple. One thread initializes a global object and sets the flag indicating that it's ready; the other busy waits (polls) until the flag is set and then uses the object prepared by the first thread.

I should mention that while loads/stores of f_my_global_ready are volatile load/store operations, the initialization and consumption of my_global consist of non volatile loads/stores. This will be important for the reordering guarantees.

Now let's think about how the volatile keyword in different languages would affect the behavior of the above code.

Analysis of the Case Study

C/C++

The code specified in the case study won't work as expected in C/C++, for several reasons.

If the compiler optimizes away the read of f_my_global_ready in thread2, we may never see the flag getting set and the while loop will keep going forever. To prevent the compiler from doing this optimization, we can declare f_my_global_ready as volatile.

However, even if we make the flag volatile there are other issues making the code buggy. First, there's the atomicity of the load. In this specific example we're loading a bool, which is a single byte, so I doubt this will be a problem on any hardware platform. But if we were working with a multi byte primitive, such as a 32 bit integer, we could be vulnerable to race conditions, depending on the hardware platform and on how the integer is aligned in memory. Volatile doesn't fix these issues in C/C++.

Second, the compiler won't optimize or reorder loads/stores of the volatile flag, but it can reorder other 'non volatile' loads and stores around it. This means that part of the instructions setting up my_global might get placed by the compiler after the setting of the flag. Thread2 would then break out of the while loop and use my_global before it was fully initialized. This would be a serious bug.

Third, similar to the compiler reordering, the CPU might do out of order execution, which again would result in thread2 accessing a not fully initialized my_global. Cache coherency issues can also cause the same problem, as other CPUs might see the loads/stores in a different order.

Java/C#

For Java (5 and later) and C#, the code in the case study will work just fine. Thanks to the guarantees given by volatile in these languages, everything works as expected. Let's go over them one more time:

  • After the flag is set, thread2 will see the new value the next time it polls the flag, because the flag is declared volatile and the compiler won't cache its value.
  • Load/store of the variable is atomic (again, not read, modify, write), so we're not concerned with accidentally reading a torn value.
  • Initialization of my_global will complete strictly before the f_my_global_ready flag is set, because the compiler doesn't reorder non volatile load/store operations across volatile ones. The volatile load/store acts as a strict barrier for all other loads/stores.
  • The CPU won't do out of order execution or suffer cache inconsistency for the volatile loads/stores, thanks to the compiler inserted memory barriers around them.

Solution to the C/C++ Case

Volatile doesn't help with the code we've been working on in C/C++, so what other options do we have?

  • If we're using C++11, we can use the std::atomic<T> types, which act the same way volatile does in C#/Java.
  • We can use the Interlocked* APIs on Windows, or the similar APIs present on other platforms.
  • Use proper locking primitives like spinlocks, mutexes, semaphores, etc.
