Lock-free in Swift: Barriers

Alex Shchukin
Dec 14, 2022 · 20 min read

I want to start another series of articles, this time about lock-free algorithms and how we can implement them using the Swift Atomics framework. It's a very complicated topic with a lot of hidden pitfalls: lock-free code is difficult to debug, and mistakes can make an algorithm behave very unpredictably. That's why it's often recommended to use ordinary locks instead of lock-free algorithms, but I believe studying them helps you understand how the processor and memory work. In this article we will start with the basics and discuss what atomics, memory barriers, and the MESI protocol are.

Atomic operations

An atomic operation is an operation that cannot be split into parts while it's executing: it is either fully executed or not executed at all, with no intermediate state. There are three types of atomic operations: read, write, and read-modify-write (RMW). All of these operations are usually implemented as processor instructions, and, importantly, the instructions differ between processor families. For example, Intel and ARM have different instructions for the CAS operation. The good news is that a high-level atomics framework covers this for us, so we can focus on the algorithm implementation. This part covers only atomic read and write operations; you will learn about RMW operations in the next part.

For now, we don’t need to dig into the Swift Atomics framework syntax; we can simply use store (for writing) and load (for reading) as atomic operations. The interesting thing is that on modern processors atomicity is guaranteed only for aligned integral types such as integers and pointers, while reading unaligned data is not atomic. The compiler can guarantee correct alignment for integral types. To get a better understanding of atomic operations, let’s start with the simplest mutex for two processors built from nothing but store and load: Peterson’s lock. In the implementation below we control the lock through a while loop and three variables: A, B, and turn. Whenever one processor wants to execute the critical section, it sets its variable A (or B in the case of P1) to 1, and the other processor has to wait until the first one is done.

Initial data:
A = 0, B = 0 and turn = 0

P0:
A = 1 // store A, 1
turn = 1 // store turn, 1
// load B
// load turn
while (B == 1 && turn == 1) { // Wait }
// Critical section
// Do some work
// End critical section
A = 0 // store A, 0

P1:
B = 1 // store B, 1
turn = 0 // store turn, 0
// load A
// load turn
while (A == 1 && turn == 0) { // Wait }
// Critical section
// Do some work
// End critical section
B = 0 // store B, 0

Unfortunately, there are some issues that can happen with that code.

Memory consistency

And now we are touching on a very important problem. This lock works only under sequential consistency, meaning we expect the order of execution to match the order in which the code is written. That is how the programmer expects it to run, but the instructions can be reordered by the processor and/or the compiler. In the past, when a processor had only one core, it was easier to support sequential memory consistency. Nowadays the processor still presents that paradigm to the programmer: even with multiple cores it behaves as if it had only one, and that makes sequential consistency much harder to provide.

Modern processors reorder instructions as an optimization. A read from memory is an expensive operation, so a read is sometimes hoisted to execute earlier; a write is cheaper in the sense that you don't need to wait until it completes. The resulting model is called relaxed memory consistency.

Let’s see how the implementation of Peterson’s algorithm can be reordered:

Initial data:
A = 0, B = 0 and turn = 0

P0:
// load B from the loop got executed before store A
A = 1 // store A, 1
turn = 1 // store turn, 1
// load turn
while (B == 1 && turn == 1) { // Wait }


P1:
// load A from the loop got executed before store B
B = 1 // store B, 1
turn = 0 // store turn, 0
// load turn
while (A == 1 && turn == 0) { // Wait }

P0 and P1 executed their load B and load A before the actual stores, which means they can both be in the critical section at the same time. Thankfully, we can protect our code from reordering: memory barriers prevent reorders at the processor level. Memory barriers can be very heavy, sometimes even heavier than ordinary mutexes, so programmers should be extra careful with them. Let's see how we can use them in our example:

Initial data:
A = 0, B = 0 and turn = 0

P0:
A = 1 // store A, 1
turn = 1 // store turn, 1

memory_barrier()

// load B
// load turn
while (B == 1 && turn == 1) { // Wait }
// Critical section
// Do some work
// End critical section

A = 0 // store A, 0

P1:
B = 1 // store B, 1
turn = 0 // store turn, 0

memory_barrier()

// load A
// load turn
while (A == 1 && turn == 0) { // Wait }
// Critical section
// Do some work
// End critical section
B = 0 // store B, 0

So this pseudo-code works as intended, because the memory barriers prevent the instructions from being reordered. I intentionally didn't use specific barrier instructions, because they differ between platforms and operating systems. The barrier in this context means sequential ordering: instructions above the barrier will not be mixed with instructions below it. The processor can no longer reorder load B before store A in P0, or load A before store B in P1.

Cache coherence

For a better understanding of why we need memory barriers and how they work, we should look at the hardware level and see what's going on there. In modern architectures, processors use their own caches (actually multiple caches per processor) to avoid working with memory directly, because that is a relatively expensive operation time-wise. But this causes a problem: each processor keeps its own local data in its cache, so that data can differ from the other caches and from shared memory. To understand this, let's look at an example:

There are two processors P0 and P1 
Each has its own cache C0 and C1
There is a shared memory M which they can access
Memory M contains value A = 0

Step 1
P0 loads A from Memory, C0->A = 0
--- State ---
C0 = A = 0
M = A = 0

Step 2
P1 loads A from Memory, C1->A = 0
--- State ---
C0 = A = 0
C1 = A = 0
M = A = 0

Step 3
P0 stores 1 to A in its cache, C0->A = 1
--- State ---
C0 = A = 1
C1 = A = 0
M = A = 0

The cache becomes incoherent.

To prevent this divergence we can use a group of techniques called cache coherence. There are many kinds of cache-coherent systems, depending on the number of processors in the system and on the design decisions of the company producing them, but we don't need to know all of them. Instead, we can focus on the MESI protocol, one of the most popular ones. MESI and a bunch of related protocols maintain coherence with the same technique: invalidation. The basic idea is pretty simple: whenever a processor wants to modify a value in its cache, it sends a signal over the bus (the connection line that links processors, caches, and memory) telling the others to invalidate their copies.

MESI is an acronym for the four states of the protocol: M — modified, E — exclusive, S — shared, and I — invalid. As we discussed before, our simplified processor architecture consists of three key components: the CPU itself, the CPU cache, and memory; further on we will introduce some additional components. All these states are needed to organize data sharing between multiple processors. Let's take a look at each state:

Modified — the CPU has written a memory block to its cache, and the block is guaranteed to live only in this CPU's cache. The owner of the memory block can read and write it without communicating with the other CPUs. Example: a CPU has a memory block in the exclusive state and wants to modify it; it doesn't need to send any extra messages, so it just changes the state to modified.

Exclusive — similar to modified, except the block has not been changed by the CPU. Example: CPU0 reads a memory block from Memory into its cache, and no other CPU in the system holds that block.

Shared — at least two caches hold the memory block. A CPU can read it without communicating with other CPUs, but cannot write it without notifying the others. Example: a CPU has a memory block in the shared state and wants to modify it; it sends an invalidate message to the other CPUs, receives the invalidation acknowledgments, and then changes the state to modified.

Invalid — a memory block in the invalid state is not present in the CPU cache. On an attempt to read it the CPU gets a cache miss and has to fetch it from Memory or from another CPU's cache. Example: a CPU sends a read request to Memory and, when the response arrives, writes the memory block over the slot that was invalidated before.

There are two pairs of messages that CPUs use to communicate with each other when accessing memory blocks in a MESI architecture:

- Read request and read response — a request for a specific memory block that the CPU doesn't own; in response it receives the block. The read response can come from another CPU's cache or from Memory.

- Invalidate request and invalidate response — when a processor wants to own a memory block (basically, to modify a shared one), it sends an invalidate request to the other caches. Whenever a processor receives an invalidate message, it must set the memory block to the invalid state and then send an acknowledge message back to the sender.

I intentionally left out the read-invalidate and writeback messages to keep the model simple and, since it is abstract anyway, not to overload the reader. The full table of messages and the state diagram can be found in [5].

Let's revisit the first example and see how it works in the MESI design:

Initial data:
There are P0 and P1; the cache line for A is in the invalid state in both caches.
Memory contains 0 at address A.

Step 1:
P0 sends a read message for memory block A and receives a read response
from Memory. C0->A goes from invalid to shared.

Step 2:
P1 sends a read message for memory block A and receives the response from
the C0 cache, since that is faster than reading from Memory.
C1->A goes from invalid to shared.

Step 3:
P0 wants to modify C0->A. It sends an invalidate message to the bus.

Step 4:
P1 receives the invalidate message and turns C1->A to the invalid state.
Then it sends an invalidate acknowledge to the bus.

Step 5:
P0 receives the acknowledge message and changes C0->A to exclusive.

Step 6:
P0 changes the data at C0->A to 1 and sets the C0->A state to modified.

As you can see, the initial example became more complicated in the MESI architecture. It is strongly recommended to read [5], as it contains a lot of specific details and examples.

The invalidate and invalidate acknowledge pair is a pretty heavy operation if we consider it as a whole: the CPU sends the invalidate message to the bus and then waits until it receives the acknowledge message. To bypass this bottleneck, a store buffer was introduced between each CPU and its cache. The CPU puts the write operation into the buffer and continues executing other operations until it owns the memory block. Let's see how our example changes with a store buffer:

Initial data:
There are P0 and P1; the cache line for A is in the invalid state in both caches.
Memory contains 0 at address A.

Step 1:
P0 sends a read message for memory block A and receives a read response
from Memory. C0->A goes from invalid to shared.

Step 2:
P1 sends a read message for memory block A and receives the response from
the C0 cache, since that is faster than reading from Memory.
C1->A goes from invalid to shared.

Step 3:
P0 wants to modify C0->A. It sends an invalidate message to the bus.

Step 4:
P0 puts the operation A = 1 into its store buffer.
P0 continues execution instead of waiting, but its cache still holds A = 0.



Step N:
P1 receives the invalidate message and sets the state of A to invalid.
Then it sends an invalidate acknowledge to the bus.

Step N + 1:
P0 receives the acknowledge message and executes the write to C0
from its store buffer. It sets C0->A to modified.

The store buffer allows write operations to execute asynchronously, but it also introduces problems, which we will discuss after covering all the optimizations. Under intensive reading and writing, invalidations happen very often, and that dramatically affects the performance of the system, because invalidating data at an address is a heavy operation in itself. Instead of applying an invalidation immediately, the CPU can put it into a queue and guarantee that the invalidation will happen before any new operations on that memory block. This queue is called the invalidate queue, and it requires that the invalidation of a memory block be executed before any MESI messages related to that block are sent. To demonstrate how it works, let's continue with our example:

Initial data:
There are P0 and P1; the cache line for A is in the invalid state in both caches.
Memory contains 0 at address A.

Step 1:
P0 sends a read message for memory block A and receives a read response
from Memory. C0->A goes from invalid to shared.

Step 2:
P1 sends a read message for memory block A and receives the response from
the C0 cache, since that is faster than reading from Memory.
C1->A goes from invalid to shared.

Step 3:
P0 wants to modify C0->A. It sends an invalidate message to the bus.

Step 4:
P0 puts the operation C0->A = 1 into its store buffer.
P0 continues execution instead of waiting.



Step N:
P1 receives the invalidate message, puts the invalidate operation
into its invalidate queue, and sends the acknowledge message to the bus.

Step N + 1:
P0 receives the acknowledge message and executes the write to C0 from its
store buffer. It sets C0->A to modified.



Step N + M:
P1 executes the invalidate operation on the C1->A memory block.

There is one important rule worth mentioning: if a processor needs a memory block that is sitting in its own store buffer, it can read it from there directly, but it can never access the store buffers of other processors. With the invalidate queue it is the opposite: a processor does not consult even its own invalidate queue when it reads.

You've probably noticed that the store buffer and the invalidate queue bring more asynchrony into the execution. That is exactly how reorderings happen at the hardware level: while the store buffer holds a write operation, the cache still contains the old value, and likewise the cache keeps serving a memory block even when an invalidation for it is sitting in the invalidate queue.

Here is the full scheme of the multi-processor architecture:

Abstract scheme of a multi-processor architecture

To see how this can happen, let's trace the example with Peterson's lock:

Initial data:
A = 0 is stored in cache C1 of P1 in the exclusive state,
B = 0 is stored in cache C0 of P0 in the exclusive state,
turn = 0 is stored in C0 and C1 as shared.

P0:
// Step 1, Step 4, Step 5, Step 11
A = 1
// Step 12
turn = 1

// Step 13
while (B == 1 && turn == 1) { }
// Step 14, Step 15
// Do some work
A = 0

P1:
// Step 2, Step 3, Step 6
B = 1
// Step 7, Step 8
turn = 0

// Step 9
while (A == 1 && turn == 0) { }
// Step 10
// Do some work

B = 0

Step 1:
P0 wants to change A but doesn't have A in C0,
so it sends a read message to the bus.

Step 2:
P1 wants to change B but doesn't have B in C1,
so it sends a read message to the bus.

Step 3:
P1 receives the read message, changes the state of A to shared,
and returns a response message.

Step 4:
P0 receives the read message, changes the state of B to shared,
and returns a response message.

Step 5:
P0 receives the response with A = 0, puts the operation A = 1
into its store buffer, and sends an invalidation for A.

Step 6:
P1 receives the response with B = 0, puts the operation B = 1
into its store buffer, and sends an invalidation for B.

Step 7:
P1 already has turn = 0 in its cache, so it doesn't need to modify it.

Step 8:
P1 receives the invalidation for A, puts it into its invalidate queue,
and sends the acknowledge, but C1 still holds A = 0.

Step 9:
P1 has turn = 0 and A = 0 in C1 (even though it received the invalidation
for A) and B = 1 in its store buffer.
The stale A = 0 makes (A == 1 && turn == 0) return false.

Step 10:
P1 enters the critical section.

Step 11:
P0 receives the acknowledge for A, so it applies A = 1 to its cache C0
from the store buffer and changes the state to modified.

Step 12:
P0 wants to modify turn, so it puts turn = 1 into its store buffer
and sends an invalidate.

Step 13:
P0 has A = 1, B = 0 in its cache and turn = 1 in its store buffer,
so the condition (B == 1 && turn == 1) returns false.

Step 14:
P0 enters the critical section, but P1 has already entered it.

Step 15:
P0 receives the invalidation for B, but it's too late:
P0 and P1 are in the critical section at the same time.

The reorderings created chaos here, which is why we need memory barriers to bring some order to the execution. Let's introduce two abstract barriers. A Load Load barrier operates on the invalidate queue: calling it forces all the pending invalidations in the queue to be applied, which establishes an order between the load operations before the barrier and those after it (hence Load Load). A Store Store barrier operates on the store buffer: it flushes all buffered store operations, so the stores before the barrier are ordered with respect to the stores after it. Now we can use these barriers to fix Peterson's algorithm.

Initial data:
A = 0 is stored in cache C1 of P1 in the exclusive state,
B = 0 is stored in cache C0 of P0 in the exclusive state,
turn = 0 is stored in C0 and C1 as shared.

P0:
// Step 1, Step 4, Step 5, Step 10
A = 1
// Step 11
turn = 1

// Step 12, Step 13, Step 17
Store Store barrier

// Step 18, Step 19
Load Load barrier

// Step 20, Step 21, Step 23
while (B == 1 && turn == 1) { }

// Step 24, Step 28
// Do some work

// Step N
A = 0
// Step N + 3, Step N + 6

P1:
// Step 2, Step 3, Step 6
B = 1
// Step 7, Step 8
turn = 0

// Step 9, Step 14, Step 15
Store Store barrier

// Step 16, Step 22, Step 25
Load Load barrier

// Step 26, Step 27, Step 29, Step 30, Step N + 1, Step N + 2,
// Step N + 4, Step N + 5, Step N + 7
while (A == 1 && turn == 0) { }

// Step N + 8
// Do some work

B = 0

Step 1:
P0 wants to change A but doesn't have A in C0,
so it sends a read message to the bus.

Step 2:
P1 wants to change B but doesn't have B in C1,
so it sends a read message to the bus.

Step 3:
P1 receives the read message, changes the state of A to shared,
and returns a response message.

Step 4:
P0 receives the read message, changes the state of B to shared,
and returns a response message.

Step 5:
P0 receives the response with A = 0, puts the operation A = 1
into its store buffer, and sends an invalidation for A.

Step 6:
P1 receives the response with B = 0, puts the operation B = 1
into its store buffer, and sends an invalidation for B.

Step 7:
P1 already has turn = 0 in its cache, so it doesn't need to modify it.

Step 8:
P1 receives the invalidation for A, puts it into its invalidate queue,
and sends the acknowledge, but C1 still holds A = 0.

Step 9:
P1 executes the Store Store barrier, so now it has to finish all the
operations in its store buffer. It waits until it gets the acknowledge for B.

Step 10:
P0 receives the acknowledge for A, so it applies A = 1 to its cache C0
from the store buffer and changes the state to modified.

Step 11:
P0 wants to modify turn, so it puts turn = 1 into its store buffer
and sends an invalidate.

Step 12:
P0 executes the Store Store barrier, so now it has to finish all the
operations in its store buffer. It waits until it gets the acknowledge for turn.

Step 13:
P0 receives the invalidation for B, puts it into its invalidate queue,
and sends the acknowledge.

Step 14:
P1 receives the acknowledge for B, executes B = 1 in C1,
and sets the state of B to modified.

Step 15:
P1 receives the invalidation for turn, so it puts it into its
invalidate queue and sends the acknowledge.

Step 16:
P1 executes the Load Load barrier. It needs to empty its invalidate queue,
which holds the invalidations of A and turn.

Step 17:
P0 receives the acknowledge for turn, executes turn = 1
from its store buffer, and changes the state to modified.

Step 18:
P0 executes the Load Load barrier. It needs to empty its invalidate queue,
which holds the invalidation of B.

Step 19:
P0 applies the invalidation of B from its invalidate queue and sets
the state of B to invalid. Now it can continue execution.

Step 20:
P0 has A = 1 and turn = 0 in its cache.
To evaluate the condition (B == 1 && turn == 1) it needs to read B.

Step 21:
P0 sends a read request for B.

Step 22:
P1 receives the read request for B, returns a response with B = 1,
and changes the state of B to shared.

Step 23:
P0 receives the read response with B = 1 and sets the shared state for B.

Step 24:
P0 has A = 1, B = 1, turn = 0, so the condition (B == 1 && turn == 1)
returns false. P0 enters the critical section.

Step 25:
P1 applies the invalidations of A and turn from its invalidate queue and
sets their states to invalid. Now it can continue execution.

Step 26:
P1 has B = 1 in C1.
To evaluate the condition (A == 1 && turn == 0) it needs to read A and turn.

Step 27:
P1 sends two read requests, for A and for turn.

Step 28:
P0 receives the two read requests for A and turn and returns two responses,
A = 1 and turn = 0, changing their states to shared.

Step 29:
P1 receives the two read responses with A = 1 and turn = 0
and sets their states to shared.

Step 30:
P1 has A = 1, B = 1, turn = 0, so the condition (A == 1 && turn == 0)
returns true.
P1 cannot enter the critical section and keeps executing the while loop.



P0 does work in the critical section
P1 executes while loop



Step N:
P0 finishes its work in the critical section and wants to change A.
A is in the shared state, so it puts A = 0 into its store buffer
and sends an invalidate message.

Step N + 1:
P1 receives the invalidate message for A, puts the invalidation of A
into its invalidate queue, and sends the acknowledge.

Step N + 2:
P1 continues executing the while loop with the old value of A.

Step N + 3:
P0 receives the acknowledge, executes A = 0 from its store buffer,
and sets the state of A to modified.

Step N + 4:
P1 applies the invalidation of A from its invalidate queue and sets
the state of A to invalid, so it no longer has A in its cache C1.

Step N + 5:
P1 sends a read message for A.

Step N + 6:
P0 receives the read message for A, changes the state of A to shared,
and returns a response with A = 0.

Step N + 7:
P1 receives the response with A = 0, stores it in its cache C1,
and sets the shared state for A.

Step N + 8:
P1 has A = 0, B = 1, turn = 0 in C1,
so the condition (A == 1 && turn == 0) returns false.
P1 enters the critical section.

As you can see, it takes a lot of steps to organize the cross-processor execution of Peterson's algorithm, and you can guess that a barrier is not a cheap operation for the CPU. We employed two barriers: Store Store, which operates on the store buffer, and Load Load, which operates on the invalidate queue. I chose the abstract names Store Store and Load Load because the real commands differ between processor architectures and platforms. There are also other types of barriers that we didn't need in the Peterson lock example.

Let’s see all types of barriers we can encounter:

Load Load — the familiar barrier we used in the article: it operates only on the invalidate queue (emptying it) and guarantees an order for load instructions, so the loads before the barrier will not interfere with those after it.

Store Store — the other barrier we used above: it works with the store buffer (flushing it) and provides an order for store (write) instructions, so the stores before the barrier will not mix with the stores after it.

Load Store — a barrier that guarantees that load operations before the barrier will not be executed after it, and that store operations after the barrier will not be executed before it; it is considered a light barrier.

Store Load — a rare barrier that orders preceding store operations before subsequent load operations. It is considered the heaviest of all the barriers.

Architecture specifics

Different CPU architectures organize memory ordering differently. In x86 processors the ordering is pretty strong, while in ARM architectures it is the opposite: the memory model is much more relaxed. This is important to know given Apple's move to ARM machines.

                                          ARMv7    x86
Loads can be reordered after loads          Yes     No
Loads can be reordered after stores         Yes     No
Stores can be reordered after stores        Yes     No
Stores can be reordered after loads         Yes     Yes
Atomics can be reordered with loads         Yes     No
Atomics can be reordered with stores        Yes     No
Dependent loads can be reordered            No      No
Incoherent instruction cache/pipeline       Yes     Yes

So you can see that ARM can reorder instructions in almost any direction compared to x86. That makes ARM more performant thanks to the optimizations the processor can make, but it is more complicated to keep code correct in such a relaxed memory model. It's worth adding that the memory model was simplified in ARMv8 [4]. Fortunately, we don't need to place all the barriers manually: we can rely on the memory model Swift adopted from C++, and the compiler will set the memory barriers for us. We will discuss that memory model in the next part.

Performance

There is a lot of debate about the usage of lock-free algorithms. First of all, they are very difficult to implement correctly and to debug, as you can probably tell from the number of problems with instruction reordering alone. Another issue is that atomic operations can be heavy and slower than their non-atomic counterparts, so sometimes there is no gain in using a lock-free algorithm, especially if it performs atomic operations throughout instead of confining them to a critical section.

Conclusion

Today we didn't write a single line of Swift, but the goal of this article was to explain how a system with multiple CPUs shares memory. In the next article we will cover Swift's memory model and introduce atomic RMW operations, since they are fundamental building blocks of lock-free algorithms. We will also discuss what the ABA problem is and how we can potentially solve it.

Sources

  1. Maurice Herlihy, Nir Shavit, Victor Luchangco, Michael Spear. The Art of Multiprocessor Programming.
  2. MIT 6.172 Performance Engineering of Software Systems. Charles Leiserson. Synchronization Without Locks.
  3. Ben H. Juurlink. Multicore Architectures course.
  4. ARMv8 Memory model.
  5. Memory Barriers: a Hardware View for Software Hackers.
