Programming Languages

Concurrency: Memory Consistency

Dr. Michael Petter
Winter term 2018
Thread A

```c
void foo(void) {
    a = 1;
    b = 1;
}
```

Thread B

```c
void bar(void) {
    while (b == 0){};
    assert (a==1);
}
```

**Intuition**: the assertion will never fail

⚠️ **Real execution**: given enough tries, the assertion may eventually fail

⇒ in need of defining a *Memory Model*
Memory Models

Memory interactions behave differently in presence of:
- multiple concurrent threads
- data replication in hierarchical and/or distributed memory systems
- deferred communication of updates

Memory Models are a product of negotiating:
- restrictions of freedom of implementation to guarantee race related properties
- establishment of freedom of implementation to enable *program* and *machine model* optimizations

Modern Languages include the memory model in their language definition
Strict Consistency

Motivated by sequential computing, we intuitively implicitly transfer our idea of semantics of memory accesses to concurrent computation. This leads to our idealistic model *Strict Consistency*:

**Definition (Strict consistency)**

Independently of which process reads or writes, the value from the most recent write to a location is observable by reads from the respective location immediately *after* the write occurs.

Although ideally desired, practically not existing

⚠️ absolute global time problematic

⚠️ physically not possible

⇝ strict consistency is too strong to be realistic
Abandoning absolute time

Thread A

```c
void foo(void) {
    a = 1;
    b = 1;
}
```

Thread B

```c
void bar(void) {
    while (b == 0) {};
    assert(a == 1);
}
```

- initial state of `a` and `b` is 0
- A writes `a` before it writes `b`
- B should see `b` go to 1 before executing the `assert` statement
- the `assert` statement should always hold

⇝ here correctness means: writing a 1 to `a` happens before reading a 1 in `b`

Still, any of the following may happen:

⇝ Idea: state correctness in terms of what event may happen before another one
Happend-Before Relation and Diagram
Events in a Distributed System

A process as a series of events [Lam78]: Given a distributed system of processes $P, Q, R, \ldots$, each process $P$ consists of events $p_1, p_2, \ldots$.

Example:

- event $p_i$ in process $P$ happened before $p_{i+1}$
- if $p_i$ is an event that sends a message to $Q$ then there is some event $q_j$ in $Q$ that receives this message and $p_i$ happened before $q_j$
## The Happened-Before Relation

### Definition
If an event $p$ *happened before* an event $q$ then $p \rightarrow q$.

Observe:
- $\rightarrow$ is partial (neither $p \rightarrow q$ or $q \rightarrow p$ may hold)
- $\rightarrow$ is irreflexive ($p \rightarrow p$ never holds)
- $\rightarrow$ is transitive ($p \rightarrow q \land q \rightarrow r$ then $p \rightarrow r$)
- $\rightarrow$ is asymmetric (if $p \rightarrow q$ then $\neg(q \rightarrow p)$)

$\rightarrow$ the $\rightarrow$ relation is a *strict partial order*
Let \( a \not\rightarrow b \) abbreviate \( \neg (a \rightarrow b) \).

**Definition**

Two distinct events \( p \) and \( q \) are said to be *concurrent* if \( p \not\rightarrow q \) and \( q \not\rightarrow p \).

\[
P_1 \rightarrow P_2 \rightarrow P_3 \rightarrow P_4
\]

\[
Q_1 \rightarrow Q_2 \rightarrow Q_3 \rightarrow Q_4 \rightarrow Q_5 \rightarrow Q_6 \rightarrow Q_7
\]

\[
R_1 \rightarrow R_2 \rightarrow R_3 \rightarrow R_4
\]

- \( p_1 \rightarrow r_4 \) in the example
- \( p_3 \) and \( q_3 \) are, in fact, concurrent since \( p_3 \not\rightarrow q_3 \) and \( q_3 \not\rightarrow p_3 \).
Ordering

Let $C$ be a *logical clock* i.e. $C$ assigns a *globally unique* time-stamp $C(p)$ to each event $p$.

**Definition (Clock Condition)**

Function $C$ satisfies the *clock condition* if for any events $p, q$

$$p \rightarrow q \implies C(p) < C(q)$$

For a distributed system the *clock condition* holds iff:

1. $p_i$ and $p_j$ are events of $P$ and $p_i \rightarrow p_j$ then $C(p_i) < C(p_j)$
2. $p$ is the sending of a message by process $P$ and $q$ is the reception of this message by process $Q$ then $C(p) < C(q)$

$\Rightarrow$ a logical clock $C$ that satisfies the clock condition describes a *total order* $a < b$ (with $C(a) < C(b)$) that *embeds* the strict partial order $\rightarrow$

The *set* defined by all $C$ that satisfy the clock condition is exactly the *set* of executions possible in the system.

$\Rightarrow$ use the process model and $\rightarrow$ to define better consistency model
Defining $C$ Satisfying the Clock Condition

Given:

\[
\begin{array}{c}
\text{\textbf{P}} \\
\text{\textbf{Q}} \\
\text{\textbf{R}}
\end{array}
\]

\[
\begin{array}{cccc}
\text{\textbf{P}} & p_1 & p_2 & p_3 & p_4 \\
\text{\textbf{Q}} & q_1 & q_2 & q_3 & q_4 & q_5 & q_6 & q_7 \\
\text{\textbf{R}} & r_1 & r_2 & r_3 & r_4
\end{array}
\]

\[
\begin{array}{|c|c|c|c|c|}
\hline
\text{e} & p_1 & p_2 & p_3 & p_4 \\
\hline
\text{C(e)} & 1 & 4 & 7 & 12 \\
\hline
\end{array}
\]

\[
\begin{array}{|c|c|c|c|c|c|c|c|}
\hline
\text{e} & q_1 & q_2 & q_3 & q_4 & q_5 & q_6 & q_7 \\
\hline
\text{C(e)} & 2 & 3 & 5 & 6 & 11 & 13 & 14 \\
\hline
\end{array}
\]

\[
\begin{array}{|c|c|c|c|c|}
\hline
\text{e} & r_1 & r_2 & r_3 & r_4 \\
\hline
\text{C(e)} & 8 & 9 & 10 & 15 \\
\hline
\end{array}
\]
We can model concurrency using processes and events:

- there is a \textit{happened-before} relation between the events of each process
- there is a \textit{happened-before} relation between communicating events
- \textit{happened-before} is a strict partial order
- a clock is a total strict order that embeds the \textit{happened-before} partial order
Memory Consistency Models based on the Happened-Before Relation
Idea: use happened-before diagrams to model more relaxed memory models.

Given a path through each of the threads of a program:
- consider the actions of each thread as events of a process
- use more processes to model memory
  - here: one process per variable in memory
- \( \rightarrow \) concisely represent some interleavings

\( \rightarrow \) We establish a model for *Sequential Consistency*. 
Sequential Consistency

Definition (Sequential Consistency Condition [Lam78])

The result of any execution is the same as if the memory operations
of each individual processor appear in the order specified by its program
of all processors joined were executed in some sequential order.

Sequential Consistency applied to Multiprocessor Programs:
Given a program with \( n \) threads,

1. for fixed event sequences \( p_0^1, p_1^1, \ldots \) and \( p_0^2, p_1^2, \ldots \) and \( p_0^n, p_1^n, \ldots \) keeping
the program order,

2. executions obeying the clock condition on the \( p_i^j \),

3. all executions have the same result

Yet, in other words:

1. defines the \textit{execution path} of each thread

2. each execution mentioned in \textit{1} is one \textit{interleaving} of processes

3. declares that the result of running the threads with these interleavings
is always the same.
Sequential Consistency in Multiprocessor Programs:
Given a program with \( n \) threads,

1. for fixed event sequences \( p_0^1, p_1^1, \ldots \) and \( p_0^2, p_1^2, \ldots \) and \( p_0^n, p_1^n, \ldots \) keeping the program order,
2. executions obeying the clock condition on the \( p_j^i \),
3. all executions have the same result

Idea for showing that a system is not sequentially consistent:

- pick a result obtained from a program run on a SC system
- pick an execution \(^1\) and a total ordering of all operations \(^2\)
- add extra processes to model other system components
- the original order \(^2\) becomes a partial order \( \rightarrow \)
- show that total orderings \( C' \) exist for \( \rightarrow \) for which the result differs
Weakening the Model

Observation: more concurrency possible, if we model each memory location separately, i.e. as a different process

Sequential consistency still obeyed:
- the accesses of \texttt{foo} to \texttt{a} occurs before \texttt{b}
- the first two read accesses to \texttt{b} are in parallel to \texttt{a=1}

Conclusion: There is no observable change if accesses to different memory locations can happen in parallel.
Benefits of Sequential Consistency

- concisely represent *all* interleavings that are due to variations in timing
- synchronization using time is uncommon for software

⇝ a good model for correct behaviors of concurrent programs
⇝ program results besides SC results are undesirable (they contain *races*)

Realistic model for simple hardware architectures:

- sequential consistency model suitable for concurrent processors that acquire *exclusive* access to memory
- processors can speed up computation by using *caches* and still made to maintain sequential consistency

Not realistic for elaborate hardware with out-of-order stores:

- what other processors see is determined by complex optimizations to cacheline management

⇝ internal workings of caches
Introducing Caches: The MESI Protocol
The MESI Protocol: States [PP84]

Processors use caches to avoid a costly round-trip to RAM for every memory access.

- programs often access the same memory area repeatedly (e.g. stack)
- keeping a local mirror image of certain memory regions requires bookkeeping about who has the latest copy

Each cache line is in one of the states $M, E, S, I$:

- $I$: it is invalid and is ready for re-use
- $S$: other caches have an identical copy of this cache line, it is shared
- $E$: the content is in no other cache; it is exclusive to this cache and can be overwritten without consulting other caches
- $M$: the content is exclusive to this cache and has furthermore been modified

⇝ the global state of cache lines is kept consistent by sending messages
The MESI Protocol: Messages

Moving data between caches is coordinated by sending messages [McK10]:

- **Read**: sent if CPU needs to read from an address
- **Read Response**: when in state E or S, response to a Read message, carries the data for the requested address
- **Invalidate**: asks others to evict a cache line
- **Invalidate Acknowledge**: reply indicating that a cache line has been evicted
- **Read Invalidate**: like Read + Invalidate (also called “read with intend to modify”)
- **Writeback**: Read Response when in state M, as a side effect noticing main memory about modifications to the cacheline, changing sender’s state to S

We mostly consider messages between processors. Upon Read Invalidate, a processor replies with Read Response/Writeback before the Invalidate Acknowledge is sent.
Consider how the following code might execute:

**Thread A**

```plaintext
a = 1;  // A.1
b = 1;  // A.2
```

**Thread B**

```plaintext
while (b == 0) {};  // B.1
assert (a == 1);   // B.2
```

- In all examples, the initial values of variables are assumed to be 0
- Suppose that `a` and `b` reside in different cache lines
- Assume that a cache line is larger than the variable itself
- We write the content of a cache line as
  - `Mx`: modified, with value `x`
  - `Ex`: exclusive, with value `x`
  - `Sx`: shared, with value `x`
  - `I`: invalid
**Thread A**

```plaintext
a = 1; // A.1
b = 1; // A.2
```

**Thread B**

```plaintext
while (b == 0) {} // B.1
assert(a == 1); // B.2
```

<table>
<thead>
<tr>
<th>statement</th>
<th>CPU A</th>
<th>CPU B</th>
<th>RAM</th>
<th>message</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>a</td>
<td>b</td>
<td>a</td>
<td>b</td>
</tr>
<tr>
<td>A.1</td>
<td>I</td>
<td>I</td>
<td>I</td>
<td>I</td>
</tr>
<tr>
<td></td>
<td>I</td>
<td>I</td>
<td>I</td>
<td>I</td>
</tr>
<tr>
<td></td>
<td>I</td>
<td>I</td>
<td>I</td>
<td>I</td>
</tr>
<tr>
<td>B.1</td>
<td>M1</td>
<td>I</td>
<td>I</td>
<td>I</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>I</td>
<td>I</td>
<td>I</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>I</td>
<td>E0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>I</td>
<td>E0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>S0</td>
<td>S0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>M1</td>
<td>I</td>
<td>I</td>
</tr>
</tbody>
</table>

*read invalidate* of a from CPU A  
*invalidate ack.* of a from CPU B  
*read response* of a=0 from RAM  
*read* of b from CPU B  
*read response* with b=0 from RAM  
*read invalidate* of b from CPU A  
*read response* of b=0 from CPU B  
*invalidate ack.* of b from CPU B
MESI Example (II)

Thread A

\[
\begin{align*}
a &= 1; & \quad \text{// A.1} \\
b &= 1; & \quad \text{// A.2}
\end{align*}
\]

Thread B

\[
\begin{align*}
\text{while } (b == 0) & \{} \} ; & \quad \text{// B.1} \\
\text{assert}(a == 1); & \quad \text{// B.2}
\end{align*}
\]

<table>
<thead>
<tr>
<th>state-</th>
<th>CPU A</th>
<th>CPU B</th>
<th>RAM</th>
<th>message</th>
</tr>
</thead>
<tbody>
<tr>
<td>ment</td>
<td>a</td>
<td>b</td>
<td>a</td>
<td>b</td>
</tr>
<tr>
<td>B.1</td>
<td>M1</td>
<td>M1</td>
<td>I</td>
<td>I</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>M1</td>
<td>I</td>
<td>I</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>S1</td>
<td>I</td>
<td>S1</td>
</tr>
<tr>
<td></td>
<td>S1</td>
<td>S1</td>
<td>S1</td>
<td>S1</td>
</tr>
<tr>
<td>B.2</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>A.1</td>
<td>S1</td>
<td>S1</td>
<td>S1</td>
<td>S1</td>
</tr>
<tr>
<td></td>
<td>S1</td>
<td>S1</td>
<td>I</td>
<td>S1</td>
</tr>
<tr>
<td></td>
<td>M1</td>
<td>S1</td>
<td>I</td>
<td>S1</td>
</tr>
</tbody>
</table>

\textit{read} of \( b \) from CPU B
\textit{write back} of \( b=1 \) from CPU A
\textit{read} of \( a \) from CPU B
\textit{write back} of \( a=1 \) from CPU A
\textit{invalidate} of \( a \) from CPU A
\textit{invalidate ack.} of \( a \) from CPU B
Idea: each cache line one process, A caches $b=0$ as E, B caches $a=0$ as E

Observations:
- each memory access must complete before executing next instruction
  $\Rightarrow$ add edge
- second execution of test $b==0$ stays within cache $\Rightarrow$ no traffic
Sequential consistency:
- a characterization of well-behaved programs
- a model for differing speed of execution
- for fixed paths through the threads \textit{and} a total order between accesses to the same variable: executions can be illustrated by happened-before diagram with one process per variable
- MESI cache coherence protocol ensures SC for processors with caches
Introducing Store Buffers: Out-Of-Order Stores
Out-of-Order Execution

⚠️ performance problem: writes always stall

**Thread A**

\[
\begin{align*}
    a &= 1; & \text{ // A.1} \\
    b &= 1; & \text{ // A.2}
\end{align*}
\]

**Thread B**

\[
\begin{align*}
    \text{while} \ (b == 0) \ {} & \{}; & \text{ // B.1} \\
    \text{assert}(a == 1); & \text{ // B.2}
\end{align*}
\]

⇝ **CPU A** should continue executing after \(a=1\);
Store Buffers

⚠️ Abstract Machine Model: defines semantics of memory accesses

- put *each* store into a *store buffer* and continue execution
- Store buffers apply stores in various orders:
  - FIFO (Sparc/x86-\textit{TSO})
  - unordered (Sparc \textit{PSO})
- ⚠️ program order still needs to be observed locally
  - store buffer snoops read channel and
  - on matching address, returns the youngest value in buffer
Definition (Total Store Order)

1. The store order wrt. memory (\(\sqsubseteq\)) is total
   \[ \forall a,b \in \text{addr}, \ i,j \in \text{CPU} \ (\text{St}_i[a] \sqsubseteq \text{St}_j[b]) \lor (\text{St}_j[b] \sqsubseteq \text{St}_i[a]) \]

2. Stores in program order (\(\leq\)) are embedded into the memory order (\(\sqsubseteq\))
   \[ \text{St}_i[a] \leq \text{St}_i[b] \Rightarrow \text{St}_i[a] \sqsubseteq \text{St}_i[b] \]

3. Loads preceding an operation (wrt. program order \(\leq\)) are embedded into the memory order (\(\sqsubseteq\))
   \[ \text{Ld}_i[a] \leq \text{Op}_i[b] \Rightarrow \text{Ld}_i[a] \sqsubseteq \text{Op}_i[b] \]

4. A load’s value is determined by the latest write as observed by the local CPU
   \[ \text{val(} \text{Ld}_i[a] \text{)} = \text{val(} \text{St}_j[a] \mid \text{St}_j[a] = \max_{\sqsubseteq} (\{\text{St}_k[a] \mid \text{St}_k[a] \sqsubseteq \text{Ld}_i[a]\} \cup \{\text{St}_i[a] \mid \text{St}_i[a] \leq \text{Ld}_i[a]\})) \]

Particularly, one ordering property is not guaranteed:

\[ \text{St}_i[a] \leq \text{Ld}_i[b] \nRightarrow \text{St}_i[a] \sqsubseteq \text{Ld}_i[b] \]

⚠️ Local stores may be observed earlier by local loads than from somewhere else!
Happened-Before Model for TSO

Thread A

```c
a = 1;
printf("%d", b);
```

Thread B

```c
b = 1;
printf("%d", a);
```

Assume cache A contains: a: S0, b: S0, cache B contains: a: S0, b: S0

Memory Consistency

Out-of-Order Execution Stores
The x86 CPUs, powering desktops and servers around the world is a common representative of a TSO Memory Model based CPU.

- FIFO store buffers keep quite strong consistency properties
- The major obstacle to Sequential Consistency is

\[
\text{St}_i[a] \leq \text{Ld}_i[b] \not\Rightarrow \text{St}_i[a] \sqsubseteq \text{Ld}_i[b]
\]

- modern x86 CPUs provide the `mfence` instruction
- `mfence` orders all memory instructions:

\[
\text{Op}_i \leq mfence() \leq \text{Op}_i' \Rightarrow \text{Op}_i \sqsubseteq \text{Op}_i'
\]

- a fence between write and loads gives sequentially consistent CPU behavior (and is as slow as a CPU without store buffer)

\[\Rightarrow \text{ use fences only when necessary}\]
PSO Model: Formal Spec [SI92]

### Definition (Partial Store Order)

1. The store order wrt. memory \((\sqsubseteq)\) is total
   \[
   \forall a,b \in \text{addr} \ i,j \in \text{CPU} \quad (\text{St}_i[a] \sqsubseteq \text{St}_j[b]) \lor (\text{St}_j[b] \sqsubseteq \text{St}_i[a])
   \]

2. Fenced stores in program order \((\leq)\) are embedded into the memory order \((\sqsubseteq)\)
   \[
   \text{St}_i[a] \leq \text{sfence}() \leq \text{St}_i[b] \Rightarrow \text{St}_i[a] \sqsubseteq \text{St}_i[b]
   \]

3. Stores to the same address in program order \((\leq)\) are embedded into the memory order \((\sqsubseteq)\)
   \[
   \text{St}_i[a] \leq \text{St}_i[a]' \Rightarrow \text{St}_i[a] \sqsubseteq \text{St}_i[a]'
   \]

4. Loads preceding another operation (wrt. program order \(\leq\)) are embedded into the memory order \((\sqsubseteq)\)
   \[
   \text{Ld}_i[a] \leq \text{Op}_i[b] \Rightarrow \text{Ld}_i[a] \sqsubseteq \text{Op}_i[b]
   \]

5. A load's value is determined by the latest write as observed by the local CPU
   \[
   \text{val}(\text{Ld}_i[a]) = \text{val}(\text{St}_j[a] \mid \text{St}_j[a] = \max_{\sqsubseteq} \{\text{St}_k[a] \mid \text{St}_k[a] \sqsubseteq \text{Ld}_i[a]\} \cup \{\text{St}_i[a] \mid \text{St}_i[a] \leq \text{Ld}_i[a]\})
   \]

⚠️ Now also stores are not guaranteed to be in order any more:

\[
\text{St}_i[a] \leq \text{St}_i[b] \nRightarrow \text{St}_i[a] \sqsubseteq \text{St}_i[b]
\]

\[\rightarrow\] What about sequential consistency for the whole system?
Happened-Before Model for PSO

Thread A

\[ a = 1; \]
\[ b = 1; \]

Thread B

\[
\text{while } (b == 0) \{ \}\; \text{assert}(a == 1);
\]

Assume cache A contains: a: S0, b: E0, cache B contains: a: S0, b: I
Explicit Synchronization: Write Barrier

Overtaking of messages *may be desirable* and does not need to be prohibited in general.

- generalized store buffers render programs incorrect that assume sequential consistency between *different* CPUs
- whenever a store in front of another operation in one CPU must be observable in this order *by a different CPU*, an explicit *write barrier* has to be inserted
  - a write barrier marks all current store operations in the store buffer
  - the next store operation is only executed when all marked stores in the buffer have completed
Happened-Before Model for Write Barriers

Thread A

```cpp
a = 1;
sfence();
b = 1;
```

Thread B

```cpp
while (b == 0) {};
assert(a == 1);
```

Assume cache A contains: a: S0, b: E0, cache B contains: a: S0, b: I
Further weakening the model: O-o-O Reads
Relaxed Memory Order

Communication of cache updates is still costly:

- a cache-intense computation can fill up store buffers in CPUs
- waiting for invalidation acknowledgements may still happen
- invalidation acknowledgements are delayed on busy caches

- immediately acknowledge an invalidation and apply it later
- put each invalidate message into an invalidate queue
- if a MESI message needs to be sent regarding a cache line in the invalidate queue then wait until the line is invalidated

⚠️ local loads and stores do not consult the invalidate queue

- What about sequential consistency?
**Definition (Relaxed Memory Order)**

1. Fenced memory accesses in program order (\( \leq \)) are embedded into the memory order (\( \leq^\cdot \))

\[
\text{Op}_i[a] \leq \text{mfence()} \leq \text{Op}_i[b] \Rightarrow \text{Op}_i[a] \leq^\cdot \text{Op}_i[b]
\]

2. Stores to the same address in program order (\( \leq \)) are embedded into the memory order (\( \leq^\cdot \))

\[
\text{St}_i[a] \leq \text{St}_i[a]' \Rightarrow \text{St}_i[a] \leq^\cdot \text{St}_i[a]'
\]

3. Operations dependent on a load (wrt. *dependence* \( \rightarrow \)) are embedded in the memory order (\( \leq^\cdot \))

\[
\text{Ld}_i[a] \rightarrow \text{Op}_i[b] \Rightarrow \text{Ld}_i[a] \leq^\cdot \text{Op}_i[b]
\]

4. A load’s value is determined by the latest write as observed by the local CPU

\[
\text{val}(\text{Ld}_i[a]) = \text{val}(\text{St}_j[a] \mid \text{St}_j[a] = \max_{\leq^\cdot} \{\text{St}_k[a] \mid \text{St}_k[a] \leq^\cdot \text{Ld}_i[a]\} \cup \{\text{St}_i[a] \mid \text{St}_i[a] \leq \text{Ld}_i[a]\})
\]

⚠️ Now we need the notion of *dependence* \( \rightarrow \):

- Memory access to the same address:

\[
\text{St}_i[a] \leq \text{Ld}_i[a] \Rightarrow \text{St}_i[a] \rightarrow \text{Ld}_i[a]
\]

- Register reads are dependent on latest register writes:

\[
\text{Op}_i[a]'' = \max_{\leq} (\text{Op}_i[a]' \mid \text{targetreg}(\text{Op}_i[a]')) = \text{srcreg}(\text{Op}_i[b]) \land \text{Op}_i[a]' \leq \text{Op}_i[b]) \Rightarrow \text{Op}_i[a]'' \rightarrow \text{Op}_i[b]
\]

- Stores within branched blocks are dependent on branch conditionals:

\[
(\text{Op}_i[a] \leq \text{St}_i[b]) \land \text{Op}_i[a] \rightarrow \text{condbranch} \leq \text{St}_i[b] \Rightarrow \text{Op}_i[a] \rightarrow \text{St}_i[b]
\]
Happened-Before Model for Invalidate Queues

Thread A:

\[ a = 1; \]
\[ \text{sfence}(); \]
\[ b = 1; \]

Thread B:

\[ \text{while} \ b == 0 \ \{} \ \{ \} ; \]
\[ \text{assert} \ a == 1; \]

Assume cache A contains: a: S0, b: E0, cache B contains: a: S0, b: I

Memory Consistency
Out-of-Order Execution of Loads
Explicit Synchronization: Read Barriers

Read accesses do not consult the invalidate queue.

- might read an out-of-date value
- need a way to establish sequential consistency between writes of other processors and local reads
- insert an explicit *read barrier* before the read access
  - a read barrier marks all entries in the invalidate queue
  - the next read operation is only executed once all marked invalidations have completed
- a read barrier *before* each read gives sequentially consistent read behavior (and is as slow as a system without invalidate queue)

⇒ match each write barrier in one process with a read barrier in another process
Happened-Before Model for Read Barriers

Thread A

\[ a = 1; \]
\[ \text{sfence}(); \]
\[ b = 1; \]

Thread B

\[ \text{while} \ (b == 0) \ {}; \]
\[ \text{lfence}(); \]
\[ \text{assert}(a == 1); \]
Example: The Dekker Algorithm on RMO Systems
Using Memory Barriers: the Dekker Algorithm

Mutual exclusion of \textit{two} processes with busy waiting.

```c
// flag[] is boolean array; and turn is an integer
flag[0] = false;
flag[1] = false;
turn = 0; // or 1

P0:
flag[0] = true;
while (flag[1] == true)
    if (turn != 0) {
        flag[0] = false;
        while (turn != 0) {
            // busy wait
        }
        flag[0] = true;
    }
// critical section
turn = 1;
flag[0] = false;

P1:
flag[1] = true;
while (flag[0] == true)
    if (turn != 1) {
        flag[1] = false;
        while (turn != 1) {
            // busy wait
        }
        flag[1] = true;
    }
// critical section
turn = 0;
flag[1] = false;
```
The Idea Behind Dekker

Communication via three variables:

- \( \text{flag}[i] == \text{true} \) process \( P_i \) wants to enter its critical section
- \( \text{turn} == i \) process \( P_i \) has priority when both want to enter

**P0:**
```
flag[0] = true;
while (flag[1] == true) {
    if (turn != 0) {
        flag[0] = false;
        while (turn != 0) {
            // busy wait
        }
        flag[0] = true;
    }
}
// critical section
```
```
turn = 1;
flag[0] = false;
```

In process \( P_i \):

- if \( P_{1-i} \) does not want to enter, proceed immediately to the critical section
- \( \Rightarrow \) \( \text{flag}[i] \) is a **lock** and may be implemented as such
- if \( P_{1-i} \) also wants to enter, wait for \( \text{turn} \) to be set to \( i \)
- while waiting for \( \text{turn} \), reset \( \text{flag}[i] \) to enable \( P_{1-i} \) to progress
**Problem:** Dekker’s algorithm requires sequential consistency.

**Idea:** insert memory barriers between all variables common to both threads.

```c
P0:
flag[0] = true;
sfence();
while (lfence(), flag[1] == true) {
    if (lfence(), turn != 0) {
        flag[0] = false;
sfence();
        while (lfence(), turn != 0){
            // busy wait
        }
        flag[0] = true;
sfence();
    }
// critical section
turn = 1;
sfence();
flag[0] = false; sfence();
```

- Insert a load memory barrier `lfence()` in front of every read from common variables.
- Insert a write memory barrier `sfence()` after writing a variable that is read in the other thread.
- The `lfence()` of the first iteration of each loop may be combined with the preceding `sfence()` to an `mfence()`.

**Memory Consistency**

**The Dekker Algorithm**
Summary: Relaxed Memory Models

Highly optimized CPUs may use a relaxed memory model:

- reads and writes are not synchronized unless requested by the user
- many kinds of memory barriers exist with subtle differences

⇝ ARM, PowerPC, Alpha, ia-64, even x86 (⇝ SSE Write Combining)

⇝ memory barriers are the “lowest-level” of synchronization
Discussion

Memory barriers reside at the lowest level of synchronization primitives.

Where are they useful?

- when blocking should not de-schedule threads
- when several processes implement automata and coordinate their transitions via common synchronized variables
  ↦ protocol implementations
  ↦ OS provides synchronization facilities based on memory barriers

Why might they not be appropriate?

- difficult to get right, best suited for specific well-understood algorithms
- often synchronization with locks is as fast and easier
- too many fences are costly if store/invalidate buffers are bottleneck
Memory Models and Compilers

Before Optimization

```c
int x = 0;
for (int i=0; i<100; i++) {
    x = 1;
    printf("%d", x);
}
```

After Optimization

```c
int x = 1;
for (int i=0; i<100; i++) {
    printf("%d", x);
}
```

Standard Program Optimizations

comprises *loop-invariant code motion* and *dead store elimination*, e.g.

⚠️ having another thread executing `x = 0;` changes observable behaviour depending on optimizing or not

⇝ Compiler also depends on consistency guarantees
⇝ Demand for Memory Models on language level
Compilers may also reorder store instructions

Write barriers keep the compiler from reordering across

The specification of volatile keeps the C-Compiler from reordering memory accesses to this address

Java-Compilers even generate barriers around accesses to volatile variables
Summary

Learning Outcomes

1. Strict Consistency
2. Happened-before Relation
3. Sequential Consistency
4. The MESI Cache Model
5. TSO: FIFO store buffers
6. PSO: store buffers
7. RMO: invalidate queues
8. Reestablishing Sequential Consistency with memory barriers
9. Dekker’s Algorithm for Mutual Exclusion
Many-Core Machines’ Read Responses congest the bus

In that case: Intel’s *MESIF* (Forward) to reduce communication overhead.

⚠️ But in general, Symmetric multi-processing (SMP) has its limits:
- a memory-intensive computation may cause contention on the bus
- the speed of the bus is limited since the electrical signal has to travel to all participants
- point-to-point connections are faster than a bus, but do not provide possibility of forming consensus

⇝ use a bus locally, use point-to-point links globally: *NUMA*
  - *non-uniform memory access* partitions the memory amongst CPUs
  - a directory states which CPU holds a memory region
  - Interprocess communication between Cache-Controllers (*ccNUMA*):
    onchip on Opteron or in chipset on Itanium
Overhead of NUMA Systems

Communication overhead in a NUMA system.

- Processors in a NUMA system may be fully or partially connected.
- The directory of who stores an address is partitioned amongst processors.

A cache miss that cannot be satisfied by the local memory at $A$:
- $A$ sends a retrieve request to processor $B$ owning the directory
- $B$ tells the processor $C$ who holds the content
- $C$ sends data (or status) to $A$ and sends acknowledge to $B$
- $B$ completes transmission by an acknowledge to $A$

source: [Int09]
Intel.
An introduction to the intel quickpath interconnect.

Leslie Lamport.
Time, Clocks, and the Ordering of Events in a Distributed System.

Paul E. McKenny.
Memory Barriers: a Hardware View for Software Hackers.

Mark S. Papamarcos and Janak H. Patel.
A low overhead coherence solution for multiprocessors with private cache memories.

CORPORATE SPARC International, Inc.

CORPORATE SPARC International, Inc.
The SPARC Architecture Manual (Version 9).