-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathatomic.html
354 lines (292 loc) · 15.8 KB
/
atomic.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
<html>
<head>
<meta charset="UTF-8">
<title>Atomic operations</title>
<style>
h1 {
font-size: 28px;
font-style: italic;
margin-bottom: 0px;
text-align: center;
}
h2 {
padding-top: 20px;
font-size: 20px;
}
h3 {
font-size: 17px;
margin-top: 35px;
margin-bottom: 3px;
color: #202020;
}
h4 {
font-size: 16px;
margin-top: 20px;
margin-bottom: 3px;
color: #202020;
}
p {
margin-top: 8px;
margin-bottom: 8px;
}
ul {
margin-top: 10px;
margin-bottom: 10px;
}
li {
margin-bottom: 5px;
}
table {
margin-top: 10px;
margin-bottom: 10px;
border-collapse: collapse;
}
td {
border: 1px solid;
text-align: center;
padding: 5px;
}
#left_column, #right_column {
float: left;
width: 50%;
}
#left_column > div, #right_column > div {
margin: 60px;
}
.ref::before {
content: "See: ";
}
.ref {
display: block;
color: #202020;
font-style: italic;
font-size: 90%;
}
.code {
font-style: normal;
font-family: monospace;
padding: 2px;
background: #ddd;
border-radius: 2px;
/* This prevents the .code blocks from changing size when I change whether they contain <sup>-stuff at runtime */
display: inline-block;
height: 2.6ex;
}
.note {
font-style: italic;
font-size: 90%;
}
.code-table {
margin-top: 5px;
margin-left: 40px;
}
.code-table td {
vertical-align: top;
text-align: left;
border: none;
}
.code-table pre {
margin: 0px;
padding: 5px;
padding-right: 10px;
margin-right: 20px;
background: #ddd;
border-radius: 4px;
}
</style>
</head>
<body>
<h1>Atomic operations cheatsheet</h1>
<div id="left_column"><div>
<h2>What does the processor do?</h2>
<h3>Load/store reordering</h3>
In very broad strokes:
<ul>
<li>On ARM, loads and stores can be reordered arbitrarily.</li>
<li>On x64, the only allowed reordering is doing loads from later instructions before stores from earlier instructions complete.</li>
</ul>
Some examples:
<ul>
<li>
<p>Loads can be reordred to run before preceding stores.
Both on x64 and ARM it is possible to get x=0 and y=0.</p>
<table class="code-table"><tr>
<td>
Thread 1
<pre>
*a = 1;
int x = *b;
</pre>
</td>
<td>
Thread 2
<pre>
*b = 1;
int y = *a;
</pre>
</td>
</tr></table>
</li>
<li>
<p>On x64, stores are not re-ordered with other stores, loads are not re-ordered with other loads.
Thread 2 sees either x=0, y=0 or x=1, y=1.</p>
<p>On ARM, it is possible to get x=1, y=0 or x=0, y=1. A <span class="code">DMB</span> barrier is
needed between the instruction pairs in <i>both</i> threads.</p>
<table class="code-table"><tr>
<td>
Thread 1
<pre>
*a = 1;
*b = 1;
</pre>
</td>
<td>
Thread 2
<pre>
int x = *b;
int y = *a;
</pre>
</td>
</tr></table>
</li>
<li>Within one thread, dependencies are tracked, so loads and reads to the same location can't be
reoredred relative to each other.</li>
</ul>
<span class="ref">AMD x64, Vol. 2, section 7.2.; ARMv8-A, section B2.3.12; ARMv8-A, section K13; <a href="https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf">A Tutorial Introduction to the ARM and POWER Relaxed Memory Models</a>.</span>
<h3>Barriers</h3>
On x64, these barriers are available (but they are rarely needed, and I haven't found anybody posting an
example where anything except <span class="code">MFENCE</span> is needed):
<ul>
<li><span class="code">LFENCE</span> — Loads can't be reordered across the barrier.</li>
<li><span class="code">SFENCE</span> — Stores can't be reordered across the barrier.</li>
<li><span class="code">MFENCE</span> — Neither stores nor loads can be reordered across the barrier.</li>
<li>Instructions with the <span class="code">LOCK</span> prefix act as a <span class="code">MFENCE</span>.</li>
</ul>
On ARM, these barriers are available. Typically only <span class="code">DMB</span> is needed in user-space code.
<ul>
<li><span class="code">DMB</span> — Neither stores nor loads can be reorered across the barrier.</li>
<li><span class="code">DSB</span> — All stores and loads must complete before the barrier. Used so stores complete before <span class="code">WFI</span> or <span class="code">WFE</span>.</li>
<li><span class="code">ISB</span> — Flushes the pipeline. Needed when jumping to code dynamically placed in memory.</li>
<li>ARMv8-A introduces additional barriers related to speculation, tracing and profiling.</li>
</ul>
<span class="ref">AMD x64, Vol. 1, section 3.9.2.; ARM: Barrier Litmus Tests and Cookbook; <a href="https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/">Memory barriers are like source control operations</a>; <a href="https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/">Who ordered memory fences on x86?</a>.</span>
<h3>Single copy atomicity</h3>
<ul>
<li>Single aligned stores are atomic on both x64 and ARM architectures.</li>
<li>On x64 with AVX, aligned 16-byte stores are atomic.</li>
<li>On x64, single stores within an 8-byte aligned region are atomic.</li>
</ul>
<span class="ref">AMD x64, Vol. 2, section 7.3.2.; ARMv8-A, section B2.2.</span>
<h3>Special CPU instructions</h3>
<p>On x64, the <span class="code">LOCK</span> prefix allows some load-modify-store instructions to be performed atomically.
This includes
<span class="code">ADD</span>, <span class="code">SUB</span>,
<span class="code">INC</span>, <span class="code">DEC</span>,
<span class="code">AND</span>, <span class="code">OR</span>,
<span class="code">CMPCXHG</span>, and <span class="code">CMPXCHG16B</span>.
The <span class="code">XCHG</span> instruction implicitly has a lock-prefix if it accesses memory.</p>
<p>On ARM, starting with v7-M and v7-A, there are store/load-exclusive instructions, which can be used to
write load-compare-store-loops. First you load, then you modify in register, then you store, and the store
fails if somebody else stored between your load and store.
Note that store/load-exclusive instructions on their own don't introduce memory barriers.</p>
<p>On ARMv7, the instructions are <span class="code">LDREX</span> and <span class="code">STREX</span>.
There also is <span class="code">LDREXD</span>/<span class="code">STREXD</span> for 64-bit register atomic updates.
On ARMv8, they are <span class="code">LDXR</span> and <span class="code">STXR</span>.
There also is <span class="code">LDXP</span>/<span class="code">STXP</span> for 128-bit atomic updates.</p>
<p>Starting with ARMv-8, there are load-acquire instructions with <span class="code">A</span>-suffix
and store-release instructions <span class="code">L</span>-suffix. There are both regular variants
(e.g. <span class="code">LDAR</span>) and execlusive variants (e.g. <span class="code">STLXR</span>).
These act as one-way barriers. Memory accesses after an acquire-barrier can't complete before the
barrier. Memory accesses before a release-barrier must complete before the barrier.</p>
<p>Additionally, there is the processor-consistent acquire <span class="code">AP</span>-suffix (e.g.
<span class="code">LDAPR</span>). I haven't understood what this is for yet.</p>
<span class="ref">AMD x64, Vol. 3, section 1.2.5; ARMv7-M, section A3.4; ARMv8-A, section B2.3.12; <a href="https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions">Load-Acquire and Store-Release instructions</a>; <a href="https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968">Old New Thing on ARM barriers</a>.</span>
<h3>Brief ARM architecture overview</h3>
<p>In general, the memory model rules apply to all ARM architecture versions. However, the newer architectures
have more instructions to support multithreaded programming. <i>My understanding is that on older architectures,
pipelines were smaller and thus full barriers were less expensive relative to other instructions, so there was
less of a need for more advanced synchronization mechanisms.</i></p>
<table>
<tr><td rowspan="3">32-bit,<br>microcontrollers</td><td>ARMv6-M</td><td>Cortex-M0 / M0+ / M1</td></tr>
<tr><td>ARMv7-M</td><td>Cortex-M3 / M4 / M7</td></tr>
<tr><td>ARMv8-M</td><td>Cortex-M33 / M35P / M55 / M85</td></tr>
<tr><td rowspan=1>32-bit,<br>application</td><td>ARMv7-A</td><td>Cortex-A5 / A7 / A8 / A9 / A12 / A15 / A17</td></tr>
<tr><td rowspan=3>64-bit,<br>application</td><td>ARMv8-A</td><td>Cortex-A32<sup>*</sup> / A34 / A35 / A53 / A57 / A72 / A73</td></tr>
<tr><td>ARMv8.2-A</td><td>Cortex-A55 / A65 / A75 / A76 / A77 / A78</td></tr>
<tr><td>ARMv9-A</td><td>Cortex-A510 / A710 / A715</td></tr>
</table>
<p><i>*All ARMv8-A cortex cores support 64-bit, except the Cortex-A32 which is 32-bit only.</i></p>
<p>The Raspberry Pi 2 is a Cortex-A7 (ARMv7-A), the Raspberry Pi 3 is a Cortex-A53 (ARMv8-A), and the Raspberry Pi 4 is a Cortex-A72 (ARMv8-A).</p>
<span class="ref"><a href="https://en.wikipedia.org/wiki/ARM_Cortex-M#Instruction_sets">Wikipedia: Cortex-M</a> and <a href="https://en.wikipedia.org/wiki/ARM_Cortex-A#Instruction_sets">Cortex-A</a></span>
</div></div>
<div id="right_column"><div>
<h2>What does the compiler do?</h2>
<h3>Volatile</h3>
<p>In C and C++, the <span class="code">volatile</span> keyword on a variable prevents the compiler from removing stores or loads
to that variable. Commonly used to prevent the compiler from optimizing <span class="code">while (x)</span> to <span class="code">while (1)</span>.</p>
<p>This only helps prevent compiler optimizations. It does not cause any barriers or special instructions to be generated. All the
caveats of processor behaviour still apply.</p>
<h3><stdatomic.h> in C11 and <atomic> in C++11</h3>
<p>See <a href="https://en.cppreference.com/w/cpp/thread#Atomic_operations">cppreference (C++)</a> and <a href="https://en.cppreference.com/w/c/thread#Atomic_operations">cppreference (C)</a>.</p>
<p>Using these libraries prevent both compiler optimizations, insert necessary barrier instructions and generate
special instructions as appropriate. You don't need to declare variables as <span class="code">volatile</span> when
using either atomic functions or types.</p>
<p>Generally, one doesn't have to worry about specific CPU details to write correct code when using these libraries,
though knowing about CPU details can help write more performant code.</p>
<p>In C, there are <a href="file:///C:/dev/docs/cppreference/reference/en/c/language/atomic.html">atomic types</a>, e.g. <span class="code">atomic_int</span>.
For these types, load-modify-store operators (e.g. ++ or +=) are atomic. But you are not able to explicitly specify memory ordering.</p>
<p>In C++, the atomic types are defined as having overloaded operators to provide the same operations as for the types in C.
The atomic functions take <span class="code">std::atomic<T> *</span> as a parameter, so you either need to declare your
variables as such, or cast when calling atomic functions. <i>This prevents accidentally performing atomic operations on
"non-atomic" variables (non-atomic in quotes, because atomicity really is a property of the operation, not the variable).</i></p>
<p>Both in C and C++, the atomic functions allow specifying a memory order, e.g. <span class="code">memory_order_acquire</span>.
See <a href="https://preshing.com/20120913/acquire-and-release-semantics/">this blog post</a> for more details.</p>
<table class="code-table"><tr><td>
<h4>Release + Acquire</h4>
<pre>
// Thread A:
memcpy(shared_data, ...);
atomic_store_explicit(pointer, value, memory_order_release);
// Thread B:
int value = atomic_load_explicit(pointer, memory_order_acquire);
memcpy(local_buffer, shared_data);
</pre>
<p><i>All</i> memory writes from thread A are visible to thread B, since they've synchronized on the same pointer.</p>
<p>Godbolt: <a href="https://godbolt.org/z/sG7f16cT8">https://godbolt.org/z/sG7f16cT8</a></p>
<h4>Release + Consume</h4>
<p>Behaves like release+acquire, but in thread B, only loads which are dependent on the atomically loaded value
are guaranteed to see stores from thread A. Compiler and processor are free to reorder other loads/stores.</p>
<p>The idea is that this allows the compiler to skip a barrier since it knows that most processors track dependencies
when performing memory operations.</p>
<p>In practice, apparently all compilers treat <span class="code">_consume</span> like <span class="code">_acquire</span>.
See e.g. <a href="https://stackoverflow.com/a/65353549">https://stackoverflow.com/a/65353549</a></p>
</td><td>
<h4>Relaxed</h4>
<pre>
int value = atomic_load_explicit(pointer, memory_order_relaxed);
</pre>
<p>The compiler can reorder this arbitrarily, and doesn't generate any barriers</p>
<p>Since x64 and ARM have single copy atomicity, this just works as a hint to the compiler to generate a single load instruction.</p>
<h4>Acquire-release</h4>
<p><span class="code">memory_order_acq_rel</span> can be used with any read-modify-write operation to give both acquire and release semantics.</p>
<h4>Sequentially consistent</h4>
<p><span class="code">memory_order_seq_cst</span> prevents any reordering of stores/loads across the operation.</p>
<p>This is the default ordering for the non-<span class="code">_explicit</span> versions of the functions.</p>
<p><i>I think this is an oversimplification, the specification is much more complicated than this.</i></p>
</td></tr></table>
<h3>Windows</h3>
<p>Windows provides <span class="code">Interlocked</span> functions. For example <a href="https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange"><span class="code">InterlockedCompareExchange</span></a>.
Additionally, there are intrinsic, which are prefixed with an underscore, e.g. <a href="https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedincrement-intrinsic-functions?view=msvc-170"><spann class="code">_InterlockedIncrement</spann></a>.
In practice, both seem to produce identical code.</p>
<h3>Linux kernel</h3>
<p>The Linux kernel defines it's own set of primitives and helpers. Here are some links.
Use <a href="https://elixir.bootlin.com/linux/latest/source">bootlin</a> to find usage examples.</p>
<ul>
<li><a href="https://docs.kernel.org/core-api/wrappers/atomic_t.html">docs.kernel.org: Atomic types</a>, provides functions <i>simmilar</i>, but not equal to <stdatomic.h>.</li>
<li><a href="https://lwn.net/Articles/695257/">lwn.net: Atomic primitives in the kernel</a>, which also provides some history.</li>
<li><a href="https://docs.kernel.org/RCU/rcu.html">docs.kernel.org: RCU Concepts</a>, which provides protection for mostly-read variables. Also see <a href="https://lwn.net/Articles/262464/">lwn.net: What is RCU, Fundamentally?</a></li>
<li><a href="https://www.kernel.org/doc/Documentation/memory-barriers.txt">kernel.org: Memory barriers</a> and <a href="https://lwn.net/Articles/847481/">lwn.net: Lockless patterns: full memory barriers</a>.</li>
</ul>
</div>
</body>
</html>