forked from mit-pdos/xv6-riscv-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlock.tex
959 lines (882 loc) · 34.5 KB
/
lock.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
\chapter{Locking}
\label{CH:LOCK}
Most kernels, including xv6, interleave the execution
of multiple activities.
One source of interleaving is multiprocessor hardware:
computers with
multiple CPUs executing independently, such as xv6's RISC-V.
These multiple CPUs share physical RAM,
and xv6 exploits the sharing to maintain
data structures that all CPUs read and write.
This sharing raises the possibility of
one CPU reading a data structure while another
CPU is mid-way through updating it, or even
multiple CPUs updating the same data simultaneously;
without careful design such parallel access is likely
to yield incorrect results or a broken data structure.
Even on a uniprocessor, the kernel may switch the CPU among
a number of threads, causing their execution to be interleaved.
Finally, a device interrupt handler that modifies
the same data as some interruptible code could damage
the data if the interrupt occurs at just the wrong time.
The word
\indextext{concurrency}
refers to situations in which
multiple instruction streams are interleaved,
due to multiprocessor parallelism,
thread switching, or interrupts.
Kernels are full of concurrently-accessed data. For example, two CPUs
could simultaneously call {\tt kalloc}, thereby concurrently
popping from the head of the free list. Kernel designers like to allow
for lots of concurrency, since it can yield increased performance
through parallelism, and increased responsiveness. However, as a result
kernel designers must convince themselves of
correctness despite such concurrency. There are many ways to arrive at
correct code, some easier to reason about than others. Strategies
aimed at correctness under concurrency, and abstractions that support
them, are called \indextext{concurrency control} techniques.
Xv6 uses a number of concurrency control techniques, depending on the
situation; many more are possible. This chapter focuses on a widely
used technique: the \indextext{lock}. A lock provides mutual
exclusion, ensuring that only one CPU at a time can hold the lock. If
the programmer associates a lock with each shared data item, and the
code always holds the associated lock when using an item, then the
item will be used by only one CPU at a time. In this situation, we
say that the lock protects the data item. Although locks are an
easy-to-understand concurrency control mechanism, the downside of
locks is that they can limit performance, because they serialize
concurrent operations.
The rest of this chapter explains why xv6 needs locks, how xv6
implements them, and how it uses them.
% The following is most relevant to lock-free code:
%
% A key observation will be that if you look at some code in
% xv6, you must ask yourself if concurrent code could change
% the intended behavior of the code by modifying data (or hardware resources)
% it depends on.
% You must keep in mind that the compiler may turn a
% single C statement into several machine instructions,
% and that those instructions may execute in a way that is
% interleaved with instructions executing on other CPUs.
% That is, you cannot assume that lines of C code
% on the page are executed atomically.
% Concurrency makes reasoning about correctness difficult.
%%
\section{Races}
%%
\begin{figure}[t]
\center
\includegraphics[scale=0.8]{fig/smp.pdf}
\caption{Simplified SMP architecture}
\label{fig:smp}
\end{figure}
As an example of why we need locks, consider two processes
with exited children calling
{\tt wait} on two different
CPUs. {\tt wait} frees the child's memory.
Thus
on each CPU, the kernel will call {\tt kfree}
to free the children's memory pages. The kernel allocator maintains a linked
list: \lstinline{kalloc()} \lineref{kernel/kalloc.c:/^kalloc/} pops a
page of memory from a list of free pages, and \lstinline{kfree()}
\lineref{kernel/kalloc.c:/^kfree/} pushes a page onto the free list.
For best performance, we might hope that the {\tt kfree}s of the two parent processes
would execute in parallel without either having to wait for the other,
but this would not be correct given xv6's {\tt kfree} implementation.
Figure~\ref{fig:smp} illustrates the setting in more detail: the
linked list of free pages is in memory that is shared by the two CPUs, which
manipulate the list using load and store instructions. (In
reality, the processors have caches, but conceptually
multiprocessor systems behave as if there were a single, shared memory.)
If there were no concurrent
requests, you might implement a list \lstinline{push} operation as
follows:
\begin{lstlisting}[numbers=left,firstnumber=1]
struct element {
int data;
struct element *next;
};
struct element *list = 0;
void
push(int data)
{
struct element *l;
l = malloc(sizeof *l);
l->data = data;
l->next = list; (*@\label{line:next}@*)
list = l; (*@\label{line:list}@*)
}
\end{lstlisting}
\begin{figure}[t]
\center
\includegraphics[scale=0.5]{fig/race.pdf}
\caption{Example race}
\label{fig:race}
\end{figure}
This implementation is correct if executed in isolation.
However, the code is not correct if more than one
copy executes concurrently.
If two CPUs execute
\lstinline{push}
at the same time,
both might execute line~\ref{line:next} as shown in Fig~\ref{fig:smp},
before either executes line~\ref{line:list}, which results
in an incorrect outcome as illustrated by Figure~\ref{fig:race}.
There would then be two
list elements with
\lstinline{next}
set to the former value of
\lstinline{list}.
When the two assignments to
\lstinline{list}
happen at line ~\ref{line:list},
the second one will overwrite the first;
the element involved in the first assignment
will be lost.
The lost update at line~\ref{line:list} is an example of a
\indextext{race}.
A race is a situation in which a memory location is accessed
concurrently, and at least one access is a write.
A race is often a sign of a bug, either a lost update
(if the accesses are writes) or a read of
an incompletely-updated data structure.
The outcome of a race depends on
the machine code generated by the compiler,
the timing of the two CPUs involved,
and how their memory operations are ordered by the memory system,
which can make race-induced errors difficult to reproduce
and debug.
For example, adding print statements while debugging
\lstinline{push}
might change the timing of the execution enough
to make the race disappear.
The usual way to avoid races is to use a lock.
Locks ensure
\indextext{mutual exclusion},
so that only one CPU at a time can execute
the sensitive lines of
\lstinline{push};
this makes the scenario above impossible.
The correctly locked version of the above code
adds just a few lines (highlighted in yellow):
\begin{lstlisting}[numbers=left,firstnumber=6]
struct element *list = 0;
(*@\hl{struct lock listlock;}@*)
void
push(int data)
{
struct element *l;
l = malloc(sizeof *l); (*@\label{line:malloc}@*)
l->data = data;
(*@\hl{acquire(\&listlock);} @*)
l->next = list; (*@\label{line:next1}@*)
list = l; (*@\label{line:list1}@*)
(*@\hl{release(\&listlock)}; @*)
}
\end{lstlisting}
The sequence of instructions between
\lstinline{acquire}
and
\lstinline{release}
is often called a
\indextext{critical section}.
The lock is typically said to be protecting
\lstinline{list}.
When we say that a lock protects data, we really mean
that the lock protects some collection of invariants
that apply to the data.
Invariants are properties of data structures that
are maintained across operations.
Typically, an operation's correct behavior depends
on the invariants being true when the operation
begins. The operation may temporarily violate
the invariants but must reestablish them before
finishing.
For example, in the linked list case, the invariant is that
\lstinline{list}
points at the first element in the list
and that each element's
\lstinline{next}
field points at the next element.
The implementation of
\lstinline{push}
violates this invariant temporarily: in line~\ref{line:next1},
\lstinline{l}
points
to the next list element, but
\lstinline{list}
does not point at
\lstinline{l}
yet (reestablished at line~\ref{line:list1}).
The race we examined above
happened because a second CPU executed
code that depended on the list invariants
while they were (temporarily) violated.
Proper use of a lock ensures that only one CPU at a time
can operate on the data structure in the critical section, so that
no CPU will execute a data structure operation when the
data structure's invariants do not hold.
You can think of a lock as
\indextext{serializing}
concurrent critical sections so that they run one at a time,
and thus preserve invariants (assuming the critical sections
are correct in isolation).
You can also think of critical sections guarded by the same lock as being
atomic with respect to each other,
so that each sees only the complete set of
changes from earlier critical sections, and never sees
partially-completed updates.
Though useful for correctness, locks inherently
limit performance. For example, if two processes call {\tt kfree}
concurrently, the locks will serialize the two critical sections,
so that there is no
benefit from running them on different CPUs. We say that multiple
processes \indextext{conflict} if they want the same lock at the same
time, or that the lock experiences \indextext{contention}. A major
challenge in kernel design is avoidance of lock contention
in pursuit of parallelism. Xv6 does
little of that, but sophisticated kernels organize
data structures and algorithms specifically to avoid lock contention. In the list
example, a kernel may maintain a separate free list per CPU and only touch
another CPU's free list if the current CPU's list is empty and it must steal
memory from another CPU. Other use cases may require more
complicated designs.
The placement of locks is also important for performance.
For example, it would be correct to move
\lstinline{acquire}
earlier in
\lstinline{push},
before line~\ref{line:malloc}.
But this would likely reduce performance because then the calls
to
\lstinline{malloc}
would be serialized.
The section ``Using locks'' below provides some guidelines for where to insert
\lstinline{acquire}
and
\lstinline{release}
invocations.
%%
\section{Code: Locks}
%%
Xv6 has two types of locks: spinlocks and sleep-locks.
We'll start with spinlocks.
Xv6 represents a spinlock as a
\indexcode{struct spinlock}
\lineref{kernel/spinlock.h:/struct.spinlock/}.
The important field in the structure is
\lstinline{locked},
a word that is zero when the lock is available
and non-zero when it is held.
Logically, xv6 should acquire a lock by executing code like
\begin{lstlisting}[numbers=left,firstnumber=21]
void
acquire(struct spinlock *lk) // does not work!
{
for(;;) {
if(lk->locked == 0) { (*@\label{line:testlocked}@*)
lk->locked = 1; (*@\label{line:assign}@*)
break;
}
}
}
\end{lstlisting}
Unfortunately, this implementation does not
guarantee mutual exclusion on a multiprocessor.
It could happen that two CPUs simultaneously
reach line~\ref{line:testlocked}, see that
\lstinline{lk->locked}
is zero, and then both grab the lock by executing line~\ref{line:assign}.
At this point, two different CPUs hold the lock,
which violates the mutual exclusion property.
What we need is a way to
make lines \ref{line:testlocked} and \ref{line:assign} execute as an
\indextext{atomic}
(i.e., indivisible) step.
Because locks are widely used,
multi-core processors usually provide instructions that
implement an atomic version of
lines~\ref{line:testlocked} and \ref{line:assign}.
On the RISC-V this instruction is
\lstinline{amoswap r, a}.
\lstinline{amoswap}
reads the value at the memory address {\tt a},
writes the contents of register {\tt r} to that address,
and puts the value it read into {\tt r}.
That is, it swaps the contents of the register and the memory address.
It performs this sequence atomically, using special
hardware to prevent any
other CPU from using the memory address between the read and the write.
Xv6's
\indexcode{acquire}
\lineref{kernel/spinlock.c:/^acquire/}
uses the portable C library call
\lstinline{__sync_lock_test_and_set},
which boils down to the
\lstinline{amoswap}
instruction;
the return value is the old (swapped) contents of
\lstinline{lk->locked}.
The
\lstinline{acquire}
function wraps the swap in a loop, retrying (spinning) until it has
acquired the lock.
Each iteration swaps one into
\lstinline{lk->locked}
and checks the previous value;
if the previous value is zero, then we've acquired the
lock, and the swap will have set
\lstinline{lk->locked}
to one.
If the previous value is one, then some other CPU
holds the lock, and the fact that we atomically swapped one into
\lstinline{lk->locked}
didn't change its value.
Once the lock is acquired,
\lstinline{acquire}
records, for debugging, the CPU
that acquired the lock.
The
\lstinline{lk->cpu}
field is protected by the lock
and must only be changed while holding the lock.
The function
\indexcode{release}
\lineref{kernel/spinlock.c:/^release/}
is the opposite of
\lstinline{acquire}:
it clears the
\lstinline{lk->cpu}
field
and then releases the lock.
Conceptually, the release just requires assigning zero to
\lstinline{lk->locked}.
The C standard allows compilers to implement an assignment
with multiple store instructions,
so a C assignment might be non-atomic with respect
to concurrent code.
Instead,
\lstinline{release}
uses the C library function
\lstinline{__sync_lock_release}
that performs an atomic assignment.
This function also boils down to a RISC-V
\lstinline{amoswap}
instruction.
%%
\section{Code: Using locks}
%%
Xv6 uses locks in many places to avoid races.
As described above,
\lstinline{kalloc}
\lineref{kernel/kalloc.c:/^kalloc/}
and
\lstinline{kfree}
\lineref{kernel/kalloc.c:/^kfree/}
form a good example.
Try Exercises 1 and 2 to see what happens if those
functions omit the locks.
You'll likely find that it's difficult to trigger incorrect
behavior, suggesting that it's hard to reliably test whether code
is free from locking errors and races.
Xv6 may well have as-yet-undiscovered races.
A hard part about using locks is deciding how many locks
to use and which data and invariants each lock should protect.
There are a few basic principles.
First, any time a variable can be written by one CPU
at the same time that another CPU can read or write it,
a lock should be used to keep the two
operations from overlapping.
Second, remember that locks protect invariants:
if an invariant involves multiple memory locations,
typically all of them need to be protected
by a single lock to ensure the invariant is maintained.
The rules above say when locks are necessary but say nothing about when locks
are unnecessary, and it is important for efficiency not to lock too much,
because locks reduce parallelism. If parallelism isn't important, then one
could arrange to have only a single thread and not worry about locks. A simple
kernel can do this on a multiprocessor by having a single lock that must be
acquired on entering the kernel and released on exiting the kernel (though
blocking system calls such as pipe reads or
\lstinline{wait}
would pose a problem). Many uniprocessor operating systems have been converted to
run on multiprocessors using this approach, sometimes called a ``big
kernel lock,'' but the approach sacrifices parallelism: only one
CPU can execute in the kernel at a time. If the kernel does any heavy
computation, it would be more efficient to use a larger set of more
fine-grained locks, so that the kernel could execute on multiple CPUs
simultaneously.
As an example of coarse-grained locking, xv6's \lstinline{kalloc.c}
allocator has a single free list protected by a single lock. If
multiple processes on different CPUs try to allocate pages at the same
time, each will have to wait for its turn by spinning in {\tt
acquire}. Spinning wastes CPU time, since it's not useful work.
If contention for the lock wasted a significant fraction of CPU time,
perhaps performance could be improved by changing the allocator design
to have multiple free lists, each with its own lock, to allow truly
parallel allocation.
As an example of fine-grained locking, xv6 has a separate lock for
each file, so that processes that manipulate different files can often
proceed without waiting for each other's locks. The file locking
scheme could be made even more fine-grained if one wanted to allow
processes to simultaneously write different areas of the same file.
Ultimately lock granularity decisions need to be driven by performance
measurements as well as complexity considerations.
As subsequent chapters explain each part of xv6, they
will mention examples of xv6's use of locks
to deal with concurrency.
As a preview,
Figure~\ref{fig:locktable}
lists all of the locks in xv6.
\begin{figure}[t]
\center
\begin{tabular}{ll}
{\bf Lock} & {\bf Description} \\
\midrule
bcache.lock & Protects allocation of block buffer cache entries \\
cons.lock & Serializes access to console hardware, avoids intermixed output \\
ftable.lock & Serializes allocation of a struct file in file table \\
itable.lock & Protects allocation of in-memory inode entries \\
vdisk\_lock & Serializes access to disk hardware and queue of DMA descriptors \\
kmem.lock & Serializes allocation of memory \\
log.lock & Serializes operations on the transaction log \\
pipe's pi->lock & Serializes operations on each pipe \\
pid\_lock & Serializes increments of next\_pid \\
proc's p->lock & Serializes changes to process's state \\
wait\_lock & Helps wait avoid lost wakeups \\
tickslock & Serializes operations on the ticks counter \\
inode's ip->lock & Serializes operations on each inode and its content \\
buf's b->lock & Serializes operations on each block buffer \\
\end{tabular}
\caption{Locks in xv6}
\label{fig:locktable}
\end{figure}
%%
\section{Deadlock and lock ordering}
%%
If a code path through the kernel must hold several locks at the same time, it is
important that all code paths acquire those locks in the same order. If
they don't, there is a risk of \indextext{deadlock}. Let's say two code paths in
xv6 need locks A and B, but code path 1 acquires locks in the order A
then B, and the other path acquires them in the order B then A.
Suppose thread T1 executes code path 1 and acquires lock A,
and thread T2 executes code path 2 and acquires lock B.
Next T1 will try to acquire lock B, and T2 will try to acquire lock A.
Both acquires will block indefinitely, because in both cases the
other thread holds the needed lock, and won't release it until
its acquire returns.
To avoid such deadlocks, all code paths must acquire
locks in the same order. The need for a global lock acquisition order
means that locks are effectively part of each function's specification:
callers must invoke functions in a way that causes locks to be acquired
in the agreed-on order.
Xv6 has many lock-order chains of length two involving
per-process locks
(the lock in each
\lstinline{struct proc})
due to the way that
\lstinline{sleep}
works (see Chapter~\ref{CH:SCHED}).
For example,
\lstinline{consoleintr}
\lineref{kernel/console.c:/^consoleintr/}
is the interrupt routine which handles typed characters.
When a newline arrives, any process that is waiting for
console input should be woken up.
To do this,
\lstinline{consoleintr}
holds
\lstinline{cons.lock}
while calling
\indexcode{wakeup},
which acquires
the waiting process's lock in order to wake it up.
In consequence, the global deadlock-avoiding
lock order includes the rule that
\lstinline{cons.lock}
must be acquired before any process lock.
The file-system code contains xv6's longest lock chains.
For example, creating a file requires simultaneously
holding a lock on the directory, a lock on the new file's inode,
a lock on a disk block buffer,
the disk driver's \lstinline{vdisk_lock},
and
the calling process's \lstinline{p->lock}.
To avoid deadlock, file-system code always acquires locks in the order
mentioned in the previous sentence.
Honoring a global deadlock-avoiding order can be surprisingly
difficult. Sometimes the lock order conflicts with logical program
structure, e.g., perhaps code module M1 calls module M2, but the lock
order requires that a lock in M2 be acquired before a lock in M1.
Sometimes the identities of locks aren't known in advance, perhaps
because one lock must be held in order to discover the identity
of the lock to be acquired next. This kind of situation arises in the
file system as it looks up successive components in a path name, and
in the code for {\tt wait} and {\tt exit} as they search the table of
processes looking for child processes. Finally, the danger of deadlock
is often a constraint on how fine-grained one can make a locking
scheme, since more locks often means more opportunity for deadlock.
The need to avoid deadlock is often a major factor in kernel
implementation.
\section{Re-entrant locks}
It might appear that some deadlocks and lock-ordering challenges could
be avoided by using \indextext{re-entrant locks}, which are also
called \indextext{recursive locks}. The idea is that if the lock is held
by a process and if that process attempts to acquire the lock again,
then the kernel could just allow this (since the process already has
the lock), instead of calling panic, as the xv6 kernel does.
It turns out, however, that re-entrant locks make it harder to reason
about concurrency: re-entrant locks break the intuition that locks
cause critical sections to be atomic with respect to other critical
sections. Consider the following two functions \lstinline{f} and
\lstinline{g}:
\begin{lstlisting}
struct spinlock lock;
int data = 0; // protected by lock
f() {
acquire(&lock);
if(data == 0){
call_once();
h();
data = 1;
}
release(&lock);
}
g() {
aquire(&lock);
if(data == 0){
call_once();
data = 1;
}
release(&lock);
}
\end{lstlisting}
Looking at this code fragment, the intuition is that
\lstinline{call_once} will be called
only once: either by \lstinline{f}, or by \lstinline{g}, but not by
both.
But if re-entrant locks are allowed, and \lstinline{h} happens to call
\lstinline{g}, \lstinline{call_once}
will be called \emph{twice}.
If re-entrant locks aren't allowed, then \lstinline{h} calling
\lstinline{g} results in a deadlock, which is not great either. But,
assuming it would be a serious error to call
\lstinline{call_once}, a deadlock is
preferable. The kernel developer will observe the deadlock (the
kernel panics) and can fix the code to avoid it, while calling
\lstinline{call_once} twice may
silently result in an error that is difficult to track down.
For this reason, xv6 uses the simpler to understand non-re-entrant
locks. As long as programmers keep the locking rules in mind,
however, either approach can be made to work. If xv6 were to use
re-entrant locks, one would have to modify \lstinline{acquire} to
notice that the lock is currently held by the calling thread. One
would also have to add a count of nested acquires to struct spinlock,
in similar style to \lstinline{push_off}, which is discussed next.
\section{Locks and interrupt handlers}
\label{s:lockinter}
%%
Some xv6 spinlocks protect data that is used by
both threads and interrupt handlers.
For example, the
\lstinline{clockintr}
timer interrupt handler might increment
\indexcode{ticks}
\lineref{kernel/trap.c:/^clockintr/}
at about the same time that a kernel
thread reads
\lstinline{ticks}
in
\indexcode{sys_sleep}
\lineref{kernel/sysproc.c:/ticks0.=.ticks/}.
The lock
\indexcode{tickslock}
serializes the two accesses.
The interaction of spinlocks and interrupts raises a potential danger.
Suppose
\indexcode{sys_sleep}
holds
\indexcode{tickslock},
and its CPU is interrupted by a timer interrupt.
\lstinline{clockintr}
would try to acquire
\lstinline{tickslock},
see it was held, and wait for it to be released.
In this situation,
\lstinline{tickslock}
will never be released: only
\lstinline{sys_sleep}
can release it, but
\lstinline{sys_sleep}
will not continue running until
\lstinline{clockintr}
returns.
So the CPU will deadlock, and any code
that needs either lock will also freeze.
To avoid this situation, if a spinlock is used by an interrupt handler,
a CPU must never hold that lock with interrupts enabled.
Xv6 is more conservative: when a CPU acquires any
lock, xv6 always disables interrupts on that CPU.
Interrupts may still occur on other CPUs, so
an interrupt's
\lstinline{acquire}
can wait for a thread to release a spinlock; just not on the same CPU.
Xv6 re-enables interrupts when a CPU holds no spinlocks; it must
do a little book-keeping to cope with nested critical sections.
\lstinline{acquire}
calls
\indexcode{push_off}
\lineref{kernel/spinlock.c:/^push_off/}
and
\lstinline{release}
calls
\indexcode{pop_off}
\lineref{kernel/spinlock.c:/^pop_off/}
to track the nesting level of locks on the current CPU.
When that count reaches zero,
\lstinline{pop_off}
restores the interrupt enable state that existed
at the start of the outermost critical section.
The
\lstinline{intr_off}
and
\lstinline{intr_on}
functions execute RISC-V instructions to disable and enable
interrupts, respectively.
It is important that
\indexcode{acquire}
call
\lstinline{push_off}
strictly before setting
\lstinline{lk->locked}
\lineref{kernel/spinlock.c:/sync_lock_test_and_set/}.
If the two were reversed, there would be
a brief window when the lock
was held with interrupts enabled, and
an unfortunately timed interrupt would deadlock the system.
Similarly, it is important that
\indexcode{release}
call
\indexcode{pop_off}
only after
releasing the lock
\lineref{kernel/spinlock.c:/sync_lock_release/}.
%%
\section{Instruction and memory ordering}
%%
It is natural to think of programs executing in the order
in which source code statements appear.
That's a reasonable mental model for single-threaded
code, but is incorrect when
multiple threads interact through shared memory~\cite{riscv:user,boehm04}.
One reason is that compilers emit load and store
instructions in orders different from those implied
by the source code, and may entirely omit them
(for example by caching data in registers).
Another reason is that the CPU may execute instructions
out of order to increase performance.
For
example, a CPU may notice that in a serial sequence of
instructions A and B are not dependent on each other.
The CPU may start instruction B first, either because its
inputs are ready before A's inputs, or in order to overlap
execution of A and B.
As an example of what could go wrong,
in this code for
\lstinline{push},
it would be a disaster if the compiler or CPU moved the
store corresponding to
line~\ref{line:next2} to a point after the
\lstinline{release}
on line~\ref{line:release}:
\begin{lstlisting}[numbers=left,firstnumber=1]
l = malloc(sizeof *l);
l->data = data;
acquire(&listlock);
l->next = list; (*@\label{line:next2}@*)
list = l;
release(&listlock); (*@\label{line:release}@*)
\end{lstlisting}
If such a re-ordering occurred, there would be a window during
which another CPU could acquire the lock and
observe the updated
\lstinline{list},
but see an uninitialized
\lstinline{list->next}.
The good news is that compilers and CPUs help concurrent programmers
by following a set of rules called the \indextext{memory model}, and
by providing some primitives to help programmers control re-ordering.
To tell the hardware and compiler not to re-order,
xv6 uses
\lstinline{__sync_synchronize()}
in both
\lstinline{acquire} \lineref{kernel/spinlock.c:/^acquire/}
and
\lstinline{release} \lineref{kernel/spinlock.c:/^release/}.
\lstinline{__sync_synchronize()}
is a \indextext{memory barrier}:
it tells the compiler and CPU to not reorder loads or stores across the
barrier.
The barriers in xv6's
\lstinline{acquire}
and
\lstinline{release}
force order in almost all cases where it matters,
since xv6 uses locks around accesses to shared data.
Chapter~\ref{CH:LOCK2} discusses a few exceptions.
%%
\section{Sleep locks}
%%
% this material refers a lot to sleep, yield, etc, so maybe it should
% be later, after sleep/wakeup in sched.tex.
% maybe in lock2.tex
Sometimes xv6 needs to hold a lock for a long time. For example, the
file system (Chapter~\ref{CH:FS}) keeps a file locked while reading
and writing its content on the disk, and these disk operations can
take tens of milliseconds. Holding a spinlock that long would lead to
waste if another process wanted to acquire it, since the acquiring
process would waste CPU for a long time while spinning. Another
drawback of spinlocks is that a process cannot yield the CPU while
retaining a spinlock; we'd like to do this so that other processes can
use the CPU while the process with the lock waits for the disk.
Yielding while holding a spinlock is illegal because it might
lead to deadlock if a second thread then tried to acquire the spinlock;
since {\tt acquire} doesn't yield the CPU, the second thread's
spinning might
prevent the first thread from running and releasing the lock.
% it's true that timer interrupts might force a switch from the
% second back to the first thread -- but not if the second
% already holds one or more locks.
% it's also true that the danger is most obvious on a uniprocessor,
% but it can also arise with a longer chain of threads on
% a multiprocessor.
Yielding while holding a lock would also violate
the requirement that interrupts must be off while a spinlock is held.
Thus we'd like a type of lock that yields the CPU while waiting to
acquire, and allows yields (and interrupts) while the lock is held.
Xv6 provides such locks in the form of
\indextext{sleep-locks}.
\lstinline{acquiresleep}
\lineref{kernel/sleeplock.c:/^acquiresleep/}
yields the CPU while waiting,
using techniques that will be explained in
Chapter~\ref{CH:SCHED}.
At a high level, a sleep-lock has a
\lstinline{locked}
field that is protected by a spinlock, and
\lstinline{acquiresleep} 's
call to
\lstinline{sleep}
atomically yields the CPU and releases the spinlock.
The result is that other threads can execute while
\lstinline{acquiresleep}
waits.
Because sleep-locks leave interrupts enabled, they cannot be
used in interrupt handlers.
Because
\lstinline{acquiresleep}
may yield the CPU,
sleep-locks cannot be used inside spinlock critical
sections (though spinlocks can be used inside sleep-lock
critical sections).
Spin-locks are best suited to short critical sections,
since waiting for them wastes CPU time;
sleep-locks work well for lengthy operations.
%%
\section{Real world}
%%
Programming with locks remains challenging despite years of research
into concurrency primitives and parallelism.
It is often best to conceal locks within
higher-level constructs like synchronized queues, although xv6 does not
do this. If you program with locks, it is wise to use a tool that attempts to
identify races, because it is easy to miss an invariant that requires
a lock.
Most operating systems support POSIX threads (Pthreads), which allow a
user process to have several threads running concurrently on different
CPUs. Pthreads has support for user-level locks, barriers, etc.
Pthreads also allows a programmer to optionally specify that a lock
should be re-entrant.
Supporting Pthreads at user level requires support from the operating
system. For example, it should be the case that if one pthread blocks
in a system call, another pthread of the same process should be able
to run on that CPU. As another example, if a pthread changes its
process's address space (e.g., maps or unmaps memory), the kernel must
arrange that other CPUs that run threads of the same process update
their hardware page tables to reflect the change in the address space.
It is possible to implement locks without atomic
instructions~\cite{lamport:bakery}, but it is expensive, and most
operating systems use atomic instructions.
Locks can be expensive if many CPUs try to acquire the same lock
at the same time. If one CPU has a lock
cached in its local cache, and another CPU must acquire the lock, then the
atomic instruction to update the cache line that holds the lock must move the line
from the one CPU's cache to the other CPU's cache, and perhaps
invalidate any other copies of the cache line. Fetching a cache line from
another CPU's cache can be orders of magnitude more expensive than
fetching a line from a local cache.
To avoid the expenses associated with locks, many operating systems
use lock-free data structures and algorithms~\cite{herlihy:art,mckenney:rcuusage}.
For example, it is possible to implement a linked list like the one in
the beginning of the chapter that requires no locks during list
searches, and one atomic instruction to insert an item in a list.
Lock-free programming is more complicated, however, than programming
locks; for example, one must worry about instruction and memory
reordering. Programming with locks is already hard, so xv6 avoids the
additional complexity of lock-free programming.
%%
\section{Exercises}
%%
\begin{enumerate}
\item Comment out the calls to
\lstinline{acquire}
and
\lstinline{release}
in
\lstinline{kalloc}
\lineref{kernel/kalloc.c:/^kalloc/}.
This seems like it should cause problems for
kernel code that calls
\lstinline{kalloc};
what symptoms do you expect to see?
When you run xv6, do you see these symptoms?
How about when running
\lstinline{usertests}?
If you don't see a problem, why not?
See if you can provoke a problem by inserting
dummy loops into the critical section of
\lstinline{kalloc}.
\item Suppose that you instead commented out the
locking in
\lstinline{kfree}
(after restoring locking in
\lstinline{kalloc}).
What might now go wrong? Is lack of locks in
\lstinline{kfree}
less harmful than in
\lstinline{kalloc}?
\item If two CPUs call
\lstinline{kalloc}
at the same time, one will have to wait for the other,
which is bad for performance.
Modify
\lstinline{kalloc.c}
to have more parallelism, so that simultaneous
calls to
\lstinline{kalloc}
from different CPUs can proceed without waiting for each other.
\item Write a parallel program using POSIX threads, which is supported on most
operating systems. For example, implement a parallel hash table and measure if
the number of puts/gets scales with increasing number of cores.
\item Implement a subset of Pthreads in xv6. That is, implement a user-level
thread library so that a user process can have more than 1 thread and arrange
that these threads can run in parallel on different CPUs. Come up with a
design that correctly handles a thread making a blocking system call and
changing its shared address space.
\end{enumerate}
% LocalWords: CPUs uniprocessor interruptible allocator kalloc kfree
% LocalWords: struct malloc sizeof listlock invariants spinlock lk
% LocalWords: amoswap cpu bcache ftable itable inode vdisk DMA kmem
% LocalWords: pid proc's tickslock inode's ip buf's proc SCHED sys
% LocalWords: consoleintr wakeup clockintr intr acquiresleep POSIX
% LocalWords: Pthreads pthread unmaps TLB IPIs