BTRFS Metadata UUID (FSID change) feature
SCOPE
==================================
The most important requirement for this feature is to ensure that, while an
FSID change is in progress, a sudden power loss won’t lead to an unrecoverable
filesystem upon next boot. This document presents an analysis of the failure
modes and their automatic handling inside the kernel.
This analysis looks only at the use case where an FSID change is initiated on a
regular btrfs filesystem (i.e. one that is not a clone of an existing btrfs
filesystem).
Implementation
===============================================
Since btrfs cannot guarantee that superblock writes occur atomically across
multiple devices, it’s important to work around this deficiency. The proposal
set forth in this document is to introduce 2 flags in the btrfs superblock:
IP (In Progress) - signals that an FSID change is in progress; its purpose is
explained later.
C (Complete) - signals that the FSID change has already been completed.
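As a rough illustration of the two flags, the sketch below models a superblock
in Python and classifies it by flag combination. It is a toy model only: the
names SbFlag, Superblock and describe are hypothetical, the bit values are
arbitrary, and nothing here reflects the on-disk layout.

```python
from dataclasses import dataclass
from enum import Flag


class SbFlag(Flag):
    """Hypothetical flag names; the bit values are arbitrary."""
    NONE = 0
    IP = 1  # an FSID change is in progress
    C = 2   # an FSID change has been completed


@dataclass
class Superblock:
    fsid: str
    metadata_uuid: str
    generation: int
    flags: SbFlag = SbFlag.NONE


def describe(sb: Superblock) -> str:
    """Classify a superblock by the combination of IP and C flags."""
    in_progress = bool(sb.flags & SbFlag.IP)
    complete = bool(sb.flags & SbFlag.C)
    if in_progress and complete:
        return "changed before, new change in progress"
    if in_progress:
        return "first change in progress"
    if complete:
        return "changed"
    return "never changed"
```

The combination of both flags corresponds to the situation discussed in case 6
of the failure analysis.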
All things considered, I envision the following sequence of operations:
1. The IP flag is set in transaction X and applied to all disks.
2. Then, in transaction X+1, the C flag (this will be an incompat bit) is set,
the current FSID is copied to the metadata_uuid field and a new FSID replaces
the current fsid. At this point the operation is considered complete.
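The two-transaction sequence, and the way a power loss splits the disks into
two sets, can be sketched with a small Python model. This is an illustration
under stated assumptions, not kernel code: the names Superblock, transaction_x,
transaction_x1 and commit are hypothetical, a filesystem without a separate
metadata_uuid is modelled as metadata_uuid == fsid, and clearing IP in
transaction X+1 is an assumption inferred from case 6 of the failure analysis,
where a successfully changed disk carries only the C flag.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Superblock:
    fsid: str
    metadata_uuid: str
    generation: int
    ip: bool = False  # IP (In Progress) flag
    c: bool = False   # C (Complete) flag


def transaction_x(sb: Superblock) -> Superblock:
    """Transaction X: only set the IP flag."""
    return replace(sb, generation=sb.generation + 1, ip=True)


def transaction_x1(sb: Superblock, new_fsid: str) -> Superblock:
    """Transaction X+1: set C, copy the current FSID to metadata_uuid and
    install the new FSID. Assumption: IP is cleared in the same step."""
    return replace(sb, generation=sb.generation + 1, fsid=new_fsid,
                   metadata_uuid=sb.fsid, ip=False, c=True)


def commit(disks: List[Superblock],
           update: Callable[[Superblock], Superblock],
           crash_after: Optional[int] = None) -> List[Superblock]:
    """Write the updated superblock to each disk in turn. If crash_after
    is given, power is lost after that many disks have been written."""
    n = len(disks) if crash_after is None else crash_after
    return [update(sb) if i < n else sb for i, sb in enumerate(disks)]


disks = [Superblock("OLD", "OLD", 10) for _ in range(3)]
after_x = commit(disks, transaction_x)
# Power loss during transaction X+1: only the first disk is updated.
torn = commit(after_x, lambda sb: transaction_x1(sb, "NEW"), crash_after=1)
```

In this run torn[0] plays the role of a disk in set P (new fsid, C flag, higher
generation), while torn[1] and torn[2] are in set Q (old fsid, IP flag).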
Power loss can happen at any time before, during or after either transaction X
or transaction X+1. Let’s look at the failures in each of those cases:
1. Failure before transaction X is committed - not a problem since nothing is
committed on disk; the user just has to restart the operation via btrfstune.
2. Failure during transaction X - this means that transaction X will be
persisted on N out of M (N < M) disks for a filesystem consisting of M disks.
This will in essence create 2 partitions P and Q, where disks in P will have
the IP flag and disks in Q won’t, and the superblock generation of disks in P
will be greater than that of disks in Q. Since disks in btrfs are scanned in
arbitrary order, a disk from either P or Q can be scanned first and hence
create the initial fs_devices structure representing the filesystem. This
needs to be handled.
2.1 Let’s consider the first case, where a disk from partition P is
scanned first. In this case the scan code will detect that IP is set,
but since the existing FSID is not changed, the fs_devices struct will be
created with the old FSID identifying the filesystem. All other disks in P
that are scanned will follow the same logic, and if they detect an existing
fs_devices struct they will just be added to it. Their IP flags will be
cleared upon the next transaction commit. Following this, when disks from
partition Q are scanned they will just find the existing fs_devices and be
added to it, and the IP flag will be cleared on mount. Additionally, the
fs_devices created will be marked as having been created from disks with the
IP flag set. This is required in order to handle cases 4.1.2 and 4.2.2 (see
below).
2.2 The second case to consider is a disk from partition Q being
scanned first. In this case fs_devices will be created as usual (using
the unchanged FSID), then all other Q disks will be added to this
fs_devices and their IP flag is going to be cleared. On the other hand,
all disks from partition P will detect their IP flag, and since the FSID is
not changed they can just scan the existing fs_devices for one which matches
their FSID. One important thing to heed is that the btrfs mount code always
uses the superblock from the device with the highest generation. Disks in P
will have a higher generation number, so when the filesystem is mounted a
disk in set P will be used. In order to preserve consistency the IP flag
must be cleared.
3. Failure after transaction X/before transaction X+1 - this will be handled
the same way as in case 2, i.e. the IP flag will be cleared on mount and the
filesystem will be created with the unchanged FSID.
4. Failure during transaction X+1 - when such a failure happens the
filesystem in question will be partitioned into two sets P and Q.
Q would have just the old fsid and the IP flag; also, the superblock generation
number of disks in P will be higher than that of disks in Q.
There are two cases for P here, according to the new FSID value:
4.1 The new fsid value written to P differs from the old metadata_uuid value.
Here P would have the C flag and a new value for fsid written to it, as
well as the old FSID value written in the ‘metadata_uuid’ field.
Again, two cases need to be handled:
4.1.1 The first one is a disk from partition P being scanned initially.
This means that fs_devices will be created with differing
fsid/metadata_uuid values.
Subsequently, when other disks from P are scanned they will just
be added as normal. The interesting case is when a disk from partition
Q is scanned. In this case the code should see that the superblock in
question has the IP flag set and at the same time will see that there
is already an fs_devices struct which has differing metadata_uuid/fsid
members AND whose metadata_uuid matches the FSID of the disk in Q. In this
case the disk will be added to the filesystem. The same logic applies to the
rest of the disks in partition Q; upon the next transaction commit their SB
will be overwritten with the correct information.
4.1.2 The second one is a disk from partition Q being scanned initially.
Disks in Q are handled the same way as in scenario 2.1. The main
difference is how disks in P are handled. When such disks are scanned
they will have differing metadata_uuid/fsid values; however, they will
detect that an fs_devices exists such that the metadata_uuid of the disk
matches the metadata_uuid/fsid of the fs_devices AND the IP flag is set in
the fs_devices, indicating that it was created from a superblock that had
the IP flag set. This will be an indication to the scanning code that it
needs to add the disk. Additionally, since disks in P will have a higher
superblock generation number than disks in Q, upon adding such a disk to
an fs_devices it needs to overwrite the fsid/metadata_uuid details, since
upon mount a disk in P will be used.
4.2 The new fsid value written to P is the same as the old metadata_uuid
value. Now P's 'metadata_uuid' field is cleared and the only field to
identify by is 'fsid'. P has neither the C flag nor the IP flag.
Two cases need to be handled:
4.2.1 The first one is a disk from partition P being scanned initially.
This means that fs_devices will be created with the same
fsid/metadata_uuid values. Subsequently, when other disks from P
are scanned they will just be added as normal.
The interesting case is when a disk from partition Q is scanned.
In this case the code should see that the superblock in
question has the IP flag set and at the same time will see that there
is already an fs_devices struct which has the same metadata_uuid/fsid
members AND whose metadata_uuid matches the FSID of the disk in Q. In this
case the disk will be added to the filesystem. The same logic applies to the
rest of the disks in partition Q; upon the next transaction commit their SB
will be overwritten with the correct information.
4.2.2 The second one is a disk from partition Q being scanned initially.
Disks in Q are handled the same way as in scenario 2.1. The main
difference is how disks in P are handled. When such disks are
scanned, the existing fs_devices will have differing metadata_uuid/fsid
values. However, they will detect that an fs_devices exists such that
the fsid of the disk matches the metadata_uuid of the fs_devices
AND the IP flag is set in the fs_devices, indicating that it was created
from a superblock that had the IP flag set.
This will be an indication to the scanning code that it needs to
add the disk. Additionally, since disks in P will have a higher
superblock generation number than disks in Q, upon adding such a disk to
an fs_devices it needs to overwrite the fsid/metadata_uuid details, since
upon mount a disk in P will be used.
5. Failure after transaction X+1 - the filesystem is already changed, so
no special handling is needed.
6. It is possible for a superblock to have both the IP and C flags set.
This can occur in a situation where a disk has undergone a successful
metadata uuid change, so it has the C flag correctly set. When a
subsequent FSID change is initiated on such a filesystem, the IP flag will
be set as part of the new operation. If transaction X then fails, we can end
up with the disks comprising this filesystem partitioned into two sets P and Q.
Disks in P will have both C and IP flags set and disks in Q will have only
C flag and their fsid will be different than that of disks in Q. Disks in
both partitions will have an identical metadata_uuid. The 2 cases shall be
handled as follows:
6.1 When a disk in P comes up it should scan the existing registered
fs_devices for one which has differing fsid/metadata_uuid AND whose
metadata_uuid is equal to that of the disk, and should join it. If none
such is found then a new fs_devices should be created with the IP flag
set. Subsequently, as disks in Q come in, they should scan for an
fs_devices that has the IP flag set, whose metadata_uuid equals that of
the scanned disk and whose fsid/metadata_uuid members differ. This will
be a sign that this disk must be joined to the respective fs_devices
and must also replace the fsid/metadata_uuid in struct btrfs_fs_devices
with its own values.
6.2 When a disk in Q is scanned first, it will create fs_devices as
usual. Then, as disks from P appear, they will scan the existing
fs_devices for one with a matching metadata_uuid and differing
metadata_uuid/fsid members, and should join that fs_devices
collection.
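The scan-time matching described for cases 2 and 4.1 can be modelled in a few
lines of Python. This is a simplified sketch, not the kernel implementation:
Superblock, FsDevices, scan and the from_ip_disk marker are hypothetical names,
a filesystem without a separate metadata_uuid is modelled as
metadata_uuid == fsid, and cases 4.2 and 6 are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Superblock:
    fsid: str
    metadata_uuid: str
    generation: int
    ip: bool = False  # IP (In Progress) flag
    c: bool = False   # C (Complete) flag


@dataclass
class FsDevices:
    fsid: str
    metadata_uuid: str
    from_ip_disk: bool  # created from a superblock that had IP set
    disks: List[Superblock] = field(default_factory=list)


def scan(registry: List[FsDevices], sb: Superblock) -> FsDevices:
    """Add a scanned disk to a matching fs_devices, or create a new one."""
    for fs in registry:
        # Ordinary match; this also covers cases 2.1 and 2.2, where the
        # FSID is unchanged on every disk.
        same_ids = (fs.fsid, fs.metadata_uuid) == (sb.fsid, sb.metadata_uuid)
        # Case 4.1.1: a disk from Q (IP set) meets an fs_devices that was
        # created from a disk in P.
        q_joins_p = (sb.ip and fs.fsid != fs.metadata_uuid
                     and fs.metadata_uuid == sb.fsid)
        # Case 4.1.2: a disk from P meets an fs_devices that was created
        # from a disk in Q.
        p_joins_q = (sb.fsid != sb.metadata_uuid and fs.from_ip_disk
                     and fs.fsid == fs.metadata_uuid == sb.metadata_uuid)
        if p_joins_q:
            # Mount will use the higher-generation superblock from P, so
            # take over its fsid/metadata_uuid.
            fs.fsid, fs.metadata_uuid = sb.fsid, sb.metadata_uuid
        if same_ids or q_joins_p or p_joins_q:
            fs.disks.append(sb)
            return fs
    fs = FsDevices(sb.fsid, sb.metadata_uuid, from_ip_disk=sb.ip, disks=[sb])
    registry.append(fs)
    return fs
```

Feeding the same set of superblocks to scan in different orders should always
end with a single fs_devices carrying the identifiers of the disks in P, which
is the property the analysis above is after.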