v2 of SEALFS
Tested for Linux kernel 5.4.0-66
- Features:
Ratchet:
There is an ratchetoffset, from 0 to NRATCHET-1 (nratchet is
a mount option and autodetection to verify). Each log entry has an
associated ratchet offset, when it is zero (rekeying), there
is a new key read from the file. When it is different than
zero the key is KeyN=HMAC(KeyN-1, [ratchetoffset|NRATCHET]).
The idea is that resynchronization can happen for each block
(i.e. with the key from the file one can regenerate the whole
block of keys). Note that NRATCHET is also hashed so that adding
an entry later is forbidden.
The tools try to keep the latest key0 or keyN and recalculate lazily,
(KeyCache) in entries.c.
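A minimal user-space sketch of the derivation rule above (built on OpenSSL's
HMAC-SHA256 for illustration; the key size, the message layout and the hash
are assumptions, not necessarily what the module or entries.c use):

    #include <stdint.h>
    #include <string.h>
    #include <openssl/evp.h>
    #include <openssl/hmac.h>

    enum { KEYSIZE = 32 };          /* assumed key size (FPR_SIZE in the real code) */

    /* KeyN = HMAC(KeyN-1, ratchetoffset|NRATCHET), updating the key in place.
     * Hashing NRATCHET too is what forbids extending a block later. */
    void ratchet_once(unsigned char key[KEYSIZE], uint64_t roff, uint64_t nratchet)
    {
            unsigned char msg[2 * sizeof(uint64_t)];
            unsigned char out[EVP_MAX_MD_SIZE];
            unsigned int outlen;

            memcpy(msg, &roff, sizeof roff);
            memcpy(msg + sizeof roff, &nratchet, sizeof nratchet);
            HMAC(EVP_sha256(), key, KEYSIZE, msg, sizeof msg, out, &outlen);
            memcpy(key, out, KEYSIZE);
    }

With this, key0 of a block comes from the key file (rekeying) and the rest of
the block is regenerated by applying ratchet_once with offsets 1..NRATCHET-1,
which is what makes per-block resynchronization possible.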
Interestingly, rekeying is slightly cheaper than
ratcheting (when NRATCHET gets bigger, the window for the lock is
bigger and hence it is slower; this can be seen in the measurements). I
think this is due to readahead being done aggressively and to how
fast the whole I/O stack is in Linux. Because of readahead and
all the cache mechanisms, the cost of reads (which need to be done synchronously
because we have to wait for the keys) is amortized, and because
of that, rekeying costs little more than a memmove. Ratcheting is more expensive
(malloc of the hash state, hashing, freeing and zeroing...). The state
of the last block of NRATCHET keys is the last ratcheted key.
We could keep it around between remounts. Instead of that, we burn
entries with count=0 for the rest of the block when we unmount.
If the number of burnt entries is not an integer multiple of
NRATCHET, mounting gives back an error (a stat of the log file is done when
mounting). The inode recorded for these entries is FAKEINODE.
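A user-space sketch of a mount-time check along those lines; this assumes the
check is simply that the number of entries in the log, derived from its size,
is a whole number of NRATCHET blocks, and entrysize stands in for the real
entry size:

    #include <sys/stat.h>
    #include <stdio.h>

    /* Sketch: the log must contain a whole number of NRATCHET blocks,
     * because unmount pads the last block with burnt FAKEINODE entries. */
    static int check_log_blocks(const char *logpath, long entrysize, long nratchet)
    {
            struct stat st;

            if (stat(logpath, &st) < 0)
                    return -1;
            if (st.st_size % entrysize != 0 ||
                (st.st_size / entrysize) % nratchet != 0) {
                    fprintf(stderr, "log is not a whole number of ratchet blocks\n");
                    return -1;
            }
            return 0;
    }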
Note that NRATCHET is not written anywhere; it is obtained automatically
from the first ratcheted log entry (ratchetoffset = 1) by trying the key with different
NRATCHET values, essentially brute forcing it.
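A rough sketch of that detection loop, reusing ratchet_once() from the sketch
above; struct entry and entry_mac_ok() are placeholders, not the real tool
structures:

    /* Sketch: brute force NRATCHET from the first entry with ratchetoffset == 1. */
    struct entry;                                   /* opaque here */
    int entry_mac_ok(const struct entry *e, const unsigned char key[KEYSIZE]);

    static int detect_nratchet(const unsigned char filekey[KEYSIZE],
                               const struct entry *first_ratcheted, int maxratchet)
    {
            unsigned char key[KEYSIZE];
            int n;

            for (n = 2; n <= maxratchet; n++) {
                    memcpy(key, filekey, KEYSIZE);
                    ratchet_once(key, 1, n);        /* candidate key for ratchet offset 1 */
                    if (entry_mac_ok(first_ratcheted, key))
                            return n;               /* this NRATCHET reproduces the MAC */
            }
            return -1;                              /* nothing matched: corrupt log or wrong key file */
    }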
Tooling:
Added a binary heap to the verify tool because the entries in the logs can be out of order
(entries earlier in the log file can refer to offsets of the sealed file out of order).
This means, for example, that for inode 7 there can be an entry
for offset 32 and then one for offset 21. When we take out
the entries we reorder them, with one caveat: if the offset in
the file is the same, they can be dummy entries (used for sealing the
log). To keep these in order, we combine the file offset with the ratchet
offset when we put them in the heap.
The heap is a min-heap which stores pointers, implemented in
heap.c and heap.h. See checkjqueues() in verify.c.
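The composite ordering key can be sketched like this; the number of bits
reserved for the ratchet offset is NBitsRatchet in verify.c, the 21 used here
is only an illustrative assumption:

    #include <stdint.h>

    enum { NBITSRATCHET = 21 };     /* assumed width; see NBitsRatchet in verify.c */

    /* Entries with the same file offset (e.g. dummy sealing entries) still come
     * out of the min-heap in ratchet order thanks to the low bits. */
    static uint64_t heapkey(uint64_t fileoffset, uint64_t ratchetoffset)
    {
            return (fileoffset << NBITSRATCHET) | ratchetoffset;
    }

This is also why NBitsRatchet bounds how big the ratchet can grow, as noted
under Future work.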
Anything to do with entries should be in entry.c entry.h, including ratcheting of
keys.
Testing with uroot: https://github.com/u-root/u-root
- In Ubuntu (the toolchain to compile a kernel and Go must already be installed)
apt install qemu-system qemu-system-x86
apt install qemu-kvm libvirt-clients libvirt-daemon-system bridge-utils virt-manager
(cd $UROOT_PATH; git clone https://github.com/u-root/u-root;cd u-root; go mod vendor; go install)
- cd tools/uroot; ./runtestu.sh
This runs all the tests inside qemu. It should say OK.
Verify remap of inodes. This is important when copying the files between filesystems: -32 43 remaps
old inode number 32 to 43. Remember that in sealfs files are addressed by inode number
so that logs can be rotated.
Concurrency:
Threads:
There are N burner kthreads; each has a burn counter
for the keystream and tries to follow the atomic integer
shared with the clients, burning up to that point. We have
a fast burner thread (number zero). It is woken every second
and on each write, and it does not go to sleep if there is
something left to burn.
The rest of the threads wake every P seconds (each P a different
prime, in a cicada-inspired anti-synchronization strategy).
These threads try as hard as possible to sync to disk by marking
the burnt region as dirty (fsync_range) under a lock. This may
have to be revisited. In any case, we try not to share any locks
with the clients (by using atomic operations).
We still have the user-space burner daemon if we want it.
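A user-space illustration of the fast-burner pattern (follow a shared atomic
counter, never sleep while behind); this is a sketch with C11 atomics and
pthreads, not the kernel code, and burn_keys is only a placeholder comment:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <unistd.h>

    static _Atomic long burnt;              /* advanced by the writing clients */

    /* Fast burner (thread zero): never sleeps while it is behind the clients. */
    static void *fast_burner(void *arg)
    {
            long done = 0;                  /* how far this thread has burnt */

            (void)arg;
            for (;;) {
                    long target = atomic_load(&burnt);
                    if (done < target) {
                            /* burn keys in [done, target): overwrite + sync (placeholder) */
                            done = target;
                            continue;       /* more work may have arrived meanwhile */
                    }
                    sleep(1);               /* kernel version: woken every second and on each write */
            }
            return NULL;
    }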
Locking:
Four possible models:
Old: take a lock, do everything synchronously in the client context.
First try (fast lock in git): the idea is to take the lock, pick an offset, release the lock, then burn that entry and
key offset. Everything is controlled by the offset.
It can only be done in one thread (each thread gets an offset and a small island
in the key and log files).
Second try (implemented now): this version has an atomic_t integer (burnt) to communicate
between client processes and burner threads. We want
to use burnt to communicate (with atomic
operations). This integer needs to be advanced
uniformly without leaving holes (if we could leave
holes we would have jiggle room to release locks
earlier). This approach means that a lock is taken,
the key is read (or ratcheted), burnt is advanced and
then the lock is released. The critical section
is quite big, but it is important to keep a
sequential order among clients or burnt does not
advance uniformly and the threads step on one
another. Having the burners asynchronous from
the clients is important, as we have to wait and
make sure the writes for burning go to disk,
which is potentially slow.
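A user-space illustration of the shape of that critical section (lock, obtain
the key, advance burnt, unlock, then do the slow hashing and writing outside
the lock). The names are illustrative, reusing ratchet_once and KEYSIZE from
the earlier sketch; the real module has its own lock, atomic_t and key file
handling:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    enum { KEYSIZE = 32, NRATCHET = 64 };   /* illustrative values, as above */

    void ratchet_once(unsigned char key[KEYSIZE], uint64_t roff, uint64_t nratchet); /* earlier sketch */

    static pthread_mutex_t klock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic long burnt;              /* shared with the burner threads */
    static unsigned char curkey[KEYSIZE];   /* current key of the block */
    static int roff;                        /* current ratchet offset */

    /* One log append as seen by a client: the key step is a single critical section. */
    static void client_key_step(unsigned char keyout[KEYSIZE])
    {
            pthread_mutex_lock(&klock);
            if (roff == 0) {
                    /* rekeying: read the next key from the key file (placeholder) */
            } else {
                    ratchet_once(curkey, roff, NRATCHET);
            }
            memcpy(keyout, curkey, KEYSIZE);
            roff = (roff + 1) % NRATCHET;
            atomic_fetch_add(&burnt, 1);    /* the burners may now burn up to here */
            pthread_mutex_unlock(&klock);
            /* the HMAC of the written data and the log append happen outside the lock */
    }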
Third try (not implemented): this version is not implemented and probably never will be (is the current version
fast enough already? as of now 10x-15x
nosealfs). It would make things blazingly fast, but the
memory cost is proportional to the number of concurrent
writes and the complexity can be high. Each
read of a key takes an offset. This generates an
operation which goes to a queue/data structure. The burners
dequeue and burn that region, maybe in order
(heap) or randomly. I feel this is too
complex for kernel code, but we may want to
implement it. We could bound the queue (channels,
reader/writer ring...) but then the clients and
the burners would have to wait for each other.
No matter what, ratcheting has to be done under a lock (key_n -> key_n+1),
so there is a causal dependency there that cannot be removed whatever locking scheme is used.
Log files are opened append-only and because of that
the offset passed to write is ignored. The offset has to be manually
updated in the inode. For that we take the lock of the
inode of the upper file, down_write(&ino->i_rwsem), to
arbitrate this. Should we use the lower_file inode to
make hard links work? Does it matter (are there any
other problems with hard links?).
New version:
We open the log files without append. We deal
with the offset ourselves: down_write(&ino->i_rwsem);
commit to an offset by setting the inode size;
up_write(&ino->i_rwsem); then write at the
underlying offset. As the file was opened without
append, the offset is actually honored.
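A rough kernel-style sketch of that sequence; logfile, lower_file, entry and
entrysize are placeholders and error handling is omitted, only the
reserve-then-write shape is the point:

    #include <linux/fs.h>

    /* Sketch: reserve a region of the log by growing i_size under the
     * inode lock, then write at the reserved offset outside the lock. */
    static loff_t reserve_and_write(struct file *logfile, struct file *lower_file,
                                    const void *entry, size_t entrysize)
    {
            struct inode *ino = file_inode(logfile);        /* upper file's inode */
            loff_t off, wpos;

            down_write(&ino->i_rwsem);
            off = i_size_read(ino);                         /* current end of the log */
            i_size_write(ino, off + entrysize);             /* commit to [off, off+entrysize) */
            up_write(&ino->i_rwsem);

            wpos = off;
            kernel_write(lower_file, entry, entrysize, &wpos); /* honored: no O_APPEND */
            return off;
    }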
Memory:
After some measuring, it seems to be a bad idea to keep the hash
state around and reuse it. I think it is ping-ponging data in the L1/L2
caches and so on. It is slightly worse, and more complex. The only
remaining problem may be fragmentation, hopefully taken care of. So it seems it is
faster to malloc each time and performance does not seem to degrade (maybe
revisit later: slab?).
When hashing, the memory for the write buffer needs to be copied from user space, only
to be hashed. For that we need a fixed memory area which we
can reuse, big enough so that there are not many calls to
copy_from_user (which may sleep and so on). We get a page, map
it and free it (fast, no fragmentation, and 4K or so of memory
should be enough for most writes). Should we add it to the hot page cache?
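A sketch of that chunked copy-and-hash loop; the HMAC update is left as a
placeholder comment since the actual crypto API usage lives in the module:

    #include <linux/gfp.h>
    #include <linux/kernel.h>
    #include <linux/uaccess.h>

    /* Sketch: hash a user buffer one page at a time through a scratch page,
     * so copy_from_user is called few times and no big allocation is needed. */
    static int hash_user_buf(const char __user *ubuf, size_t count)
    {
            unsigned long page = __get_free_page(GFP_KERNEL);
            size_t done = 0;

            if (!page)
                    return -ENOMEM;
            while (done < count) {
                    size_t chunk = min_t(size_t, count - done, PAGE_SIZE);

                    if (copy_from_user((void *)page, ubuf + done, chunk)) {
                            free_page(page);
                            return -EFAULT;
                    }
                    /* feed this chunk to the HMAC state here (placeholder) */
                    done += chunk;
            }
            free_page(page);
            return 0;
    }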
Optimizations:
- Reuse the hmacstate for both ratcheting and the HMAC of the data.
- Flip pages and hash directly from user memory instead of
copy_from_user. One page allocation less and no copying.
It may kmap more addresses into kernel space (TLB misses?).
Mitigation
When rekeying we used not to ratchet the key with previous
state (we only took it from the key file). This is dangerous (and fast :-)).
We cannot do this because:
Attack?: deleting all the entries with ratchetoffset != 0. This makes holes in the files
unless there are no entries for the holes outside. We need to
hash nratchet into the roff=0 entries somehow (at some cost: it is
just ratchet_key once with offset 0; do the same when rekeying
in verify, and in nratchet_detect when the offset is 0 too).
One way is to ratchet the key once with roff = 0:
/home/paurea/gits/sealfs/file.c:467
if(nratchet != 1)
ratchet_key(sb->key, 0, sb->nratchet, hmacstate);
memmove(key, sb->key, FPR_SIZE);
/home/paurea/gits/sealfs/tools/entries.c:285
if(nratchet != 1)
ratchet_key(kc->key, 0, nratchet);
/home/paurea/gits/sealfs/tools/verify.c:323
take out the condition e.ratchetoffset >= 1
remeasure everything :-)
Future work:
Try to detect nratchet once when an entry fails, when can grow nratchet and degrade gracefully
This is only two small changes: 1) detect nratchet on fail. 2) Make the kernel switch
nratchet with some policy (double nratchet when keystream file gets small enough, for example).
3) Be careful not to surpass NBitsRatchet (in verify.c) for the ratchet size. This can
be changed to be made "very big" by changing the heap implementation.
(using 128 bits type size or comparing carefully two 64 bits integers).
Another way would be by making the heap call a pointer to a function to compare a type
but it is probably too much.