Different accumulator design #249

adiabat · 2021-02-13T20:53:01Z

I've mentioned this a couple of times, in relation to proof size reduction, using different tree types, etc. Figured this would be a good place to write it out.

History

Long ago, before the COVID19, utreexo began with a very different design. There were no swaps, no collapses, no remove transforms. This is the story of...

Dead-leaf forest

The earlier design didn't have removes the way utreexo does now, and it was more or less the same as the construction in [8] cited in the paper. In that paper's appendix A - Limited Dynamism, the authors say that you can delete stuff by replacing a leaf with ⊥, which we'll call a dead leaf rather than upside-down-T. (called "tombstones" in other papers I think)

If you just replace all the deletions with tombstones, you pretty quickly get 95% dead leaves. The forest gets up to 30 high, as there's about a billion total txos ever, compared to 70M utxos. First thing to notice: Huh, this doesn't sound that bad, right? Max proof length of 30 instead of 26? No big deal right? It's actually a little bit worse since things get so sparse that proofs don't overlap as well.

Which leads to the observation: You can do better than just replacing things with dead leaves. There are whole huge chunks of dead leaves. Say we define our dead leaf to be ffffffff... 256 1-bits in a row. If you do that, you'll see that you end up with a lot of proofs that have a first element of ffffffff... as it's the sibling of the thing you're proving. You'll also see a bunch of proofs that have 7f0c0e0f... as the 2nd element of the proof. Well that makes sense, 7f0c0e0f... is the parent of two dead leaves, or hash(ffff..., ffff....)

Similarly with the parent of two 7f0c0e0f...s and so on, you can precompute all the parents of dead leaves up to 64 rows high or whatever, and recognize them. To make it even simpler, you can redefine your Parent() function to say that the parent of 2 dead leaves returns the same dead leaf.
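A minimal sketch of that redefined Parent() function, to make the short-circuit concrete. The `Hash` type, the `deadLeaf` variable and the use of sha256 over the concatenated children are assumptions for illustration, not the actual utreexo code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Hash is a 32-byte node hash (hypothetical type for this sketch).
type Hash [32]byte

// deadLeaf marks a deleted leaf: 256 one-bits, as described above.
var deadLeaf = newDeadLeaf()

func newDeadLeaf() Hash {
	var h Hash
	for i := range h {
		h[i] = 0xff
	}
	return h
}

// parent returns the hash of two sibling nodes. When both children are dead,
// the parent is reported as the same dead leaf instead of being hashed, so
// whole dead subtrees collapse to one recognizable value.
// Hashing left||right with sha256 is an assumption of this sketch.
func parent(l, r Hash) Hash {
	if l == deadLeaf && r == deadLeaf {
		return deadLeaf
	}
	return sha256.Sum256(append(l[:], r[:]...))
}

func main() {
	var live Hash
	live[0] = 1 // stand-in for a real leaf hash

	fmt.Println(parent(deadLeaf, deadLeaf) == deadLeaf) // two dead children: parent stays dead
	fmt.Println(parent(live, deadLeaf) == deadLeaf)     // one live child: a real hash is computed
}
```

With this rule there is no need to precompute a table of dead parents per row; any fully dead subtree shows up in a proof as the single value ffff....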

What happens then is that you "unearth" these dead leaves all the time when you check proofs. And what you can do with them is repopulate them. Just as in utreexo today, you perform deletion first, then addition. But unlike utreexo today, you don't start adding at numLeaves, you start adding at the left-most unearthed dead leaves. Since the more dead leaves there are, the more likely you'll find one in a proof, and the more clustered they are, the more you'll find large runs of them, you end up with an equilibrium of dead and live leaves. In the mainnet runs I did, it was a bit above 50% dead leaves, so in theory you're adding just 1 extra hash to each branch.
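A rough sketch of the "fill the left-most unearthed dead leaves first" rule. `assignPositions` and its parameters are hypothetical names; how unearthed positions are actually collected from proofs is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// assignPositions picks positions for `adds` new leaves. It reuses the
// left-most unearthed dead leaves first and only then appends at numLeaves.
// It returns the chosen positions and the updated numLeaves.
func assignPositions(unearthed []uint64, numLeaves uint64, adds int) ([]uint64, uint64) {
	// Copy before sorting so the caller's slice is left untouched.
	free := append([]uint64(nil), unearthed...)
	sort.Slice(free, func(i, j int) bool { return free[i] < free[j] })

	var positions []uint64
	for len(positions) < adds {
		if len(free) > 0 {
			positions = append(positions, free[0])
			free = free[1:]
			continue
		}
		// No dead leaves left to reuse: grow the forest on the right.
		positions = append(positions, numLeaves)
		numLeaves++
	}
	return positions, numLeaves
}

func main() {
	// Two dead leaves were unearthed at 9 and 2; three leaves are added.
	pos, n := assignPositions([]uint64{9, 2}, 16, 3)
	fmt.Println(pos, n) // [2 9 16] 17
}
```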

Well this seems OK, and could probably work. One of the disadvantages though is the forest gets messy, as new leaves get inserted all over the place, wherever you happen to unearth old dead leaves. So you don't seem to get as much proof overlap as we have in the swap-based utreexo. The swap based also seems "cleaner" as there's no weird dead stuff in the forest; every leaf is a UTXO and when it's gone it's gone. Old leaves tend to 'bubble' to the left where the proofs are longer. Overall, the current swap-based accumulator design seems better, right?

Well, maybe. There's another property the dead leaf forest has that could tip the scales and make it better than swap based. With the dead leaf forest, leaves never move. That's nice in that you don't have to swap them, so deletion is simpler (no remTrans2()), but also a leaf can commit to its own position!

Why would that matter? In the current leaf, we have this:

type LeafData struct {
	BlockHash [32]byte
	Outpoint  wire.OutPoint
	Height    int32
	Coinbase  bool
	Amt       int64
	PkScript  []byte
}

That all gets hashed and becomes the leaf hash, which goes into the forest. In the dead leaves forest, we could have the serialization include the leaf's own position:

type LeafData struct {
	BlockHash [32]byte
	Position  uint64
	Outpoint  wire.OutPoint
	Height    int32
	Coinbase  bool
	Amt       int64
	PkScript  []byte
}

I think these few extra bytes of data going into the leaf hash let us cut proof sizes in half.
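To illustrate what "the serialization includes the leaf's own position" could look like, here is a self-contained sketch. The `OutPoint` struct is a simplified stand-in for wire.OutPoint, and the field order, big-endian encoding and sha256 are all assumptions rather than the real utreexo serialization.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// OutPoint is a simplified stand-in for wire.OutPoint (assumption).
type OutPoint struct {
	TxHash [32]byte
	Index  uint32
}

// LeafData mirrors the struct above, including the Position field.
type LeafData struct {
	BlockHash [32]byte
	Position  uint64
	Outpoint  OutPoint
	Height    int32
	Coinbase  bool
	Amt       int64
	PkScript  []byte
}

// LeafHash serializes every field, position included, and hashes the result.
// The encoding used here is hypothetical.
func (l LeafData) LeafHash() [32]byte {
	var buf bytes.Buffer
	var u64 [8]byte
	var u32 [4]byte

	buf.Write(l.BlockHash[:])
	binary.BigEndian.PutUint64(u64[:], l.Position)
	buf.Write(u64[:])
	buf.Write(l.Outpoint.TxHash[:])
	binary.BigEndian.PutUint32(u32[:], l.Outpoint.Index)
	buf.Write(u32[:])
	binary.BigEndian.PutUint32(u32[:], uint32(l.Height))
	buf.Write(u32[:])
	if l.Coinbase {
		buf.WriteByte(1)
	} else {
		buf.WriteByte(0)
	}
	binary.BigEndian.PutUint64(u64[:], uint64(l.Amt))
	buf.Write(u64[:])
	buf.Write(l.PkScript)

	return sha256.Sum256(buf.Bytes())
}

func main() {
	a := LeafData{Position: 5, Amt: 1000, PkScript: []byte{0x51}}
	b := a
	b.Position = 23 // same utxo data, different claimed position

	// The two leaf hashes differ because the position is part of the preimage.
	fmt.Println(a.LeafHash() == b.LeafHash())
}
```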

Collisions and Pre-Images

Collision attacks are very relevant to utreexo security. If the attacker can collide a real leaf hash with a fake one, they can spend the fake one and fool everyone. They could make a "real" utxo which has some reasonable amount of funds, like 0.00001 BTC, and make a "fake" utxo, with an address they control and a much larger amount of BTC, like 100,000. They then vary both the real and fake ones and try to find a case where they both hash to the same thing. They broadcast a transaction creating the "real" utxo, but then spend the "fake" one. Since both UTXOs have the same leaf hash, the proofs are the same, so utreexo nodes will see a valid proof for the "real" utxo as a valid proof for the "fake" one, and the attacker can get a bunch of money.

Collision attacks take around 2^(n/2) work for n-bit hashes. So colliding sha256 should take around 2^128 operations. 128 bits is the (unofficial?) security parameter for bitcoin. So we're good here.

Collision attacks are much easier than pre-image or 2nd pre-image attacks. Pre-image attacks should take 2^n work, so a pre-image attack for sha256 should take 2^256 work, which is way overkill since there are a bunch of things in bitcoin (like the signatures) that only resist 2^128. Also in general, when hash functions break, the collision resistance tends to break first, so if you have a system or algorithm that only relies on the pre-image resistance of a hash function, that's seen as a stronger system than one which relies on the collision resistance of the hash function.

While utreexo and Merkle trees in general use collision resistance, there are some ways around this with utreexo. Even in the current LeafData, we have that BlockHash field. That makes collision attacks much more difficult; the attacker can use an old, known BlockHash for the "fake" utxo, but for the new one they're stuck as they don't know what the next BlockHash will be. So they have some options:

- Make lots of "fake" utxos and use only those to collide. This is no longer a collision attack, and is 2nd preimage, which takes 2^n, so that's no good.
- Create both "fake" and "real" utxos to collide in the traditional way (cycle finding, etc). This is extremely expensive, as the collision candidates on the "real" side cost 2^74 hash operations each, instead of 1. Also each successful "real" candidate that you don't broadcast is a whole block reward the attacker left on the table. So they should broadcast them!
- Seed a bunch of "real" utxos, wait for them to confirm, then try to collide with them. Each "real" utxo created helps! 256 of them shaves 8 bits of work off your collision effort.
- Really you can just collide with any leaf in the forest. So maybe spam the utxo set a bit, but you already have 80M candidates, which helps your collision a bunch.
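The "256 of them shaves 8 bits" figure follows from a simple multi-target argument: if any one of k existing hashes counts as a hit, each trial succeeds with probability k/2^n, so the expected work drops from 2^n to 2^n/k, a saving of log2(k) bits. A small sketch of that arithmetic (`bitsSaved` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"math"
)

// bitsSaved returns how many bits of work an attacker saves when any one of
// `targets` existing leaf hashes counts as a successful hit: log2(targets).
func bitsSaved(targets float64) float64 {
	return math.Log2(targets)
}

func main() {
	// 256 seeded "real" utxos, as in the list above.
	fmt.Printf("256 targets: %.1f bits\n", bitsSaved(256))

	// Roughly the whole utxo set, using the 80M figure from the text.
	fmt.Printf("80M targets: %.1f bits\n", bitsSaved(80e6))
}
```

So against the full set of about 80M leaves the saving is a bit over 26 bits, which is why the attack gets easier as the utxo set grows.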

We can maybe reduce hash outputs down from 256 bits since normal collision attacks are no longer possible. However we can't go down to 16 bytes; there are still some attacks possible that are in between the standard 2nd pre-image and collision attacks in difficulty. Worse, this collision attack gets easier the more utxos there are.

Dead-leaf forests and collision attacks

This is why having a leaf commit to its own position is great. If leaves don't move, and every leaf has its own position inside it, then the attacker can't collide against any leaf in the forest. They need to pick some number to put into their "fake" utxo, and once they do, they can only collide with the leaf hash at that position if they want their "fake" utxo proof to be accepted. If they put in position = 5 for their "fake" utxo, and do manage to find a hash collision, but they collide with the leaf at position 23, nodes will not accept the proof, as the position in the leaf data and the position in the proof don't match up.
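The check a node would add is tiny. A sketch, with hypothetical names (`checkPosition`, `errPositionMismatch`); a real verifier would run this alongside the usual Merkle branch verification, which is omitted here:

```go
package main

import (
	"errors"
	"fmt"
)

// errPositionMismatch is returned when the position committed to inside the
// leaf data differs from the position the proof was verified at.
var errPositionMismatch = errors.New("leaf position does not match proof position")

// checkPosition compares the position inside the leaf data (committed) with
// the position at which the proof places the leaf (proven).
func checkPosition(committed, proven uint64) error {
	if committed != proven {
		return errPositionMismatch
	}
	return nil
}

func main() {
	// The example from the text: the fake utxo commits to position 5 but the
	// collision that was found is with the leaf at position 23.
	fmt.Println(checkPosition(5, 23))
	fmt.Println(checkPosition(5, 5))
}
```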

I guess you can't quite make the argument that you're right at 2^n work for a fake proof, as the attacker still does have the option to mine valid headers and drop them. I should figure out how much that really helps; it doesn't seem like it helps all that much. The attacker could create 2 blocks at the same height with any difference in the resulting leaf set. Then when they grind through "fake" utxos they have 2 choices, and they broadcast the one that they collide with, dropping the effort by 1 bit. But it was 2^75 work to go from 2^128 to 2^127. Also it's not cumulative; the attack has to succeed in under 10 minutes. TODO: figure out actual equation for this.

That's the other argument though: that an attacker that can pull off this attack is already so powerful that they can just destroy bitcoin anyway, so who cares about faking a proof. If an attacker has a million times more sha256 power than the bitcoin network, is that an attacker worth worrying about? They can already re-org to genesis and rewrite the entire chain faster than they can make a fake utreexo proof. So maybe that's a better metric than an arbitrary 2^128; we say the proofs can defend against a 99.9999% attack. (much more powerful than a 51% attack).

Anyway. Stuff to try out! I still have some code for the old dead-leaf forest we could try out at some point.

kcalvinalvin · 2021-02-17T17:23:01Z

Just a random thought on swap based trees. Do you think it's maybe worth it to move subtrees around (like how cowforest is addressing them now) vs moving individual nodes around?

adiabat Feb 18, 2021

Subtrees do move around. Pollard swaps whole subtrees by switching two pointers; cowforest uses subtrees, and even forest on disk does one run of nodes per row. So it already feels subtree-y, but maybe there are other on disk representations that lend themselves to less disk I/O.

The dead-leaves design here has a lot less disk writing, as there are no swaps. When dead-leaves are inserted, everything up from there needs to be re-hashed and re-written though. But you won't get very large subtree swaps as we sometimes see in the current swap based forest.

There are probably... (er, almost definitely!) ways to avoid large subtree swaps with the current design, as there is a lot of freedom in how we swap things. ExtractTwins, for example, is totally optional. Also right now all the deletions and swaps collapse the forest to have no gaps, and then the new leaves are added at the bottom right. It could just as easily put the new leaves in the gaps, and then collapse, or any mix of the two. So for example, if it detected that a large subtree swap would happen, it could instead throw in some leaves from the list of additions. Lots of variants possible.

adiabat · 2021-03-22T03:38:34Z

Swapless deletion

Anyone who's worked on this codebase knows that swaps are annoying. They're confusing, things move around in weird ways, sometimes they have to be stashed / collapsed, all sorts of weird things like that.

A recent eprint by Bolton Bailey and Suryanarayana Sankagiri, https://eprint.iacr.org/2021/340, doesn't go into detail about how they do deletions, but it seems like they just trim the deleted leaves and change the "prefix" number associated with each node. I don't think you need the prefix numbering, but the idea of moving leaves up seems to work.

So swapless utreexo deletion is perhaps a hybrid of the current utreexo design and the Bailey / Sankagiri design. Additions work the same as in utreexo currently. Deletions, however, don't use swaps anymore.

To delete a leaf, promote its sibling to its parent. So
DeletedNode.Parent = DeletedNode.Sibling
Simple enough. This still works with csns as nodes which are proved will have parents available to be reassigned to the sibling's hash.
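A sketch of that promotion on a pointer-based tree, loosely in the spirit of a pollard. The `Node` type, `parentHash`, `join` and `buildExample` are hypothetical helpers for this example, and sha256 over the concatenated children is an assumption. Here the sibling node itself is moved into the parent's slot, which has the same effect as copying the sibling's hash into the parent.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Node is a tree node with a parent pointer (hypothetical type).
type Node struct {
	Hash                [32]byte
	Parent, Left, Right *Node
}

func parentHash(l, r [32]byte) [32]byte {
	return sha256.Sum256(append(l[:], r[:]...))
}

// join creates a parent node over l and r.
func join(l, r *Node) *Node {
	p := &Node{Hash: parentHash(l.Hash, r.Hash), Left: l, Right: r}
	l.Parent, r.Parent = p, p
	return p
}

// buildExample returns leaves 00..03 and their root (12 in the diagrams).
func buildExample() ([]*Node, *Node) {
	leaves := make([]*Node, 4)
	for i := range leaves {
		leaves[i] = &Node{Hash: sha256.Sum256([]byte{byte(i)})}
	}
	return leaves, join(join(leaves[0], leaves[1]), join(leaves[2], leaves[3]))
}

// deleteLeaf removes n by promoting its sibling into the parent's slot, then
// rehashes every ancestor above that slot. No swaps are involved.
// Deleting a root is left as a no-op, since that case is still undecided.
func deleteLeaf(n *Node) {
	p := n.Parent
	if p == nil {
		return
	}
	s := p.Left
	if s == n {
		s = p.Right
	}
	g := p.Parent
	s.Parent = g // the sibling takes the parent's place
	n.Parent = nil
	if g != nil {
		if g.Left == p {
			g.Left = s
		} else {
			g.Right = s
		}
	}
	for a := g; a != nil; a = a.Parent {
		a.Hash = parentHash(a.Left.Hash, a.Right.Hash)
	}
}

func main() {
	leaves, root := buildExample()
	deleteLeaf(leaves[1]) // 00 is promoted to where 08 was
	deleteLeaf(leaves[2]) // 03 is promoted to where 09 was
	fmt.Println(root.Left == leaves[0], root.Right == leaves[3]) // true true
}
```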

This would then create "hollowed out" trees on the left as leaves are deleted and work their way up towards the root, and new leaves are added on the right. One question is what to do when leaves work their way up to the root, or if a root is deleted. I'm leaning towards not doing anything and leaving a dead root, but there are other ways to do it. To illustrate the issue (as well as the new simpler deletion mechanism):

12              
|-------\     
08      09      10      
|---\   |---\   |---\ 
00  01  02  03  04  05  06

Say we've got 7 leaves as above. Then we delete 01, and 00 gets promoted to where 08 was:

12              
|-------\     
00      09      10      
        |---\   |---\ 
        02  03  04  05  06

Next we delete 02, and 03 gets promoted:

12              
|-------\     
00      03      10      
                |---\ 
                04  05  06

Already we can see that we could "drop" the 12 root down one row, and make it siblings with 10. This is possible without moving anything left or right, but we do need to know that 00 and 03 are leaves in order to be able to recognize that 12 can drop. If we just deleted 02, we know 03 is a leaf since we just promoted it, and we know what 00 is, but we don't know that 00 is a leaf unless we keep additional data beyond the hash. Either a leaf bit (1 bit) or a max depth beneath node, which seems hard to keep track of.

Let's go another step and delete 00:

From deleting 00, we know that 03 has no sibling as it's been promoted to root position. We know the value of 03, but without a "leaf bit" we don't know if it's a leaf or not. If it's not a leaf, though, we know it can't have more than 2 leaves under it, since it just moved up. So it could be paired with 10 instead of leaving it floating here.

Then we could even delete 03, and have the whole tree gone, and could move the 10 tree to the left or something. I'm leaning towards not doing anything and letting it stay empty, as then when leaves are inserted they can only move up. Also the insertion position of a txo would be the order it shows up in; numleaves never goes down but just becomes txosEver. In one way that's bad; there will be more roots as the number of roots will be log(txos) not log(utxos). But maybe that's not a big deal, storing a few more roots isn't hard. But it allows for leaves to commit to their own position, or really position prefix, which never changes. If 03 is in the hash, if it's at the bottom we pick root 0, and go right, right. If we get to the end before going through all the bits of the leaf position, that's OK, it just went up. This looks like it provides defense against collision attacks mentioned here #249.

The biggest problem I see is that the way we store forest on disk or in ram doesn't fit well with this swapless tree. There will be large hollowed out areas that are fine for proofs and pollards, but would be lots of 0s in the forest on disk file. Maybe cowforest can be better adapted to deal with this. Or maybe we have to do a pollard-on-disk, which really we should have anyway. It might not be a big deal to let the forest stay on disk and get bigger / empty for a bridge node. The forest would get to like 60GB or so (1 billion txos, 32 bytes each, *2 for the tree) but probably doesn't have much i/o. There are no swaps moving things around, so each deletion only changes ~96 bytes (or only changes 32 bytes if you leave the dead hashes below instead of zeroing them out)
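The "60GB or so" estimate is just this multiplication; a quick sketch to make the arithmetic explicit (`forestBytes` is a hypothetical helper, and the factor of 2 approximates the 2n-1 nodes of a binary tree with n leaves):

```go
package main

import "fmt"

// hashSize is the size of one stored node hash in bytes.
const hashSize = 32

// forestBytes estimates the on-disk forest size when every txo ever created
// keeps its slot: one hash per leaf, times 2 to cover the upper rows.
func forestBytes(txosEver uint64) uint64 {
	return txosEver * hashSize * 2
}

func main() {
	fmt.Println(forestBytes(1000000000)) // 64000000000, i.e. about 64 GB
}
```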

---

More thoughts:

No, there's still a lot of disk i/o. Say we're here:

12              
|-------\     
00      09      10      
        |---\   |---\ 
        02  03  04  05  06

And then 00 gets deleted. 09 moves to 12, and then 02, 03 need to move to 00, 09. Higher up deletions can cause large movements upwards as every node in a subtree needs to move up. It's not as bad as high swaps in that it's only half as large, but still could cause a good deal of read/write access to the forest.

Another issue - while putting the leaf position into the leaf preimage helps against collision attacks, it's not perfect. Ideally the attacker would have to pick a specific leaf to attack, and collide with that, restricting them to a single hash to collide with, which is no longer a collision attack at all, but a 2nd preimage attack. This swapless transform doesn't quite get there, as while the attacker does need to pick a leaf, they can collide with any hash above that leaf to successfully forge a proof. Eg:

12              
|-------\     
08      09        
|---\   |---\   
00  01  02  03

The attacker chooses position 02. They start hashing with their evil utxo, and they win if they get a matching hash to 02, 09, or 12. So their attack is 3X more effective. The attacker will target leaves in the highest tree, and in bitcoin's case where you can get a few billion total txos, you have a height of 32 or so. So it's not quite 2nd preimage resistance, but 5 bits worse than that. (Maybe -- there could be more math involved.)

There are ways to stop this branch collision attack -- for example, keep another bit along with the hash indicating whether the node is a leaf or not. Then the attacker can't use a collision with a non-leaf node. They can still collide upper hashes with each other, though, and you're using a bit; if you just made the hash 1 bit longer that seems just as effective (more, I think). So it doesn't seem that defending against targets in the branch is worth it. Another question is whether there will actually be any full depth leaves in the big tree on the left. In practice there might not be, but we can't rely on that for security.

Shymaa-Arafat May 24, 2021

UTXOs that never get spent never need to be proven, so those don't matter for proofs.

But surely they affect the height of the tree, and hence other proofs.

Testnet spending data patterns are very different than mainnet

So did u measure it on the real net?
Pls give a resource, link where I can measure it in real net, especially if u haven't calculate it for each kind of TXs separately.

We could make some guesses on how long a UTXO will live when it is first created, but for the IBD process everything is already known, so we don't have to guess.

The whole point of what I'm saying is to use old data statistics to predict for future data, never said u have to guess in IBD values?!
Maybe u can use IBD as a quicker test of the testnet UTXOs lifetimes; ie for example blocks 300,380 coinbase TXOs when did they got spent or r they still Unspent like those in the testnet?

kcalvinalvin May 25, 2021

UTXOs that never get spent never need to be proven, so those don't matter for proofs.

But surely they affect the height of the tree, and hence other proofs.

UTXOs that are provably unspendable are not added to the tree. This is the same behavior as Bitcoin Core, as provably unspendable UTXOs are not added to the UTXO set. So they don't affect the tree in any shape or form.

Testnet spending data patterns are very different than mainnet

So did u measure it on the real net?
Pls give a resource, link where I can measure it in real net, especially if u haven't calculate it for each kind of TXs separately.

You can run a bitcoin node to get the data (https://bitcoincore.org/en/download/). The code that we used to parse the data is at https://github.com/mit-dci/utreexo/tree/master/cmd.

We could make some guesses on how long a UTXO will live when it is first created, but for the IBD process everything is already known, so we don't have to guess.

The whole point of what I'm saying is to use old data statistics to predict for future data, never said u have to guess in IBD values?!

It's a widely known fact that newer UTXOs get spent faster, while older ones stay unspent. A chart of this is available in section 5.3, figure 2 of the Utreexo paper (https://github.com/mit-dci/utreexo/blob/master/utreexo.pdf).

Shymaa-Arafat May 25, 2021

UTXOs that are provably unspendable are not added to the tree.

I do know about that and I did not include them(I mean even when I used the testnet data)

It's a widely known fact that newer UTXOs get spent faster, while older ones stay unspent.

I know about that too, and I clearly expressed that I'm talking UTXO kind (coinbase, merge-mine,....) not being old or new
-I just re-looked closely on Fig2, it tells order of 10 UTXO have life order of 10⁵, when I found about 20 in a scattered sample of 100.
-However it is from the testnet, I'll re-check the true one.
Thank u very much for the links🙏
.
-By the way, u may like to examine this true figure from
https://t.co/nWhWAMsWLs?amp=1
Where 2m TXOs got spent in 2weeks (1-15 May21), 1.4m of them in the 1st five days 1-5 May 2021

Shymaa-Arafat May 27, 2021

I think u should know that I tried the real data from
https://m.btc.com/
-Yes min(not max or av) lifetime of coinbase UTXOs is 101, but THERE ARE very large values
.
-The block 266668 merge-mine TX (the name from the Transaction Graph Analysis paper) is real, 20 coinbase UTXOs are input to one UTXO, they're from 100,000 to 108,000; meaning they all had a lifetime above 156,668
https://m.btc.com/00000000000000066283fa21e3d99992e9a3eef59aa5f423928d44084719166b
follow the 1000Btc Tx to check birth time of each input
https://m.btc.com/f3e6066078e815bb24db0dfbff814f738943bddaaa76f8beba360cfe2882480a

.
-Block 680,000 from last April 2021 coinbase UTXO got spent after 923 block, 681000&681500 Unspent yet till block 685000+
.
-However, I can't figure out yet the reason of the 2m drop in the no of UTXOs. Block 681500 is time-stamped 2May and most TXs l checked r regular 1:2 ( should increase the counter by 1)
https://m.btc.com/00000000000000000004b3f918b242c413ce24e2a73d012cc85f81538febc6e4

Shymaa-Arafat May 28, 2021

I found out there has been some research on predicting UTXOs by Stanford on 2015,
"Bitcoin UTXO Lifespan Prediction",
Robert Konrad& Stephen Pinto
, Dec11, 2015
(6 pages report) although no much add ups on it, results r somehow different

Ofcourse a lot happened since 2015 that changed a few things, but as I said I can still find UTXOs that live for more than 1000 blocks ....
-Check the first input at this TX (in block 681807)
https://www.blockchain.com/btc/tx/2f9cdeb9f58c9ac2975a5a9e6e439dcabd9b7b67989e2a2dcf82998040278ae6
if click on the 1st UTXO, u'll find that it came from this block 669668
https://www.blockchain.com/btc/tx/3cb30629023073c61793c02add6d07e87841008523dd41c1c0d8a74d4be105f2
Lifetime=12,139
.
I reached for 681807 because I was trying to check the sudden drop in the number of UTXOs, so a one that was created in block 681807 got spent in block 682324 with shorter lifetime yes but still= 324+193=517
.
(I think the sudden drop in no was planned from Feb, someone was repeatedly doing what some paper once called merge-mine TX, ie adding up all his/her Btc in one UTXO
.
u should also view this as an insightful reason or justify for the observed long lifes

from a paper about POB
proof of Burn
https://youtu.be/Q5L8-GJVmZw

Shymaa-Arafat · 2021-04-28T19:15:48Z

I'm kind of mixed up where to discuss this better here #249 or #257,
First is the "Hybrid design" just an idea or it's already approved????
I take it from here it is still in the brain storming phase, but from "Kcalvinalvin" words in #257

With a swapless design, the locality (newest leaves being on the rightmost side) will improve even more

Do u already have a swapless design rightnow?
