Storage Refactor #13545
-
Thanks for getting this started @alexanderbez. A few thoughts:

IAVL & Performance
I think the SC/SS separation generally makes sense as a direction. The fact that SMT root hashes don't depend on insertion order seems fundamentally useful if we're eventually moving towards parallel execution, although there could be other ways of dealing with this, such as sorting keys before insertion. On reviewing the existing SMT code, I see lots of room for optimization, and I think we could have a performant store v2 if we spent more time on these details.

Re: https://hackmd.io/@1IrR8dXjR-adr5rjRkfXrg/mustafa_sdk_storage, I actually think the fact that the SDK supports iteration is pretty useful and allows us to do a bunch of things we couldn't otherwise do. I haven't seen evidence that the SC layer needs iteration itself; inclusion and exclusion proofs should be sufficient. In that case I'm not sure why iteration has such a huge cost: it's supported by all the KV stores we'd be using for SS.

Historical Data & Archive Nodes
It seems that making historical data a responsibility of the same storage layer imposes a lot of constraints. Historical state is one thing IAVL is really good for, but do we ever really need historical state for state machine execution? Could this be handled by some sort of indexing service instead, with the node itself keeping only the latest state?

One concern I have about SS is how well those KV stores are actually suited to large amounts of historical data. IAVL is designed for historical state, whereas I don't think DB snapshots are designed for that purpose at scale.

One design that may be worth considering is Datomic, a database with historical state similar to IAVL, except that I'm pretty sure it uses some sort of immutable B-trees under the hood. The system is made scalable by offloading the storage of tree nodes into distributed cloud storage such as AWS DynamoDB or Cassandra. A single node, which they call the transactor, writes the tree into cloud storage, and any number of peers can then read portions of the tree into memory; supposedly this gives pretty good performance for querying large amounts of historical data. This sort of system could work well with ADR 038 state listeners to get the data out of a node and indexed in such a database (see the sketch below). The node itself could then use the fastest storage possible and just maintain the latest version, while we would still get good (maybe better) historical state through this indexing system.
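A minimal sketch of the listener-to-indexer idea, assuming a hypothetical listener interface and `Indexer` client rather than the actual ADR 038 types: every state write observed during block execution is forwarded, together with the block height, to an external store that keeps full history, while the node itself only needs to retain the latest version.

```go
package indexing

// Hypothetical types for illustration only; the real ADR 038 listener
// interface in the SDK has a different shape.

// Indexer is an external service (e.g. backed by Cassandra or DynamoDB)
// that stores every historical version of every key.
type Indexer interface {
	// Write records the value of (storeKey, key) as of the given height;
	// deleted marks tombstones.
	Write(height int64, storeKey string, key, value []byte, deleted bool) error
}

// HistoricalListener forwards state writes to the Indexer as they happen,
// so the node itself only has to keep the latest state.
type HistoricalListener struct {
	height  int64
	indexer Indexer
}

// OnBlockBegin records the height that subsequent writes belong to.
func (l *HistoricalListener) OnBlockBegin(height int64) {
	l.height = height
}

// OnWrite is called for every set/delete performed by the state machine.
func (l *HistoricalListener) OnWrite(storeKey string, key, value []byte, deleted bool) error {
	return l.indexer.Write(l.height, storeKey, key, value, deleted)
}
```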
-
Braindump of thoughts that you should take with a grain of salt:

On iteration
The reason iteration being a requirement imposes a performance penalty on the state storage layer is that you need an ordered key/value memory caching layer to support it. That means a tree-based caching layer such as a B-tree, which is slower for random (but not sequential) reads than an unordered caching layer that e.g. just uses Go maps. I expect random reads to be the bulk of blockchain activity for many use cases, e.g. processing transactions where you have to update accounts keyed by public keys.

Note, though, that you don't need "native" iteration support in the state storage layer for a Cosmos module to iterate over keys, if it stores the keys in a smart way. A module could, for example, store sorted keys (e.g. the validator set) as a linked list (see the linked-list sketch after this comment). This wouldn't have the same performance as a sequential read in a state storage layer with native iteration, though, since reading each next node in the linked list is a random read.

It's true that you don't need iteration on the state commitment layer to support native iteration on the state storage layer. However, without iteration on the state commitment layer you can't do efficient range Merkle proofs. These are not used by the Cosmos SDK currently, but there may be a need for them in the future; in particular, they are needed for certain designs for state transition fraud proofs. That said, by storing ordered keys as a linked list in an SMT, it's possible to design fraud proofs that are efficient even if iteration is required but not natively supported by the state commitment layer.

That said, I'm not sure that getting rid of the ability to natively iterate in the state machine is practical or desirable (it depends on whether you want faster random cache reads or faster sequential cache reads), and I wouldn't necessarily advocate for it, but it does have the above considerations. Maybe it would be practical if there were some kind of abstracted sorted-key store library in the SDK that uses a linked list behind the scenes, instead of assuming the state storage layer natively supports iteration. Then you could drop native iteration from the state storage layer, allowing an (unordered) memory caching layer that's faster for random reads (but not sequential reads). A quick look at the Eth specs seems to show that they also iterate over validators. I suppose this is one of the advantages of decoupling SS and SC: allowing people to make their own tradeoffs.

SMT vs IAVL
If we agree that you don't need range proofs, and thus don't need iteration on the state commitment layer, then SMT's slight advantage over IAVL is that it does not need rebalancing when keys are appended, which means fewer read/write operations on updates. The current set operation in IAVL has to read left/right nodes when a path is updated, to determine if rebalancing is needed. I don't know how significant a penalty this is in practice. https://github.com/cosmos/iavl/blob/master/mutable_tree.go#L1153

Historical state access and pruning
I think it's really important to implement pruning of orphaned nodes in the tree, which IAVL doesn't support. This requires a lot of design work and architecting, so any store redesign needs to take it into consideration from first principles.

On the state commitment layer, if we assume that each leaf is unique, I think you can do this by implementing a "death row" of orphaned nodes, where orphaned nodes get deleted after, say, 100 versions. On the state storage layer, you probably also need to append a version suffix to each key, and then do an iterator query in the backing KV DB to get the latest key before version v when querying a specific version of a key (see the versioned-key sketch after this comment). If you want to support rollback and pruning at the same time, you'll probably also need to implement some kind of "journal" that logs updates so they can be undone. I understand that the store v2 work has tried to use the underlying snapshot feature of the backing DB, but it is unclear whether this is fit for purpose, as in some databases it is meant more as a backup feature than a versioning feature. (Wanting to support iteration on the state storage layer adds some performance complexity here that needs to be investigated: when you iterate over a key range, you need to efficiently "skip" over historical versions of keys you've already seen and don't care about.)

Related: https://blog.ethereum.org/2015/06/26/state-tree-pruning
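A minimal sketch of the linked-list idea above, assuming a hypothetical flat KVStore with point reads only (not the SDK's actual store types, and with made-up key prefixes): each entry stores the next key in sorted order, so a module can walk the set in order without the storage layer supporting native iteration.

```go
package sortedkeys

import "bytes"

// KVStore is a hypothetical unordered key/value store with point reads only.
// Get returns nil if the key is absent.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

// headKey points at the first element of the sorted list.
var headKey = []byte("sorted/head")

// nodeKey returns the storage key holding the "next" pointer for k.
func nodeKey(k []byte) []byte {
	return append([]byte("sorted/next/"), k...)
}

// Insert links k into the sorted list by walking it with point reads.
// O(n) per insert and duplicates are not handled; that is the price of
// not having native iteration.
func Insert(store KVStore, k []byte) {
	prev := []byte(nil)
	cur := store.Get(headKey)
	for cur != nil && bytes.Compare(cur, k) < 0 {
		prev, cur = cur, store.Get(nodeKey(cur))
	}
	if prev == nil {
		store.Set(headKey, k)
	} else {
		store.Set(nodeKey(prev), k)
	}
	if cur != nil {
		store.Set(nodeKey(k), cur)
	}
}

// Walk visits every key in sorted order; note that each step is a random
// read rather than a sequential one.
func Walk(store KVStore, visit func(k []byte)) {
	for cur := store.Get(headKey); cur != nil; cur = store.Get(nodeKey(cur)) {
		visit(cur)
	}
}
```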
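And a sketch of the version-suffixed key idea, assuming a hypothetical ordered KV database with a reverse iterator (the names here are invented, not the store v2 API): each write is stored under the key plus a big-endian version suffix, and a query at version v reverse-scans to the newest entry at or below v.

```go
package versioned

import "encoding/binary"

// DB is a hypothetical ordered key/value store. ReverseIterator yields
// entries in [start, end) from highest key to lowest.
type DB interface {
	Set(key, value []byte)
	ReverseIterator(start, end []byte) Iterator
}

type Iterator interface {
	Valid() bool
	Key() []byte
	Value() []byte
	Close()
}

// versionedKey encodes key || big-endian version so that all versions of a
// key sort together, newest last. A real implementation would also
// length-prefix or escape keys so one key cannot be a prefix of another.
func versionedKey(key []byte, version uint64) []byte {
	out := make([]byte, len(key)+8)
	copy(out, key)
	binary.BigEndian.PutUint64(out[len(key):], version)
	return out
}

// Set writes the value of key as of version. Deletes would be written as
// tombstone values in the same way.
func Set(db DB, key, value []byte, version uint64) {
	db.Set(versionedKey(key, version), value)
}

// GetAtVersion returns the newest value of key written at or before version,
// or nil if the key did not exist yet.
func GetAtVersion(db DB, key []byte, version uint64) []byte {
	start := versionedKey(key, 0)
	end := versionedKey(key, version+1) // exclusive upper bound
	it := db.ReverseIterator(start, end)
	defer it.Close()
	if it.Valid() {
		return it.Value()
	}
	return nil
}
```

Range iteration over such a store is exactly where the "skipping" cost mentioned above shows up: a scan has to step over every historical version of a key it has already emitted.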
-
Adaptive Radix Tree as an alternative tree structure
I was wondering whether there's a data structure that's ordered like IAVL, but also generates root Merkle hashes that don't depend on insertion order, like SMT. I'm pretty sure any sort of radix tree has (or could have) this property, and as I understand it an SMT is a radix tree with r=2 over hashed keys. An SMT is ordered, but because of the hashing it's ordered by hash rather than lexicographically. I'm pretty sure an Adaptive Radix Tree could let us store variable-length keys (without needing to hash them) and still produce a deterministic Merkle hash independent of insertion order (see the sketch below). It also seems to have other nice properties, like point queries with performance similar to hash maps and the ability to make efficient use of SIMD operations. There is also precedent for it being used as a database index store with apparently good performance: https://duckdb.org/2022/07/27/art-storage.html. Might this be an alternative worth considering?
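To illustrate the order-independence being relied on here, a toy sketch (a plain 16-ary radix trie, not an actual ART with adaptive node sizes, and not production hashing): a key's position is determined solely by its bytes, so Merkle-izing nodes in fixed child order yields the same root for the same key/value set regardless of insertion order.

```go
package radixhash

import "crypto/sha256"

// node is a deliberately simplified 16-ary radix trie node, used only to
// show why the root hash of a radix-style tree is a pure function of the
// key/value set: structure depends on key bytes, never on insertion order.
type node struct {
	children [16]*node
	value    []byte // non-nil only where a key terminates
}

// Insert walks the key one nibble at a time, creating nodes as needed.
func (n *node) Insert(key, value []byte) {
	cur := n
	for _, b := range key {
		for _, nib := range []byte{b >> 4, b & 0x0f} {
			if cur.children[nib] == nil {
				cur.children[nib] = &node{}
			}
			cur = cur.children[nib]
		}
	}
	cur.value = value
}

// Hash Merkle-izes the trie: a node's hash covers its value and its
// children's hashes in fixed child-index order, so two tries holding the
// same key/value pairs always produce the same root.
func (n *node) Hash() [32]byte {
	h := sha256.New()
	h.Write(n.value)
	for _, c := range n.children {
		if c != nil {
			ch := c.Hash()
			h.Write(ch[:])
		} else {
			h.Write(make([]byte, 32)) // placeholder for an empty child
		}
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```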
-
Using the same tree structure throughout the stack
Another idea on my mind has been: why don't we use the same tree structure throughout the DB stack? IAVL is implemented on top of a KV store, which generally uses an LSM tree under the hood (LevelDB, RocksDB, etc.), and then at the memory cache layer we throw an in-memory B-tree on top of IAVL. Am I totally crazy in thinking that this may be introducing unnecessary inefficiencies?

Starting with the storage layer, there must be things that LevelDB/RocksDB are doing that we don't need for an immutable AVL tree. For instance, RocksDB/LevelDB, and probably most other DBs, maintain some sort of write-ahead log (WAL). In my mind this simply isn't needed, because Tendermint is basically our WAL. I know we don't want to have to deal with disk-level storage, but all databases are basically built on top of a data structure like an AVL tree, although usually a tree with more children, like a B-tree, to optimize for page sizes. I'm pretty sure we could consider a structure with more children, such as a B-tree, adaptive radix tree, or Jellyfish Merkle tree. There are also specific optimizations we can make for an immutable data structure like IAVL; for instance, nodes are only ever inserted or deleted, never modified. I'm not an expert in this area by any means, so I could be totally off in suggesting that we might be better off dealing with storage directly, but I think it's worth considering.

At the cache layer, we could take advantage of an immutable data structure like IAVL (or really any immutable tree) and have an in-memory data structure that just builds off the same on-disk nodes and then persists itself to disk on commit. I commented on this a bit here: #13545 (reply in thread). Immutable data structures have cache branching built in (see the sketch below). It just seems so inefficient to put a B-tree on top of IAVL and then write merge iterators when there are functional data structures that do all of that for you. (Check out the work done in Clojure, in particular https://github.com/clojure/data.avl, which has some AVL transient designs that could be useful.) Anyway, I could be totally off base here, but maybe it's worth considering an end-to-end integrated on-disk/in-memory solution with a single highly optimized data structure?
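A toy sketch of the "cache branching comes for free" point, using a hypothetical unbalanced immutable binary search tree (not IAVL itself, with no rebalancing or disk-backed nodes): inserting returns a new root that shares every untouched subtree with the old one, so an uncommitted in-memory branch and the last committed version coexist without a separate B-tree cache or merge iterators.

```go
package persistenttree

import "bytes"

// Node is an immutable binary search tree node. Once created it is never
// mutated, so any number of tree versions can safely share subtrees.
type Node struct {
	Key, Value  []byte
	Left, Right *Node
}

// Insert returns the root of a new tree containing (key, value). Only the
// nodes along the search path are copied; everything else is shared with
// the original tree. (Real IAVL additionally rebalances and persists nodes.)
func Insert(n *Node, key, value []byte) *Node {
	if n == nil {
		return &Node{Key: key, Value: value}
	}
	switch cmp := bytes.Compare(key, n.Key); {
	case cmp < 0:
		return &Node{Key: n.Key, Value: n.Value, Left: Insert(n.Left, key, value), Right: n.Right}
	case cmp > 0:
		return &Node{Key: n.Key, Value: n.Value, Left: n.Left, Right: Insert(n.Right, key, value)}
	default:
		return &Node{Key: n.Key, Value: value, Left: n.Left, Right: n.Right}
	}
}

// Get reads from whichever version of the tree you hold a root for.
func Get(n *Node, key []byte) []byte {
	for n != nil {
		switch cmp := bytes.Compare(key, n.Key); {
		case cmp < 0:
			n = n.Left
		case cmp > 0:
			n = n.Right
		default:
			return n.Value
		}
	}
	return nil
}
```

A block's working state is then just a root pointer derived from the last committed root: committing means persisting the newly created nodes and swapping the pointer, and aborting means dropping the pointer.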
-
A little while ago, I was asked to extract a reasonably complex amount of data from the SDK with a sub-millisecond latency budget. Consequently, I spent probably 3 months in a deep dive of the state layer, and became quite familiar with the relevant code and its performance pathologies. While researching, I saw lots of claims in issues, PRs, chats, etc. that IAVL was slow. I did find one critical performance error in the package (the iterator methods spawn a goroutine and yield results over a channel), but other than that, IAVL was never a bottleneck.

There were two main bottlenecks. First, impacting request performance as perceived by users: the amount of code, and the layers of abstraction, that each request for state had to traverse in order to produce a result. This was the main contributor to perceived slowness in the read path. A variety of design and implementation errors at each layer exacerbate the problem; the main ones that come to mind now are unnecessary allocations and over-use of package reflect. Second, impacting resource utilization of the nodes themselves: LevelDB compactions. Those compactions account for the majority of CPU utilization and disk IOPS produced by the SDK, and are the main contributor to perceived slowness in the write path. These compactions seemed to be caused by a mismatch between Tendermint's key naming scheme and LevelDB's expectations, resulting in each compaction (if I recall correctly) reading and re-writing almost every file. Most of the resource utilization of Cosmos nodes appears to be related to this inefficient use of LevelDB. Nothing else, no other code path, comes close.

These two (admittedly kind of abstract) things accounted for probably 90% of the performance issues I had to solve when interacting with the state layer. I did end up solving them, for the record: if you talk to the IAVL directly, without any of the cache-whatever layers, avoid iterators, and build the key prefixes yourself, you can get a lot of information out of a KVStore in a few hundred microseconds (see the sketch below).

So I'm, I guess, surprised to see that "IAVL is slow" is just kind of assumed as ground truth. To be clear, I'm not challenging this conclusion! I just wanted to ask for some numbers. Can that slowness be quantified? Is there a corpus of performance data, some profiles, some benchmark results, that I can look at, or ideally run myself, to understand the problems? At the same time, when the topic is state performance, I'm surprised that fixing the query path and the LevelDB pathologies aren't at the top of the priority list. I'd bet fixing those two things alone would improve performance and resource utilization of the SDK by at least an order of magnitude. (Ain't no reason a non-archival full node needs 32GB of RAM or 1TB of disk!) Am I naïve? 😅 Have these topics been reviewed before, and maybe my experience is non-representative? Or maybe is this the first you're hearing of it?
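To make those two points concrete, a sketch with a made-up store interface and key layout (not the SDK's actual types or prefixes): the first helper mimics the goroutine-plus-channel iterator pattern described above, which pays scheduling and synchronization costs per element; the second shows the direct approach of building the composite key yourself and doing a single point read.

```go
package stateread

import "fmt"

// KVStore is a hypothetical flat key/value view over state.
type KVStore interface {
	Get(key []byte) []byte
	// IterateRange calls fn for each key/value in [start, end);
	// fn returns true to stop early.
	IterateRange(start, end []byte, fn func(key, value []byte) bool)
}

// iterateViaChannel mirrors the anti-pattern mentioned above: every
// iteration spins up a goroutine and pushes results through a channel.
func iterateViaChannel(store KVStore, start, end []byte) <-chan []byte {
	out := make(chan []byte)
	go func() {
		defer close(out)
		store.IterateRange(start, end, func(_, value []byte) bool {
			out <- value
			return false // keep going
		})
	}()
	return out
}

// balanceKey builds a composite key directly. The "balances/" prefix and
// the address encoding are invented for this sketch.
func balanceKey(addr []byte) []byte {
	return append([]byte("balances/"), addr...)
}

// getBalance skips iterators entirely: one allocation for the key and one
// point read against the store.
func getBalance(store KVStore, addr []byte) ([]byte, error) {
	if v := store.Get(balanceKey(addr)); v != nil {
		return v, nil
	}
	return nil, fmt.Errorf("no balance for %x", addr)
}
```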
-
#13837
-
Context
My motivation for starting this discussion is to have a more open forum around the future of the store package in the SDK and the general storage architecture, without polluting the existing Storage EPIC.
At the time of this writing, the Storage EPIC only contains high-level ideas for what we currently envision needs to happen.
I believe it would be beneficial for the community as a whole if we discussed the major pain points with the current architectural model, explored approaches to tackling each, and then came to consensus as a dev community on what we believe to be the best approach for each, striking a good balance between dev and client UX and performance.
Problems
IAVL & Performance
Today the SDK uses a logical Merkle-ized store per module, that store being an IAVL store. The IAVL store, and thus each module, handles both State Commitment (SC) and State Storage (SS). There are a few concerns with this approach:
Historical Data & Archive Nodes
Archive nodes serve a very important role in the Cosmos ecosystem, as they store and expose all data from genesis and/or from the most recent chain upgrade (depending on the data). Granted, we are pushing to offload more complex query patterns to indexers such as Numia. Even so, archive nodes will continue to serve an important role.
However, if you look at any major chain, the application state is in the multi-TB range, in some cases after only a year of being live, which is simply unacceptable. Even querying these nodes is awfully slow, to the point of being unusable.
This ties pretty heavily into the other major theme, IAVL & Performance. However, it warrants its own discussion: even if/when we separate SS and SC and improve IAVL, there is still design needed around how to store and represent historical data logically and on disk (see the sketch below). Ref #13317.
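For concreteness, here is one hypothetical way the SS/SC split could be expressed as interfaces. The names and signatures are invented for illustration (this is not the store v2 API): SS serves versioned reads and prunes on its own schedule, while SC only produces roots and proofs and never needs to serve reads or iterate.

```go
package storetypes

// KVPair is a single write in a block's changeset; Delete marks removal.
type KVPair struct {
	Key, Value []byte
	Delete     bool
}

// Proof is a placeholder for an ICS-23-style (non-)existence proof.
type Proof struct {
	Spec []byte
	Ops  [][]byte
}

// StateStorage holds raw key/value data, including enough history to serve
// queries at recent heights, and can prune old versions independently of
// the commitment structure.
type StateStorage interface {
	Get(storeKey string, version uint64, key []byte) ([]byte, error)
	ApplyChangeset(version uint64, changes map[string][]KVPair) error
	Prune(retainVersions uint64) error
}

// StateCommitment only produces root hashes and (non-)membership proofs;
// it never serves direct state reads and never iterates.
type StateCommitment interface {
	ApplyChangeset(version uint64, changes map[string][]KVPair) (rootHash []byte, err error)
	Prove(version uint64, storeKey string, key []byte) (Proof, error)
}
```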
cc @tac0turtle @ValarDragon @yihuang @musalbas