Storage Refactor #13545
-
Thanks for getting this started @alexanderbez. A few thoughts:

IAVL & Performance
I think the SC/SS separation generally makes sense as a direction. The fact that SMT root hashes don't depend on insertion order seems fundamentally useful if we're eventually moving towards parallel execution, although there could be other ways of dealing with this, such as sorting keys before insertion. On reviewing the existing SMT code, I see lots of room for optimization, and I think we could have a performant store v2 if we spent more time on these details.

Re: https://hackmd.io/@1IrR8dXjR-adr5rjRkfXrg/mustafa_sdk_storage, I actually think the fact that the SDK supports iteration is pretty useful and allows us to do a bunch of things we couldn't otherwise do. I haven't seen evidence that the SC layer needs iteration itself; inclusion and exclusion proofs should be sufficient. In that case I'm not sure why iteration has such a huge cost: it's supported by all the KV stores we'd be using for SS.

Historical Data & Archive Nodes
It seems that making historical data a responsibility of the same storage layer imposes a lot of constraints. Historical state is one thing IAVL is really good for, but do we ever really need historical state for state machine execution? Could this be handled by some sort of indexing service instead, with the node itself keeping only the latest state?

One concern I have about SS is how well those KV stores are actually suited to large amounts of historical data. IAVL is designed for historical state, whereas I don't think DB snapshots are designed for that purpose at scale.

One design that may be worth considering is Datomic, a database with historical state similar to IAVL, except that I'm pretty sure it uses some sort of immutable B-trees under the hood. The system is made scalable by offloading the storage of tree nodes into distributed cloud storage such as AWS DynamoDB or Cassandra. A single node, which they call the transactor, writes the tree into cloud storage, and any number of peers can then read portions of the tree into memory; supposedly this gives pretty good performance for querying large amounts of historical data. This sort of system could work well with ADR 038 state listeners to get the data out of a node and indexed in such a database (see the sketch below). The node itself could then use the fastest storage possible and just maintain the latest version, while we would still get good (maybe better) historical state through this indexing system.
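A minimal sketch of the listener-to-indexer idea, assuming a hypothetical listener interface and `Indexer` client rather than the actual ADR 038 types: every state write observed during block execution is forwarded, together with the block height, to an external store that keeps full history, while the node itself only needs to retain the latest version.

```go
package indexing

// Hypothetical types for illustration only; the real ADR 038 listener
// interface in the SDK has a different shape.

// Indexer is an external service (e.g. backed by Cassandra or DynamoDB)
// that stores every historical version of every key.
type Indexer interface {
	// Write records the value of (storeKey, key) as of the given height;
	// deleted marks tombstones.
	Write(height int64, storeKey string, key, value []byte, deleted bool) error
}

// HistoricalListener forwards state writes to the Indexer as they happen,
// so the node itself only has to keep the latest state.
type HistoricalListener struct {
	height  int64
	indexer Indexer
}

// OnBlockBegin records the height that subsequent writes belong to.
func (l *HistoricalListener) OnBlockBegin(height int64) {
	l.height = height
}

// OnWrite is called for every set/delete performed by the state machine.
func (l *HistoricalListener) OnWrite(storeKey string, key, value []byte, deleted bool) error {
	return l.indexer.Write(l.height, storeKey, key, value, deleted)
}
```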
-
Braindump of thoughts that you should take with a grain of salt:

On iteration
The reason iteration being a requirement imposes a performance penalty on the state storage layer is that you need an ordered key/value memory caching layer to support it. That means a tree-based caching layer such as a B-tree, which is slower for random (but not sequential) reads than an unordered caching layer that e.g. just uses Go maps. I expect random reads to be the bulk of blockchain activity for many use cases, e.g. processing transactions where you have to update accounts keyed by public keys.

Note, though, that you don't need "native" iteration support in the state storage layer for a Cosmos module to iterate over keys, if it stores the keys in a smart way. A module could, for example, store sorted keys (e.g. the validator set) as a linked list (see the linked-list sketch after this comment). This wouldn't have the same performance as a sequential read in a state storage layer with native iteration, though, since reading each next node in the linked list is a random read.

It's true that you don't need iteration on the state commitment layer to support native iteration on the state storage layer. However, without iteration on the state commitment layer you can't do efficient range Merkle proofs. These are not used by the Cosmos SDK currently, but there may be a need for them in the future; in particular, they are needed for certain designs for state transition fraud proofs. That said, by storing ordered keys as a linked list in an SMT, it's possible to design fraud proofs that are efficient even if iteration is required but not natively supported by the state commitment layer.

That said, I'm not sure that getting rid of the ability to natively iterate in the state machine is practical or desirable (it depends on whether you want faster random cache reads or faster sequential cache reads), and I wouldn't necessarily advocate for it, but it does have the above considerations. Maybe it would be practical if there were some kind of abstracted sorted-key store library in the SDK that uses a linked list behind the scenes, instead of assuming the state storage layer natively supports iteration. Then you could drop native iteration from the state storage layer, allowing an (unordered) memory caching layer that's faster for random reads (but not sequential reads). A quick look at the Eth specs seems to show that they also iterate over validators. I suppose this is one of the advantages of decoupling SS and SC: allowing people to make their own tradeoffs.

SMT vs IAVL
If we agree that you don't need range proofs, and thus don't need iteration on the state commitment layer, then SMT's slight advantage over IAVL is that it does not need rebalancing when keys are appended, which means fewer read/write operations on updates. The current set operation in IAVL has to read left/right nodes when a path is updated, to determine if rebalancing is needed. I don't know how significant a penalty this is in practice. https://github.com/cosmos/iavl/blob/master/mutable_tree.go#L1153

Historical state access and pruning
I think it's really important to implement pruning of orphaned nodes in the tree, which IAVL doesn't support. This requires a lot of design work and architecting, so any store redesign needs to take it into consideration from first principles.

On the state commitment layer, if we assume that each leaf is unique, I think you can do this by implementing a "death row" of orphaned nodes, where orphaned nodes get deleted after, say, 100 versions. On the state storage layer, you probably also need to append a version suffix to each key, and then do an iterator query in the backing KV DB to get the latest key before version v when querying a specific version of a key (see the versioned-key sketch after this comment). If you want to support rollback and pruning at the same time, you'll probably also need to implement some kind of "journal" that logs updates so they can be undone. I understand that the store v2 work has tried to use the underlying snapshot feature of the backing DB, but it is unclear whether this is fit for purpose, as in some databases it is meant more as a backup feature than a versioning feature. (Wanting to support iteration on the state storage layer adds some performance complexity here that needs to be investigated: when you iterate over a key range, you need to efficiently "skip" over historical versions of keys you've already seen and don't care about.)

Related: https://blog.ethereum.org/2015/06/26/state-tree-pruning
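A minimal sketch of the linked-list idea above, assuming a hypothetical flat KVStore with point reads only (not the SDK's actual store types, and with made-up key prefixes): each entry stores the next key in sorted order, so a module can walk the set in order without the storage layer supporting native iteration.

```go
package sortedkeys

import "bytes"

// KVStore is a hypothetical unordered key/value store with point reads only.
// Get returns nil if the key is absent.
type KVStore interface {
	Get(key []byte) []byte
	Set(key, value []byte)
}

// headKey points at the first element of the sorted list.
var headKey = []byte("sorted/head")

// nodeKey returns the storage key holding the "next" pointer for k.
func nodeKey(k []byte) []byte {
	return append([]byte("sorted/next/"), k...)
}

// Insert links k into the sorted list by walking it with point reads.
// O(n) per insert and duplicates are not handled; that is the price of
// not having native iteration.
func Insert(store KVStore, k []byte) {
	prev := []byte(nil)
	cur := store.Get(headKey)
	for cur != nil && bytes.Compare(cur, k) < 0 {
		prev, cur = cur, store.Get(nodeKey(cur))
	}
	if prev == nil {
		store.Set(headKey, k)
	} else {
		store.Set(nodeKey(prev), k)
	}
	if cur != nil {
		store.Set(nodeKey(k), cur)
	}
}

// Walk visits every key in sorted order; note that each step is a random
// read rather than a sequential one.
func Walk(store KVStore, visit func(k []byte)) {
	for cur := store.Get(headKey); cur != nil; cur = store.Get(nodeKey(cur)) {
		visit(cur)
	}
}
```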
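And a sketch of the version-suffixed key idea, assuming a hypothetical ordered KV database with a reverse iterator (the names here are invented, not the store v2 API): each write is stored under the key plus a big-endian version suffix, and a query at version v reverse-scans to the newest entry at or below v.

```go
package versioned

import "encoding/binary"

// DB is a hypothetical ordered key/value store. ReverseIterator yields
// entries in [start, end) from highest key to lowest.
type DB interface {
	Set(key, value []byte)
	ReverseIterator(start, end []byte) Iterator
}

type Iterator interface {
	Valid() bool
	Key() []byte
	Value() []byte
	Close()
}

// versionedKey encodes key || big-endian version so that all versions of a
// key sort together, newest last. A real implementation would also
// length-prefix or escape keys so one key cannot be a prefix of another.
func versionedKey(key []byte, version uint64) []byte {
	out := make([]byte, len(key)+8)
	copy(out, key)
	binary.BigEndian.PutUint64(out[len(key):], version)
	return out
}

// Set writes the value of key as of version. Deletes would be written as
// tombstone values in the same way.
func Set(db DB, key, value []byte, version uint64) {
	db.Set(versionedKey(key, version), value)
}

// GetAtVersion returns the newest value of key written at or before version,
// or nil if the key did not exist yet.
func GetAtVersion(db DB, key []byte, version uint64) []byte {
	start := versionedKey(key, 0)
	end := versionedKey(key, version+1) // exclusive upper bound
	it := db.ReverseIterator(start, end)
	defer it.Close()
	if it.Valid() {
		return it.Value()
	}
	return nil
}
```

Range iteration over such a store is exactly where the "skipping" cost mentioned above shows up: a scan has to step over every historical version of a key it has already emitted.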
-
Adaptive Radix Tree as an alternative tree structure
I was wondering whether there's a data structure that's ordered like IAVL, but also generates root Merkle hashes that don't depend on insertion order, like SMT. I'm pretty sure any sort of radix tree has (or could have) this property, and as I understand it an SMT is a radix tree with r=2 over hashed keys. An SMT is ordered, but because of the hashing it's ordered by hash rather than lexicographically. I'm pretty sure an Adaptive Radix Tree could let us store variable-length keys (without needing to hash them) and still produce a deterministic Merkle hash independent of insertion order (see the sketch below). It also seems to have other nice properties, like point queries with performance similar to hash maps and the ability to make efficient use of SIMD operations. There is also precedent for it being used as a database index store with apparently good performance: https://duckdb.org/2022/07/27/art-storage.html. Might this be an alternative worth considering?
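To illustrate the order-independence being relied on here, a toy sketch (a plain 16-ary radix trie, not an actual ART with adaptive node sizes, and not production hashing): a key's position is determined solely by its bytes, so Merkle-izing nodes in fixed child order yields the same root for the same key/value set regardless of insertion order.

```go
package radixhash

import "crypto/sha256"

// node is a deliberately simplified 16-ary radix trie node, used only to
// show why the root hash of a radix-style tree is a pure function of the
// key/value set: structure depends on key bytes, never on insertion order.
type node struct {
	children [16]*node
	value    []byte // non-nil only where a key terminates
}

// Insert walks the key one nibble at a time, creating nodes as needed.
func (n *node) Insert(key, value []byte) {
	cur := n
	for _, b := range key {
		for _, nib := range []byte{b >> 4, b & 0x0f} {
			if cur.children[nib] == nil {
				cur.children[nib] = &node{}
			}
			cur = cur.children[nib]
		}
	}
	cur.value = value
}

// Hash Merkle-izes the trie: a node's hash covers its value and its
// children's hashes in fixed child-index order, so two tries holding the
// same key/value pairs always produce the same root.
func (n *node) Hash() [32]byte {
	h := sha256.New()
	h.Write(n.value)
	for _, c := range n.children {
		if c != nil {
			ch := c.Hash()
			h.Write(ch[:])
		} else {
			h.Write(make([]byte, 32)) // placeholder for an empty child
		}
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```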
-
Using the same tree structure throughout the stack
Another idea on my mind has been: why don't we use the same tree structure throughout the DB stack? IAVL is implemented on top of a KV store, which generally uses an LSM tree under the hood (LevelDB, RocksDB, etc.), and then at the memory cache layer we throw an in-memory B-tree on top of IAVL. Am I totally crazy in thinking that this may be introducing unnecessary inefficiencies?

Starting with the storage layer, there must be things that LevelDB/RocksDB are doing that we don't need for an immutable AVL tree. For instance, RocksDB/LevelDB, and probably most other DBs, maintain some sort of write-ahead log (WAL). In my mind this simply isn't needed, because Tendermint is basically our WAL. I know we don't want to have to deal with disk-level storage, but all databases are basically built on top of a data structure like an AVL tree, although usually a tree with more children, like a B-tree, to optimize for page sizes. I'm pretty sure we could consider a structure with more children, such as a B-tree, adaptive radix tree, or Jellyfish Merkle tree. There are also specific optimizations we can make for an immutable data structure like IAVL; for instance, nodes are only ever inserted or deleted, never modified. I'm not an expert in this area by any means, so I could be totally off in suggesting that we might be better off dealing with storage directly, but I think it's worth considering.

At the cache layer, we could take advantage of an immutable data structure like IAVL (or really any immutable tree) and have an in-memory data structure that just builds off the same on-disk nodes and then persists itself to disk on commit. I commented on this a bit here: #13545 (reply in thread). Immutable data structures have cache branching built in (see the sketch below). It just seems so inefficient to put a B-tree on top of IAVL and then write merge iterators when there are functional data structures that do all of that for you. (Check out the work done in Clojure, in particular https://github.com/clojure/data.avl, which has some AVL transient designs that could be useful.) Anyway, I could be totally off base here, but maybe it's worth considering an end-to-end integrated on-disk/in-memory solution with a single highly optimized data structure?
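A toy sketch of the "cache branching comes for free" point, using a hypothetical unbalanced immutable binary search tree (not IAVL itself, with no rebalancing or disk-backed nodes): inserting returns a new root that shares every untouched subtree with the old one, so an uncommitted in-memory branch and the last committed version coexist without a separate B-tree cache or merge iterators.

```go
package persistenttree

import "bytes"

// Node is an immutable binary search tree node. Once created it is never
// mutated, so any number of tree versions can safely share subtrees.
type Node struct {
	Key, Value  []byte
	Left, Right *Node
}

// Insert returns the root of a new tree containing (key, value). Only the
// nodes along the search path are copied; everything else is shared with
// the original tree. (Real IAVL additionally rebalances and persists nodes.)
func Insert(n *Node, key, value []byte) *Node {
	if n == nil {
		return &Node{Key: key, Value: value}
	}
	switch cmp := bytes.Compare(key, n.Key); {
	case cmp < 0:
		return &Node{Key: n.Key, Value: n.Value, Left: Insert(n.Left, key, value), Right: n.Right}
	case cmp > 0:
		return &Node{Key: n.Key, Value: n.Value, Left: n.Left, Right: Insert(n.Right, key, value)}
	default:
		return &Node{Key: n.Key, Value: value, Left: n.Left, Right: n.Right}
	}
}

// Get reads from whichever version of the tree you hold a root for.
func Get(n *Node, key []byte) []byte {
	for n != nil {
		switch cmp := bytes.Compare(key, n.Key); {
		case cmp < 0:
			n = n.Left
		case cmp > 0:
			n = n.Right
		default:
			return n.Value
		}
	}
	return nil
}
```

A block's working state is then just a root pointer derived from the last committed root: committing means persisting the newly created nodes and swapping the pointer, and aborting means dropping the pointer.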
-
A little while ago, I was asked to extract a reasonably complex amount of data from the SDK with a sub-millisecond latency budget. Consequently, I spent probably 3 months in a deep dive of the state layer, and became quite familiar with the relevant code and its performance pathologies. While researching, I saw lots of claims in issues, PRs, chats, etc. that IAVL was slow. I did find one critical performance error in the package (the iterator methods spawn a goroutine and yield results over a channel), but other than that, IAVL was never a bottleneck.

There were two main bottlenecks. First, impacting request performance as perceived by users: the amount of code, and the layers of abstraction, that each request for state had to traverse in order to produce a result. This was the main contributor to perceived slowness in the read path. A variety of design and implementation errors at each layer exacerbate the problem; the main ones that come to mind now are unnecessary allocations and over-use of package reflect. Second, impacting resource utilization of the nodes themselves: LevelDB compactions. Those compactions account for the majority of CPU utilization and disk IOPS produced by the SDK, and are the main contributor to perceived slowness in the write path. These compactions seemed to be caused by a mismatch between Tendermint's key naming scheme and LevelDB's expectations, resulting in each compaction (if I recall correctly) reading and re-writing almost every file. Most of the resource utilization of Cosmos nodes appears to be related to this inefficient use of LevelDB. Nothing else, no other code path, comes close.

These two (admittedly kind of abstract) things accounted for probably 90% of the performance issues I had to solve when interacting with the state layer. I did end up solving them, for the record: if you talk to the IAVL directly, without any of the cache-whatever layers, avoid iterators, and build the key prefixes yourself, you can get a lot of information out of a KVStore in a few hundred microseconds (see the sketch below).

So I'm, I guess, surprised to see that "IAVL is slow" is just kind of assumed as ground truth. To be clear, I'm not challenging this conclusion! I just wanted to ask for some numbers. Can that slowness be quantified? Is there a corpus of performance data, some profiles, some benchmark results, that I can look at, or ideally run myself, to understand the problems? At the same time, when the topic is state performance, I'm surprised that fixing the query path and the LevelDB pathologies aren't at the top of the priority list. I'd bet fixing those two things alone would improve performance and resource utilization of the SDK by at least an order of magnitude. (Ain't no reason a non-archival full node needs 32GB of RAM or 1TB of disk!) Am I naïve? 😅 Have these topics been reviewed before, and maybe my experience is non-representative? Or maybe is this the first you're hearing of it?
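To make those two points concrete, a sketch with a made-up store interface and key layout (not the SDK's actual types or prefixes): the first helper mimics the goroutine-plus-channel iterator pattern described above, which pays scheduling and synchronization costs per element; the second shows the direct approach of building the composite key yourself and doing a single point read.

```go
package stateread

import "fmt"

// KVStore is a hypothetical flat key/value view over state.
type KVStore interface {
	Get(key []byte) []byte
	// IterateRange calls fn for each key/value in [start, end);
	// fn returns true to stop early.
	IterateRange(start, end []byte, fn func(key, value []byte) bool)
}

// iterateViaChannel mirrors the anti-pattern mentioned above: every
// iteration spins up a goroutine and pushes results through a channel.
func iterateViaChannel(store KVStore, start, end []byte) <-chan []byte {
	out := make(chan []byte)
	go func() {
		defer close(out)
		store.IterateRange(start, end, func(_, value []byte) bool {
			out <- value
			return false // keep going
		})
	}()
	return out
}

// balanceKey builds a composite key directly. The "balances/" prefix and
// the address encoding are invented for this sketch.
func balanceKey(addr []byte) []byte {
	return append([]byte("balances/"), addr...)
}

// getBalance skips iterators entirely: one allocation for the key and one
// point read against the store.
func getBalance(store KVStore, addr []byte) ([]byte, error) {
	if v := store.Get(balanceKey(addr)); v != nil {
		return v, nil
	}
	return nil, fmt.Errorf("no balance for %x", addr)
}
```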
-
#13837
-
Context
My motivation for starting this discussion is to have a more open forum around the future of the store package in the SDK and the general storage architecture, without polluting the existing Storage EPIC.
At the time of this writing, the Storage EPIC only contains high-level ideas for what we currently envision needs to happen.
I believe it would be beneficial for the community as a whole if we discussed the major pain points with the current architectural model, explored approaches to tackling each, and then came to consensus as a dev community on what we believe to be the best approach for each, striking a good balance between dev and client UX and performance.
Problems
IAVL & Performance
Today the SDK uses a logical Merkle-ized store per module, that store being an IAVL store. The IAVL store, and thus each module, handles both State Commitment (SC) and State Storage (SS). There are a few concerns with this approach:
Historical Data & Archive Nodes
Archive nodes serve a very important role in the Cosmos ecosystem, as they store and expose all data from genesis and/or from the most recent chain upgrade (depending on the data). Granted, we are pushing to offload more complex query patterns to indexers such as Numia. Even so, archive nodes will continue to serve an important role.
However, if you look at any major chain, the application state is in the multi-TB range, in some cases after only a year of being live, which is simply unacceptable. Even querying these nodes is awfully slow, to the point of being unusable.
This ties pretty heavily into the other major theme, IAVL & Performance. However, it warrants its own discussion: even if/when we separate SS and SC and improve IAVL, there is still design needed around how to store and represent historical data logically and on disk (see the sketch below). Ref #13317.
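For concreteness, here is one hypothetical way the SS/SC split could be expressed as interfaces. The names and signatures are invented for illustration (this is not the store v2 API): SS serves versioned reads and prunes on its own schedule, while SC only produces roots and proofs and never needs to serve reads or iterate.

```go
package storetypes

// KVPair is a single write in a block's changeset; Delete marks removal.
type KVPair struct {
	Key, Value []byte
	Delete     bool
}

// Proof is a placeholder for an ICS-23-style (non-)existence proof.
type Proof struct {
	Spec []byte
	Ops  [][]byte
}

// StateStorage holds raw key/value data, including enough history to serve
// queries at recent heights, and can prune old versions independently of
// the commitment structure.
type StateStorage interface {
	Get(storeKey string, version uint64, key []byte) ([]byte, error)
	ApplyChangeset(version uint64, changes map[string][]KVPair) error
	Prune(retainVersions uint64) error
}

// StateCommitment only produces root hashes and (non-)membership proofs;
// it never serves direct state reads and never iterates.
type StateCommitment interface {
	ApplyChangeset(version uint64, changes map[string][]KVPair) (rootHash []byte, err error)
	Prove(version uint64, storeKey string, key []byte) (Proof, error)
}
```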
cc @tac0turtle @ValarDragon @yihuang @musalbas