feat(data/hashmap): initial implementation #812
Conversation
Why does it promise to be better than the existing implementation?
While I'm not completely against the idea of a rewrite (I wrote this library a long time ago, and the proofs are more convoluted than they would be if I wrote them today), it was actually written with performance and correctness in mind, and I would want some strong evidence that the replacement is better on all fronts first.
It's just an internal implementation thing. I see you did basically the same thing without the name.
It's definitely necessary. It is there so that …
You can't do the mod without it. I'll get back to this with the other points.
I'm fine with this. Obviously this wasn't around when I wrote it, and incorporating it is a good way to tie the library together.
This is a problem. I notice that you didn't write the most important function. I think you have identified that there are two data structures involved here: a fixed-size hashmap, and a growable hashmap built on the fixed hashmap, much like …
An easy solution to the bucket numbering problem is to require that the number of buckets is always a power of 2, i.e. 2^n for some n : ℕ. This works with the automatic resizing algorithm because resizing always doubles the length, and it also makes the modulo operation more efficient (it's just a bit mask).
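The power-of-2 suggestion could be sketched like this (a hypothetical `fin.lift_mod_pow2`, not part of the PR, written in the same style as the snippet below; I'm assuming the Lean 3 lemmas `nat.mod_lt` and `nat.pos_pow_of_pos`):

```lean
-- Hypothetical sketch: with 2^n buckets, the bucket index is hash % 2^n,
-- which a compiler can implement as a bit mask (hash AND (2^n - 1)).
-- The bound proof goes through because 2^n is positive.
def fin.lift_mod_pow2 (n : ℕ) {α} (f : α → ℕ) (a : α) : fin (2^n) :=
⟨f a % 2^n, nat.mod_lt _ (nat.pos_pow_of_pos n dec_trivial)⟩
```

Doubling on resize then just means going from `2^n` to `2^(n+1)` buckets, so the invariant is preserved automatically.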
Thanks for the in-depth response, @digama0.
Of course. There are quite a few lines spent on …
Why is it important that …
I'm not sure what you mean. I have a wrapper that allows you to use it, but I don't see why it's necessary for the …

```lean
def fin.lift_mod_pnat (n : ℕ+) {α} (f : α → ℕ) (a : α) : fin n :=
⟨f a % n, nat.mod_lt _ n.property⟩

def mk_empty_pos (n : ℕ+) {α} (β : α → Type v) (f : α → ℕ) : hashmap n β :=
mk_empty β (fin.lift_mod_pnat n f)
```
Yep.
I don't follow you. I have written …
Ah, I see what you mean. That makes sense, and I'd be happy to go with two hash maps. So, consider this one to be the fixed-size hash map, and we can build the dynamically sized hash map on top of it.
I get the efficiency of using powers of 2 for the array size, but I'm not sure what the “bucket numbering problem” is.
I think I'm with you on the difference between the fixed- and dynamically-sized hash maps. In the former, the size never changes, which means it's really a parameter, not a changeable value. In the latter, the size changes, and therefore the value should be kept. As for the hash function, I'm not sure. I have vacillated over whether it should be a parameter or a field. Since it doesn't change, a parameter feels more appropriate; however, I left it as a field for convenience's sake. I'd be happy to make it a parameter, though. But I'm not clear on why it should be treated differently for fixed- and dynamically-sized hash maps.
I'm referring to the inconvenience of dealing with …
You could have the function external for the global hashmap as well, but you will also have to externalize the validity predicate, which will make it less ergonomic. Leaving stuff out of the structure will make it easier to reason about, but less encapsulated. So the user facing version (the growable hashmap) should be encapsulated, while the internal version (the fixed hashmap) should make its assumptions clear. Additionally, externalizing the function (and leaving it out of the type of the hashmap too) means that you aren't storing unnecessary copies of it in the growable hashmap and the fixed hashmap, which is good both because it saves on space and because it avoids an unnecessary coherence condition asserting the relation between the two functions.
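The two layers being proposed could be roughly sketched as follows (all structure and field names here are hypothetical, not from the PR): the internal fixed-size map keeps the hash function external, while the user-facing growable map bundles a single copy of everything.

```lean
-- Internal, fixed-size layer: no hash function is stored, so its
-- assumptions stay explicit as hypotheses on the operations that
-- need them, and nothing is duplicated.
structure fixed_hashmap (n : ℕ+) (β : Type) :=
(buckets : array n (list β))

-- User-facing, growable layer: size and hash function are bundled,
-- so exactly one copy of the function exists and no coherence
-- condition between two stored copies is needed.
structure growable_hashmap (β : Type) :=
(sz   : ℕ+)
(hash : β → ℕ)
(data : fixed_hashmap sz β)
```

This matches the trade-off described above: the growable layer is encapsulated for users, while the fixed layer stays easy to reason about internally.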
Probably it should be documented. But as for why to expect it, this is something I would expect of any decent hashmap implementation for a general purpose programming language. It's cheap to keep track of the information, and the alternative is crazy expensive (it's the same cost as a fold over the whole map), so it seems like a good investment. But if we have two versions of the data structure for internal and external use, then I'm fine with this only being in the user-facing data structure.
I'm not convinced that storing the size is needed for a hash map in which the size is always determined by its structure and never used by the hash map itself. You could also make the argument that each of …

On the other hand, with a hash map whose structure depends on its size, such as the dynamically-sized hash map, in which the size needs to be calculated for each …
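For comparison, recomputing the size without a cached field is a full traversal of the table (a sketch with hypothetical names; I'm assuming the Lean 3 `array.foldl` signature `array n α → β → (α → β → β) → β`):

```lean
-- Without a stored size, counting entries costs a fold over every
-- bucket; a cached size field makes the same query O(1).
def fold_size {n : ℕ} {β : Type} (buckets : array n (list β)) : ℕ :=
buckets.foldl 0 (λ bucket acc, acc + bucket.length)
```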
All of these examples are built on …
What do you mean "used"? Data is used if it is accessed by methods from the public API, and …
Are you saying that in order for me (the user) to get the size of a hashmap, I have to wrap it with a number which tracks the size and prove the equality myself? That's ludicrous. Show me a hashmap API in any programming language where …
Well okay then, color me surprised. The Haskell implementation is pretty sophisticated too, so it's not for simplicity reasons. Maybe they don't think it gets used enough to be worth it? I'm still not convinced, though. Too bad it doesn't say why it was done this way.
I think performance is the key. There's some discussion at haskell-unordered-containers/unordered-containers#138.
Thanks for the link. Note that the Haskell implementation is recursive, so this would put the size in many places in Haskell, when we only have to store it once. If the performance overhead becomes too much, we can split it off, but if anything the discussion there indicates that this is actually an unusual choice for Haskell. We don't have nearly as many compiler optimizations that would make the relative performance worse. In particular …

Anyway, stop trying to remove functionality. As I said, this won't be merged unless it's better in every way, no regressions allowed.
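The recursive-vs-flat point can be illustrated with a toy contrast (hypothetical types only, not the actual Haskell or Lean definitions): in a recursive tree, a cached size would be duplicated at every node, whereas a flat table stores it exactly once at the root.

```lean
-- Recursive map: if each node cached a size, the count would be
-- duplicated at every level of the tree.
inductive sized_tree (β : Type) : Type
| leaf : β → sized_tree
| node : ℕ → sized_tree → sized_tree → sized_tree  -- size per node

-- Flat map: a single size field at the root, stored exactly once.
structure sized_table (β : Type) :=
(sz      : ℕ)
(buckets : list (list β))
```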
What is the next step here?
My understanding of the above is that this implementation is intended as a fixed-size hash map, and there is still going to be a dynamically-sized analogue (which bundles …). Is that correct?
I currently have no plans to continue this work. Please feel free to do with it what you will. |
@spl What changed your mind about this PR? |
I would like to see this (or something like it) eventually replace the existing `data/hash_map.lean`. This is just the beginning: there are plenty more definitions and theorems to be added, but I think the foundation is good and ready for review.

Some thoughts:

- `hashmap` fits better than `hash_map` with the current naming scheme found in the core Lean library as well as mathlib.
- … `hashmap` that I explored. While it's not exactly documentation on the existing work, I think it does provide a useful perspective.
- … `list.nodupkeys_join` to simplify proofs. With the existing theorem, there end up being several extra `pairwise_map`s, both in `list.nodupkeys_join` itself and in theorems that use it.