Reading for 10/26: Fancy Memory Management #406
-
I was impressed by the cleverness of hiding the time cost of redirection for compaction-without-relocation inside the existing virtual-page TLB lookup. A 16% memory saving for 2% runtime overhead is a tradeoff appealing to a wide range of applications.
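The trick, as I understand it, is that pointers never have to change because only the virtual-to-physical mapping changes beneath them. Here is a toy model of that idea (the names like `PAGE_SIZE` and `ToyVM` are mine, not the paper's; real meshing does this with `mmap` over a shared file):

```python
# Toy model of compaction without relocation: virtual pages stay fixed,
# only the virtual->physical mapping changes, so pointers never move.

PAGE_SIZE = 8

class ToyVM:
    def __init__(self):
        self.page_table = {}   # virtual page -> physical page
        self.phys = {}         # physical page -> list of words

    def map_page(self, vpage):
        ppage = len(self.phys)
        self.phys[ppage] = [None] * PAGE_SIZE
        self.page_table[vpage] = ppage

    def write(self, vaddr, value):
        ppage = self.page_table[vaddr // PAGE_SIZE]
        self.phys[ppage][vaddr % PAGE_SIZE] = value

    def read(self, vaddr):
        ppage = self.page_table[vaddr // PAGE_SIZE]
        return self.phys[ppage][vaddr % PAGE_SIZE]

    def mesh(self, vpage_a, vpage_b):
        """Merge two virtual pages whose live slots don't overlap:
        copy b's live words into a's physical page, then retarget b's
        virtual page. Virtual addresses (pointers) are unchanged."""
        pa, pb = self.page_table[vpage_a], self.page_table[vpage_b]
        assert all(x is None or y is None
                   for x, y in zip(self.phys[pa], self.phys[pb]))
        for i, word in enumerate(self.phys[pb]):
            if word is not None:
                self.phys[pa][i] = word
        self.page_table[vpage_b] = pa   # remap: one physical page freed
        del self.phys[pb]

vm = ToyVM()
vm.map_page(0); vm.map_page(1)
vm.write(2, "x")              # slot 2 on virtual page 0
vm.write(PAGE_SIZE + 5, "y")  # slot 5 on virtual page 1
vm.mesh(0, 1)
print(vm.read(2), vm.read(PAGE_SIZE + 5))  # same pointers still work
```

The redirection cost the comment mentions is hidden in the `page_table` lookup, which the hardware already performs (and caches in the TLB) on every access.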
-
The paper doesn't give us much context on what a Robson bound is. That led me to this paper, which gives upper and lower bounds for "the amount of store necessary to operate a dynamic storage allocation system, subject to certain constraints, with no risk of breakdown due to storage fragmentation." Robson defines the relevant quantities and then puts lower and upper bounds on them on the second page. These are really hairy equations. I think in big-O notation the lower bound works out to roughly M log n, where M is the most memory live at any one time and n is the ratio of the largest to smallest allocation size.
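Spelled out (this is my reading of Robson's result, not something stated in the thread), the bound is usually summarized as:

```latex
\text{worst-case memory} \;=\; \Theta\!\left(M \log_2 n\right),
\qquad M = \text{max bytes live at once},\quad
n = \frac{\text{largest allocation size}}{\text{smallest allocation size}}
```

In other words, any non-compacting allocator can be forced to hold a logarithmic factor more memory than is ever live, which is the gap that meshing tries to claw back.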
-
The introduction writes that
I can imagine why we would do this when there is a strict limit on the memory a program can use. I'm not sure, but maybe early Nintendo games used these techniques when address spaces were small? I wonder if any entity is doing such heinous optimizations in the current world, and why.
-
The paper only compares with Firefox, Redis, and SPECint2006. For SPECint2006, the authors write that
This is surprising because this paper is fairly new (2019). Are there really no benchmarks that folks can use to measure memory operations?
-
I thought that the paper offered an intriguing approach to tackling memory fragmentation, a prevalent issue in C/C++ applications. By presenting a novel memory allocator that opportunistically merges disjoint free blocks, they provide a potential solution that bridges the gap between traditional memory allocators and garbage collectors. One key question that arises is how generalizable this approach is across a wide variety of real-world applications and workloads. While their results are promising, it would be interesting to explore the overhead introduced by this system in different scenarios and whether certain use cases might present edge cases where Mesh's benefits diminish. Additionally, how does Mesh perform in multi-threaded or concurrent applications, which have their own unique set of challenges?
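The "opportunistically merges disjoint free blocks" part has a pleasingly simple core, as I read it: two spans can be combined when their occupancy bitmaps never overlap. A minimal sketch (the bitmap encoding here is my own simplification):

```python
# Sketch of the core meshability test: two spans can be "meshed" when
# no slot offset is live in both of their occupancy bitmaps.

def meshable(bitmap_a: int, bitmap_b: int) -> bool:
    """Spans mesh iff no bit is set in both occupancy bitmaps."""
    return (bitmap_a & bitmap_b) == 0

# span A has live objects in slots 0 and 3; span B in slots 1 and 2
a = 0b1001
b = 0b0110
print(meshable(a, b))   # True: live slots never collide
print(meshable(a, a))   # False: identical occupancy always collides
```

The cheap bitwise test is what makes it practical to search for mesh candidates at runtime without scanning the objects themselves.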
-
Like @rcplane, I was also pretty impressed with the clever way the authors perform relocation. One thing this paper made me realize is how big a deal fragmentation is. Mesh itself introduces quite a few opportunities for increased memory usage: allocating a 32-byte and a 64-byte object requires 8 KB because each needs its own span; a 33-byte object gets rounded up to the next object size; and the allocation bitmaps and other required metadata take extra memory. I suppose I never needed to think about fragmentation too much, but seeing the significant overall memory savings despite all this extra memory made it stick out to me. That said, when running larger programs that need a lot of memory, the extra page per object size per thread and the other additional memory introduced by Mesh probably don't account for much of the total usage anyway.
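The overheads above are easy to sanity-check with a back-of-envelope model (the size classes and span size below are simplified assumptions of mine, not Mesh's exact configuration):

```python
# Back-of-envelope sketch of size-class overheads: rounding requests up
# to a size class, and one span per size class.

SPAN_BYTES = 4096  # one span = one page in this toy model
SIZE_CLASSES = [16, 32, 64, 128, 256]

def size_class(request: int) -> int:
    """Round a request up to the next size class (internal fragmentation)."""
    for c in SIZE_CLASSES:
        if request <= c:
            return c
    raise ValueError("large allocation; handled outside size classes")

# A 32-byte and a 64-byte object land in different size classes, so each
# needs its own span: 8 KB of pages holding under 100 bytes of data.
spans_needed = len({size_class(32), size_class(64)})
print(spans_needed * SPAN_BYTES)   # 8192

# A 33-byte request is rounded up to the 64-byte class.
print(size_class(33))              # 64
```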
-
Like what others commented, I am really impressed by the 16% memory usage reduction with only a 2% performance hit. I think Mesh would be a really good drop-in replacement for a system's implementation of `malloc`. Also, I am really intrigued by Mesh's use of carefully placed locks to avoid a stop-the-world scheme. The paper does mention some concurrent garbage collectors that also avoid stop-the-world pauses, but those add a decent amount of overhead that Mesh doesn't seem to. The locks do seem more fine-grained, but it's not super clear to me how exactly Mesh avoids the same level of overhead as those collectors.
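My rough mental model of the "carefully placed locks" point, sketched below: instead of stopping the world, only the two spans being meshed are locked, so threads working on unrelated spans keep allocating. All names here are illustrative, not from the paper:

```python
# Fine-grained locking sketch: mesh two spans under their own locks only.
import threading

class Span:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.live = set()   # offsets of live objects

def mesh_pair(a: Span, b: Span) -> bool:
    # lock in a fixed (address-like) order to avoid deadlock
    first, second = sorted([a, b], key=lambda s: s.name)
    with first.lock, second.lock:
        if a.live & b.live:
            return False    # raced: no longer disjoint, give up
        a.live |= b.live    # merge b's live objects into a
        b.live.clear()
        return True

a, b, c = Span("a"), Span("b"), Span("c")
a.live, b.live = {0, 3}, {1, 2}
# threads using span c are never blocked by this mesh
ok = mesh_pair(a, b)
print(ok, sorted(a.live))
```

Compared to a concurrent GC, there is no tracing and no read/write barrier on every access; the synchronization cost is confined to the rare meshing events, which may be part of the overhead difference.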
-
I'm especially fascinated by the use of randomization for memory management; I'm used to the paradigm of programming languages and compilers seeking completely verified correctness, but it seems like there are plenty of performance gains to be made if we relax some guarantees. I'm wondering if there are other applications of randomized/approximate algorithms in PL? In practice, I would feel very comfortable with the probabilities claimed in the paper. My opinion is that if one wants better certainty than, say,
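The flavor of the probabilistic claim is easy to reproduce with a quick simulation: if objects land at random offsets, two lightly occupied spans are very likely disjoint, hence meshable. The parameters below are mine, not the paper's:

```python
# Simulation: probability that two randomly occupied spans are meshable.
import random

def random_bitmap(slots: int, live: int, rng: random.Random) -> set:
    return set(rng.sample(range(slots), live))

def mesh_probability(slots=256, live=8, trials=10_000, seed=0) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = random_bitmap(slots, live, rng)
        b = random_bitmap(slots, live, rng)
        if not (a & b):          # disjoint -> meshable
            hits += 1
    return hits / trials

p = mesh_probability()
print(round(p, 3))  # should be close to (1 - live/slots)**live, about 0.78
```

Randomized allocation is exactly what makes this analysis go through: an adversarial (or merely unlucky) deterministic placement could make disjoint spans rare, while random placement gives a provable floor.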
-
I found the analysis section super enjoyable to read. It was neat that, for a tool with such performance improvements and so little overhead, the proofs of the theoretical guarantees were quite simple and elegant. Since our first discussion, I've been noticing whether papers do a good job of describing their experimental setup, and I appreciated that this one in particular does pretty well.
-
I thought this was a really cool paper overall! The way algorithm analysis was integrated with lower-level systems concepts particularly intrigued me—it's not often that you see a reduction to an NP-hard problem and detailed descriptions of how paging works in the same paper. I also really liked the way the authors evaluated their method: profiling real-world use cases beyond standard benchmark suites is certainly something to commend.
-
This is the best paper we've read so far, in my opinion. It is a really clever use of the virtual memory system. However, I wonder if relocating the physical pages can cause some momentary cache thrashing. Maybe, as a fraction of total runtime, the additional cache misses are very small. But I'd imagine Mesh still wouldn't be good for programs that need very consistent pacing with no stutter, like video games.
-
This was a great read! I did have a couple of thoughts:
-
The paper uses the canonical idea of "randomize when you are (literally) stuck" in a clever way. Adding random jitter when backing off during transmission, or flipping local coins to break out of a deadlock, are examples of the same mindset. Basically, if you try to fit all possible scenarios into the same bag, you pay the price of the worst one among them, whereas by randomizing you give yourself a higher chance of dodging that price. Compared with the GC dichotomy we've been seeing, namely RC vs. mark'n'sweep, this paper takes a more holistic approach to memory management rather than just collecting garbage. I guess this is the key perspective that enabled the authors to present a seemingly new approach with new potential. Taken as a whole, memory management from allocation through collection gives us more options and knobs to play with.
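The backoff example above can be made concrete; this is the standard full-jitter variant of exponential backoff (parameter values are mine). Two colliding parties that wait a deterministic amount of time collide forever, while jitter breaks the symmetry:

```python
# "Randomize when stuck": full-jitter exponential backoff.
import random

def backoff_delays(attempts: int, base=0.1, cap=5.0, rng=None):
    """Attempt k waits a uniform random time in [0, min(cap, base * 2^k)]."""
    rng = rng or random.Random(42)
    return [rng.uniform(0, min(cap, base * 2 ** k)) for k in range(attempts)]

d = backoff_delays(6)
print(all(0 <= x <= 5.0 for x in d))  # every delay stays within the cap
```

The same reasoning applies to Mesh's randomized placement: a deterministic allocator can be driven into its worst case, whereas a randomized one makes the worst case improbable rather than impossible.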
-
Hi everyone!
Here's the discussion thread for the Fancy Memory Management paper discussion on Thursday October 26. The paper can be found here.
Post any thoughts, questions, or comments you might have before the discussion. Looking forward to it!