Adjust memfd offset allocation direction properly #11386
Open
copybara-service wants to merge 1 commit into master from test/cl719036542
Conversation
## Background

Hello gVisor community! This is Xuzhou (Joe) from Snowflake Inc. We are currently working on using gVisor internally as our secure sandboxing mechanism. We met in person with the gVisor team last October to share our use case and experience, and as part of that meeting we (the Snowflake team) committed to contributing our internal fixes and improvements back upstream.

As part of compatibility testing against our previous mechanism, we found some behavioral discrepancies between the gVisor-emulated kernel and the native Linux kernel when making mmap system calls, so we wanted to raise a pull request and see whether this makes sense.

## Issue we observed

Exactly the same workload creates ~100 kernel VMA entries on the native Linux kernel, while it creates ~70,000 VMA entries under gVisor.

## Sample workload

```
def alloc(mb):
    # Append rows to the DataFrame
    num_rows = 1024 * 1024 * mb
    # Initialize an empty list to store rows
    rows = []
    # Collect rows in the list (instead of using pd.concat in every iteration)
    for i in range(num_rows):
        rows.append(['*'])  # Append row as a list
    # Create the DataFrame all at once
    return "123"


# This would allocate around 20 GiB of memory
alloc(250)
```

When running the above Python job directly with `python3 test_mmap.py`, the process creates at most around 129 VMA entries, counted from the entries under `/proc/<pid>/maps`:

![Screenshot 2025-01-13 at 4 49 53 PM](https://github.com/user-attachments/assets/c886a07d-2cfc-4f09-ac15-d064afe5c3ee)

However, when running it under gVisor, gVisor creates around 74,067 entries, roughly 600 times more. We have `/proc/sys/vm/max_map_count` set to its default of 65,530, so a workload like this crashes the sandbox:

![Screenshot 2025-01-13 at 4 51 06 PM](https://github.com/user-attachments/assets/71efa5b0-bec2-43d5-bcff-89cb811c6ad7)

## Analysis

After looking at the VMA entries, it turns out the kernel is unable to coalesce the contiguous VMA entries because the offset into the backing memfd is allocated in the opposite direction from the address space:

```
...
// Kernel coalescence is NOT happening
e321d8140000-e321d8180000 rw-s 296d40000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
e321d8180000-e321d81c0000 rw-s 296d00000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
e321d81c0000-e321d8200000 rw-s 296cc0000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
e321d8200000-e321d8240000 rw-s 296c80000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
e321d8240000-e321d8280000 rw-s 296c40000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
e321d8280000-e321d82c0000 rw-s 296c00000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
e321d82c0000-e321d8300000 rw-s 296bc0000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
e321d8300000-e321d8340000 rw-s 296b80000 00:01 3732459  /memfd:runsc-memory (deleted) -- 256 KiB
...

// Kernel coalescence IS happening
e321dd200000-e321dd400000 rw-s 4a600000 00:01 3732459  /memfd:runsc-memory (deleted) --  2 MiB
e321dde00000-e321de600000 rw-s 4a800000 00:01 3732459  /memfd:runsc-memory (deleted) --  8 MiB
e321de600000-e321dec00000 rw-s 4b200000 00:01 3732459  /memfd:runsc-memory (deleted) --  8 MiB
e321dec00000-e321df400000 rw-s 4ba00000 00:01 3732459  /memfd:runsc-memory (deleted) --  8 MiB
e321df400000-e321df600000 rw-s 4c600000 00:01 3732459  /memfd:runsc-memory (deleted) --  2 MiB
e321df600000-e321e0800000 rw-s 4ca00000 00:01 3732459  /memfd:runsc-memory (deleted) -- 18 MiB
e321e0800000-e321e0e00000 rw-s 4de00000 00:01 3732459  /memfd:runsc-memory (deleted) --  6 MiB
e321e0e00000-e321e2a00000 rw-s 4f000000 00:01 3732459  /memfd:runsc-memory (deleted) -- 28 MiB
e321e2a00000-e321e2c00000 rw-s 50e00000 00:01 3732459  /memfd:runsc-memory (deleted) --  2 MiB
...
```

We are seeing two issues here:

1. The last faulted address of the vma is compared against the current fault to determine the ideal offset allocation direction here: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/pma.go#L224. However, when the vma has no `lastFault`, the offset allocation defaults to `pgalloc.BottomUp`. On the servers we use, gVisor's address space allocation direction is `TopDown`, so we end up with the behavior described above: virtual addresses are allocated top-down while memfd file offsets are allocated bottom-up. If our understanding is correct, when the vma has no `lastFault`, or its `lastFault` address equals the current faulted address, the offset allocation direction should follow the address space allocation direction (see the sketches after this list).
2. gVisor also merges vmas in its own in-memory vma data structure, but during such a merge the `lastFault` value can be lost when one vma is merged into another, so the condition at https://github.com/google/gvisor/blob/master/pkg/sentry/mm/pma.go#L224 always evaluates to `false`.
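To make the coalescing requirement concrete: the kernel can only merge two adjacent mappings of the same file when both the virtual addresses and the file offsets are contiguous in the same direction. The sketch below uses hypothetical types and illustrative values drawn from the excerpts above; it is not kernel or gVisor code.

```go
// Minimal sketch (hypothetical types): when can two adjacent file-backed
// mappings be merged into one VMA?
package main

import "fmt"

// mapping describes one /proc/<pid>/maps entry backed by a file.
type mapping struct {
	start, end uint64 // virtual address range [start, end)
	fileOffset uint64 // offset into the backing file at start
}

// canCoalesce reports whether next can be merged into prev: the address
// ranges must touch, and the file offset must continue where prev left off.
func canCoalesce(prev, next mapping) bool {
	return prev.end == next.start &&
		prev.fileOffset+(prev.end-prev.start) == next.fileOffset
}

func main() {
	// Addresses grow upward but file offsets grow downward (the pattern in
	// the "NOT happening" excerpt above): no merge is possible.
	a := mapping{start: 0xe321d8140000, end: 0xe321d8180000, fileOffset: 0x296d40000}
	b := mapping{start: 0xe321d8180000, end: 0xe321d81c0000, fileOffset: 0x296d00000}
	fmt.Println(canCoalesce(a, b)) // false

	// Addresses and offsets both grow upward (illustrative 256 KiB pieces
	// of the "IS happening" excerpt): the kernel can merge these.
	c := mapping{start: 0xe321dd200000, end: 0xe321dd240000, fileOffset: 0x4a600000}
	d := mapping{start: 0xe321dd240000, end: 0xe321dd280000, fileOffset: 0x4a640000}
	fmt.Println(canCoalesce(c, d)) // true
}
```

In the non-coalescing excerpt, every next mapping's file offset is 256 KiB lower instead of 256 KiB higher, so this check can never pass.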
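For issue 1, here is a minimal sketch of the direction choice we have in mind, using a simplified `vma` record and `Direction` type as stand-ins for the actual gVisor types in pkg/sentry/mm and pkg/sentry/pgalloc (this is not the actual diff in this PR):

```go
// Sketch of picking the memfd offset allocation direction for a fault.
// Hypothetical types; only the fallback logic matters here.
package main

import "fmt"

type Direction int

const (
	BottomUp Direction = iota
	TopDown
)

// vma is a simplified stand-in for gVisor's in-memory VMA record.
type vma struct {
	lastFault uint64 // 0 means "no fault recorded yet"
}

// allocationDirection picks the offset allocation direction for a fault at
// addr. With no usable fault history, it follows the address-space allocation
// direction instead of unconditionally choosing BottomUp.
func allocationDirection(v vma, addr uint64, mmDirection Direction) Direction {
	if v.lastFault == 0 || v.lastFault == addr {
		// No history (or no movement): follow the mmap layout direction so
		// that file offsets grow the same way as virtual addresses.
		return mmDirection
	}
	if addr < v.lastFault {
		return TopDown // faults are walking down the address space
	}
	return BottomUp // faults are walking up the address space
}

func main() {
	// On a TopDown mmap layout, the very first fault in a vma should also
	// allocate file offsets top-down so that later mappings can coalesce.
	fmt.Println(allocationDirection(vma{}, 0xe321d8140000, TopDown) == TopDown) // true
}
```

The point is only the fallback: without fault history, keep the file offset direction consistent with the address space allocation direction.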
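For issue 2, a sketch of what preserving `lastFault` across a vma merge could look like, again with hypothetical types rather than gVisor's actual vma set implementation:

```go
// Sketch of issue 2: when two adjacent vmas are merged in the sentry's own
// vma data structure, the surviving entry should keep a lastFault value
// rather than silently dropping it. Hypothetical types and field names.
package main

import "fmt"

type vma struct {
	start, end uint64
	lastFault  uint64 // 0 means "no fault recorded yet"
}

// merge folds right into left (assuming left.end == right.start) and keeps
// whichever lastFault is still meaningful, preferring the newer mapping's.
func merge(left, right vma) vma {
	merged := vma{start: left.start, end: right.end, lastFault: right.lastFault}
	if merged.lastFault == 0 {
		merged.lastFault = left.lastFault
	}
	return merged
}

func main() {
	left := vma{start: 0x1000, end: 0x2000, lastFault: 0x1800}
	right := vma{start: 0x2000, end: 0x3000} // no fault recorded yet
	fmt.Printf("%#x\n", merge(left, right).lastFault) // 0x1800, not lost
}
```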
## Why do we need to fix the issue

Currently we run our workloads on secured servers distributed by our release team. Our native kernel image keeps `/proc/sys/vm/max_map_count` at its default value of 65,530. We know this is fairly low and we could request an increase, but we would like to get to the root cause and fix the discrepancy if possible, because:

1. Getting security approval for kernel configuration updates like this is a long process.
2. Scalability is a concern: our workloads can run on larger servers, such as ones with 1 TB of memory. If every 256 KiB memory request consumed its own VMA entry, in the worst case we would ultimately need to raise the value to 4,194,304 (1 TiB / 256 KiB)!
3. Tens of thousands of small VMA entries can hurt kernel performance, since the kernel needs more resources to manage them and to insert and delete them as needed.

Thus, we tried to fix both issues described above to the best of our knowledge.
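For reference, the VMA counts quoted in this PR were obtained by counting entries in `/proc/<pid>/maps`. A minimal standalone helper that reproduces that counting might look like the following; it is illustrative only and not part of the change:

```go
// Count total VMA entries and memfd-backed entries in /proc/<pid>/maps.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: countmaps <pid>")
		os.Exit(1)
	}
	f, err := os.Open("/proc/" + os.Args[1] + "/maps")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	total, memfd := 0, 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		total++
		if strings.Contains(scanner.Text(), "memfd:runsc-memory") {
			memfd++ // entries backed by the sandbox's memory file
		}
	}
	fmt.Printf("total VMAs: %d, runsc-memory VMAs: %d\n", total, memfd)
}
```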
After applying the change in this PR, the number of VMA entries stays at a reasonable level:

![Screenshot 2025-01-13 at 5 09 53 PM](https://github.com/user-attachments/assets/52edf7fd-6b15-40a5-9872-4d53f316e7c4)
Please let us know if our understanding is correct, and if the proposed change makes sense to you. Thank you!
FUTURE_COPYBARA_INTEGRATE_REVIEW=#11360 from Snowflake-Labs:xuzhoyin-vmas-fix 95db69e
PiperOrigin-RevId: 719036542