Adjust memfd offset allocation direction properly #11386

Open: copybara-service[bot] wants to merge 1 commit into master from Snowflake-Labs:xuzhoyin-vmas-fix
Conversation

copybara-service[bot] commented:

## Background
Hello gVisor community! This is Xuzhou (Joe) from Snowflake Inc. We are currently working on adopting gVisor internally as our secure sandboxing mechanism. We met in person with the gVisor team last October to share our use case and experiences, and as part of that meeting we (the Snowflake team) committed to contributing our internal fixes and improvements back upstream.

As part of compatibility testing against our previous mechanism, we found some behavioral discrepancies between the gVisor-emulated kernel and the native Linux kernel when making mmap system calls, so we wanted to raise a pull request and see if this fix makes sense.

## Issue we observed
Exactly the same workload creates ~100 kernel VMA entries on a native Linux kernel, while it creates ~70,000 VMA entries under gVisor.

## Sample workload
```python
def alloc(mb):
    # Number of rows to allocate.
    num_rows = 1024 * 1024 * mb

    # Collect rows in a list; each row is a small single-element list,
    # so all of the allocations stay live until the function returns.
    rows = []
    for i in range(num_rows):
        rows.append(['*'])

    return "123"

# This allocates around 20 GiB of memory.
alloc(250)
```

When running the above Python job directly with `python3 test_mmap.py`, the process creates at most around 129 VMA entries, as counted from `/proc/<pid>/maps`:
<img width="1682" alt="Screenshot 2025-01-13 at 4 49 53 PM" src="https://github.com/user-attachments/assets/c886a07d-2cfc-4f09-ac15-d064afe5c3ee" />

However, when running it under gVisor, gVisor creates around 74,067 entries, roughly 600 times as many. We have `/proc/sys/vm/max_map_count` set to its default of `65530`, so a workload like this crashes the sandbox:
<img width="1669" alt="Screenshot 2025-01-13 at 4 51 06 PM" src="https://github.com/user-attachments/assets/71efa5b0-bec2-43d5-bcff-89cb811c6ad7" />

## Analysis
After looking at the VMA entries, it turns out the kernel cannot coalesce the contiguous VMA entries because the offsets into the backing memfd are allocated in the opposite direction from the address space:

```
...
// Kernel coalescence is NOT happening
e321d8140000-e321d8180000 rw-s 296d40000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
e321d8180000-e321d81c0000 rw-s 296d00000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
e321d81c0000-e321d8200000 rw-s 296cc0000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
e321d8200000-e321d8240000 rw-s 296c80000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
e321d8240000-e321d8280000 rw-s 296c40000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
e321d8280000-e321d82c0000 rw-s 296c00000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
e321d82c0000-e321d8300000 rw-s 296bc0000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
e321d8300000-e321d8340000 rw-s 296b80000 00:01 3732459                   /memfd:runsc-memory (deleted) -- 256 KiB
...

// Kernel coalescence IS happening
e321dd200000-e321dd400000 rw-s 4a600000 00:01 3732459                    /memfd:runsc-memory (deleted) --  2 MiB
e321dde00000-e321de600000 rw-s 4a800000 00:01 3732459                    /memfd:runsc-memory (deleted) --  8 MiB
e321de600000-e321dec00000 rw-s 4b200000 00:01 3732459                    /memfd:runsc-memory (deleted) --  8 MiB
e321dec00000-e321df400000 rw-s 4ba00000 00:01 3732459                    /memfd:runsc-memory (deleted) --  8 MiB
e321df400000-e321df600000 rw-s 4c600000 00:01 3732459                    /memfd:runsc-memory (deleted) --  2 MiB
e321df600000-e321e0800000 rw-s 4ca00000 00:01 3732459                    /memfd:runsc-memory (deleted) -- 18 MiB
e321e0800000-e321e0e00000 rw-s 4de00000 00:01 3732459                    /memfd:runsc-memory (deleted) --  6 MiB
e321e0e00000-e321e2a00000 rw-s 4f000000 00:01 3732459                    /memfd:runsc-memory (deleted) -- 28 MiB
e321e2a00000-e321e2c00000 rw-s 50e00000 00:01 3732459                    /memfd:runsc-memory (deleted) --  2 MiB
...
```
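
To make the coalescing condition concrete: for shared file mappings, the kernel can only merge two adjacent VMAs when the second one's file offset continues exactly where the first one ends. A minimal sketch of that predicate (hypothetical types, not the kernel's actual code):

```go
// vma captures just the fields relevant to merging (illustrative only).
type vma struct {
	start, end uintptr // virtual address range [start, end)
	fileOffset uint64  // offset into the backing memfd at start
}

// canCoalesce reports whether next can be merged into prev: the virtual
// ranges must be adjacent and the file offsets must be contiguous in the
// same direction as the addresses.
func canCoalesce(prev, next vma) bool {
	return prev.end == next.start &&
		next.fileOffset == prev.fileOffset+uint64(prev.end-prev.start)
}
```

In the first excerpt above, each successive mapping's file offset is 256 KiB *lower* than its predecessor's, so this condition never holds; in the second excerpt the offsets ascend along with the addresses and merging succeeds.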

We are seeing two issues here (a sketch of the logic we have in mind follows this list):
1. We compare the last faulted address of the vma to determine the ideal offset allocation direction here: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/pma.go#L224. However, when a VMA does not have a `lastFault`, the offset allocation defaults to `pgalloc.BottomUp`. On the servers we use, gVisor's address space allocation direction is `TopDown`, which produces the behavior above: the address space is allocated top-down while memfd file offsets are allocated bottom-up. If our understanding is correct, when the VMA has no `lastFault`, or its `lastFault` address equals the current faulted address, the offset allocation direction should stay consistent with the address space allocation direction.
2. gVisor also performs VMA merging in its own in-memory VMA data structure. However, during the merge we might lose the `lastFault` value when one VMA is merged into another, causing the condition in https://github.com/google/gvisor/blob/master/pkg/sentry/mm/pma.go#L224 to always evaluate to `false`.
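
A rough sketch of the direction-selection logic we are proposing for the first issue (names like `chooseOffsetDirection` are illustrative; the actual change is in `pkg/sentry/mm/pma.go`):

```go
// Direction mirrors pgalloc's allocation direction (illustrative).
type Direction int

const (
	BottomUp Direction = iota
	TopDown
)

// chooseOffsetDirection picks the memfd offset allocation direction for a
// fault at addr. lastFault is 0 when the vma has never faulted before.
// addrSpaceDir is the direction the sentry's address space allocator uses.
func chooseOffsetDirection(lastFault, addr uintptr, addrSpaceDir Direction) Direction {
	if lastFault == 0 || lastFault == addr {
		// No usable fault history: stay consistent with the address
		// space allocation direction instead of defaulting to BottomUp.
		return addrSpaceDir
	}
	if addr < lastFault {
		// Faults are moving downward; allocate offsets top-down too.
		return TopDown
	}
	return BottomUp
}
```

The second issue is addressed separately, by carrying the `lastFault` value over when one VMA is merged into another so that this heuristic keeps its history.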

## Why we need to fix this issue
Currently we run our workloads on secured servers distributed by our release team. Our native kernel image has the default `/proc/sys/vm/max_map_count` value of `65530`. We know this is fairly low, and we could potentially request an increase. However, we would like to get to the root cause and fix this discrepancy if possible, because:
1. Getting security approval for kernel config updates like this is a long process.
2. Scalability is a concern: our workloads can run on larger servers, such as ones with 1 TB of memory. If each 256 KiB memory request consumed a single VMA entry, in the worst case we would ultimately need to increase the value to 4,194,304 (see the arithmetic below).
3. Having thousands of small VMA entries could negatively impact kernel performance, since the kernel needs more resources to manage and insert/delete those VMAs as needed.
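
For reference, the worst-case count above is simply the total memory divided by the per-mapping size:

$$
\frac{1\ \mathrm{TiB}}{256\ \mathrm{KiB}} = \frac{2^{40}}{2^{18}} = 2^{22} = 4{,}194{,}304
$$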

Thus, we tried to fix both issues mentioned above to the best of our knowledge.

After applying the change in this PR, we are able to keep the number of VMA entries at a reasonable level:
<img width="1675" alt="Screenshot 2025-01-13 at 5 09 53 PM" src="https://github.com/user-attachments/assets/52edf7fd-6b15-40a5-9872-4d53f316e7c4" />

Please let us know if our understanding is correct, and if the proposed change makes sense to you. Thank you!

FUTURE_COPYBARA_INTEGRATE_REVIEW=#11360 from Snowflake-Labs:xuzhoyin-vmas-fix 95db69e
PiperOrigin-RevId: 719036542
copybara-service[bot] added the `exported` label (Issue was exported automatically) on Jan 23, 2025.