Checkpointing of Wasm container with podman+crun fails : Can't lookup mount
#1204
Comments
So this does not seem to be trivial. We are not actually trying to checkpoint the wasm application, but the wasm runtime. Not sure it makes sense to be able to checkpoint the wasm runtime. The easier solution would be if the wasm runtime supports checkpointing because the process we are trying to checkpoint is not running directly on Linux. One could say we should be able to checkpoint the runtime. That could be possible. From my understanding there is a mount in the runtime process CRIU cannot handle. The log file says:
I am able to reproduce this locally. On my system I see the last mount ID (by looking at
So, somewhere during setup of the wasm runtime the mount id
But even if we are able to configure the mounts correctly, I am not sure checkpoint/restore will be easily possible on this process. All the libraries used by the runtime process are only available on the host system and not in the container. I do not understand enough of how the wasm runtime is configured, how it works, or whether anything is missing during the setup of the runtime. I tried to checkpoint the wasm runtime (
@Snorch do you have any ideas why the mount id is not visible in the container? Any suggestions on what could be done to solve this?
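To make the "mount id is not visible" observation concrete, here is a minimal sketch of how one could compare the mount IDs referenced by a process's open files with the mounts listed in its mount table. It assumes the standard Linux `/proc/<pid>/mountinfo` and `/proc/<pid>/fdinfo/<fd>` text formats. The sample data and the function names are hypothetical, and this is not how CRIU itself performs the lookup.

```python
# Sketch only: function names and sample values are illustrative assumptions.

def parse_mountinfo(text):
    """Map mount ID -> (parent ID, mount point, filesystem type)."""
    mounts = {}
    for line in text.splitlines():
        if " - " not in line:
            continue  # not a well-formed mountinfo line
        left, right = line.split(" - ", 1)
        fields = left.split()
        # fields: mount ID, parent ID, major:minor, root, mount point, ...
        mounts[int(fields[0])] = (int(fields[1]), fields[4], right.split()[0])
    return mounts


def fd_mount_id(fdinfo_text):
    """Return the mnt_id recorded in an fdinfo file, or None if absent."""
    for line in fdinfo_text.splitlines():
        if line.startswith("mnt_id:"):
            return int(line.split(":", 1)[1])
    return None


def missing_mount_ids(mountinfo_text, fdinfo_texts):
    """Mount IDs used by open files but absent from the mount table."""
    known = parse_mountinfo(mountinfo_text)
    used = {fd_mount_id(t) for t in fdinfo_texts}
    return sorted(i for i in used if i is not None and i not in known)


# Made-up sample: one visible mount, one fd pointing at an unknown mount.
SAMPLE_MOUNTINFO = (
    "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue\n"
)
SAMPLE_FDINFOS = [
    "pos:\t0\nflags:\t0100000\nmnt_id:\t36\n",
    "pos:\t0\nflags:\t02\nmnt_id:\t999\n",
]

print(missing_mount_ids(SAMPLE_MOUNTINFO, SAMPLE_FDINFOS))
```

On a real system the inputs would be read from `/proc/<pid>/mountinfo` and each file under `/proc/<pid>/fdinfo/` for the container process.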
This is the source of the problem. CRIU does not support dumping external resources. If you run some app and it has a file mapping in memory, CRIU does not save the memory belonging to this file mapping to the images. Roughly speaking, CRIU relies on being able to recreate those mappings from files in the container filesystem on restore. So if the backing file of the mapping is not available inside the container filesystem, CRIU would not be able to restore it (e.g. if the mount is not available inside the container filesystem, CRIU would not be able to find the file on it). To support dumping some external resource (file) in a container, one should explicitly specify each such resource via CRIU options. https://github.com/checkpoint-restore/criu/blob/33dd66c6fc93c47213aaa0447a94d97ba1fa56ba/Documentation/criu.txt#L236
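As an illustration of the point above, the following sketch lists file-backed mappings of a process whose backing files cannot be found under the container root filesystem. It assumes the standard `/proc/<pid>/maps` line format. The function names, the `exists` parameter, the sample paths, and the `/rootfs` location are all hypothetical, and absolute symlinks inside the root filesystem are not handled.

```python
# Sketch only: names, paths, and sample data are illustrative assumptions.
import os


def file_backed_paths(maps_text):
    """Collect absolute paths of file-backed mappings from maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, [pathname]
        parts = line.split(None, 5)
        if len(parts) == 6 and parts[5].startswith("/"):
            paths.add(parts[5].strip())
    return paths


def unavailable_in_rootfs(maps_text, rootfs, exists=os.path.exists):
    """Return mapped files that do not exist under the given rootfs."""
    missing = []
    for path in sorted(file_backed_paths(maps_text)):
        if not exists(os.path.join(rootfs, path.lstrip("/"))):
            missing.append(path)
    return missing


# Made-up sample: one file inside the container, one library on the host.
SAMPLE_MAPS = """\
55d0c0a00000-55d0c0a21000 r-xp 00000000 fd:01 131 /app/hello.wasm
7f2a10000000-7f2a10200000 r-xp 00000000 fd:00 4242 /usr/lib64/libwasmedge.so.0
7f2a20000000-7f2a20021000 rw-p 00000000 00:00 0
7ffd5a3e0000-7ffd5a401000 rw-p 00000000 00:00 0 [stack]
"""

# Pretend only /rootfs/app/hello.wasm exists in the container filesystem.
fake_exists = lambda p: p == "/rootfs/app/hello.wasm"
print(unavailable_in_rootfs(SAMPLE_MAPS, "/rootfs", exists=fake_exists))
```

Any path reported this way is a candidate for the kind of external resource that would have to be declared explicitly to CRIU.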
Thanks for your detailed answers.
Indeed, for the checkpoint / restore use case, only checkpointing the wasm application would suffice. More precisely, we would need to save the binary + its internal state, i.e. the whole content of the wasm runtime virtual memory (some optimizations may be made by saving only parts of the wasm runtime virtual memory, but I do not think the gained performance will be significant, and I do not know the wasm specs well enough to be able to tell if it is even feasible). The main obstacle I see to this is that the internal state of the wasm runtime virtual memory may depend on the runtime used. This means that checkpointing and restoring would depend on the wasm runtime embedded within crun, and also its version (which is, from what I know, not accessible once embedded within crun). It also means that the checkpointing functionality should be implemented within all the wasm runtimes that can run in containers (at least those that can be embedded within crun, which are, from what I know,
In fact, I am more interested in checkpointing the whole runtime / container than just the contents of the wasm app. I'm using checkpointing for forensic analysis purposes, and being able to look at the whole runtime instead of only the app seems more interesting, as it can enable us to detect a runtime compromise or breakout. Moreover, when using checkpointing at large scale on containers, as the recent introduction of checkpointing to the Kubernetes world allows, we're looking at automating checkpointing and analysis to help detect compromises within containers. For this kind of automation (and many other use cases that may arise from the democratization of container checkpointing), having the same format to analyze for both classic and wasm containers would be essential. I'm not sure that checkpointing only the wasm application would give a level of detail and flexibility comparable to checkpointing the whole container or runtime. Finally, a wasm container can contain more than just the wasm app: configuration files, storage (database or other), other applications or libraries... which may make checkpointing only the state of the wasm app less relevant.
crun is embedding a wasm runtime (only the core runtime). To be able to run a wasm container, the wasm runtime library (typically
Indeed, this means that, contrary to classic containers, where all the binaries needed to run the containers (coreutils and more) are present within the container image and the host only needs to provide access to its kernel, wasm containers use an external library (and more?). I don't know how container checkpointing works internally, and if it is supposed to depend on the
@Snorch: Thanks for the highlights. I only tried to checkpoint the container through podman or crun, and the
I'm also wondering how the wasm library is loaded from within the container. Is it mounted inside the container? With the right permissions / namespaces / etc.?
In theory it should be possible to restore a checkpoint from runc with crun; there is nothing runtime specific in the checkpoint. I think it does not work currently, but just because nobody looked into making it possible. I do not think there is a real technical problem. The main problem, from my point of view, is the libraries used. If you do something like
For CRIU to restore a process, all used libraries must be exactly the same. Not just ABI compatible: all open files must be exactly the same. So if, between checkpointing and restoring, even one used resource (most likely a library) is updated, you cannot restore it. If all files are in the container they will probably not change. From my point of view it does not make sense to implement wasm application checkpointing and restoring. I understand what you are trying to do, but to make it work I think it would make more sense to have crun set up wasm in such a way that CRIU does not fail. Restoring would still be difficult if anything changed on the host. Maybe it would make sense to integrate checkpointing in each wasm runtime, just like the JVM tries to do for faster startup.
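The "files must be exactly the same" requirement can be illustrated with a small sketch that fingerprints files at checkpoint time and compares them before restore. This is not CRIU's own validation logic. The function names and the choice of size plus SHA-256 are assumptions made purely for illustration.

```python
# Sketch only: not CRIU's validation; names and approach are assumptions.
import hashlib


def fingerprint_bytes(data):
    """Fingerprint file content as (size, SHA-256 hex digest)."""
    return (len(data), hashlib.sha256(data).hexdigest())


def fingerprint_file(path):
    """Fingerprint a file on disk by reading its full content."""
    with open(path, "rb") as f:
        return fingerprint_bytes(f.read())


def changed_files(before, after):
    """Paths whose fingerprint differs or that disappeared since 'before'."""
    return sorted(p for p, fp in before.items() if after.get(p) != fp)


# Made-up example: a host library is updated between checkpoint and restore.
at_checkpoint = {
    "/app/hello.wasm": fingerprint_bytes(b"wasm module"),
    "/usr/lib64/libexample.so": fingerprint_bytes(b"library v1"),
}
at_restore = {
    "/app/hello.wasm": fingerprint_bytes(b"wasm module"),
    "/usr/lib64/libexample.so": fingerprint_bytes(b"library v2"),
}

print(changed_files(at_checkpoint, at_restore))
```

A non-empty result would indicate that a restore on this host is likely to fail, which is why keeping every mapped file inside the container image makes restores more predictable.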
This is the same issue as checkpoint-restore/criu#2170. I'm opening it here on the advice of @adrianreber, who thinks this issue is related to the Wasm implementation in crun and not a problem within CRIU.
Description
When trying to checkpoint a wasm container started with podman + crun with wasmedge support, the checkpointing fails with an error like:
This happens on both up-to-date Fedora 38 (btrfs) and Debian 11 (ext4). For both OSes the error at the end of the `dump.log` file is the same, except for the mount number and pid.
Steps to reproduce the issue:
And build with
You can check it is running with `podman logs demo-wasm-1`. You should see a lot of "Hello Wasm" printed.
And notice it is failing.
Describe the results you received:
The checkpointing of the container fails
Describe the results you expected:
The checkpointing succeeds
Additional information you deem important (e.g. issue happens only occasionally):
The issue happens with the simplest of Wasm containers. I was able to checkpoint and restore normal containers (Debian and others) on the same machine without any issue.
logs and information:
Output of `podman container checkpoint` command:
`dump.log` file is attached: dump.log
Output of `criu --version`:
Output of `criu check --all`:
Podman version 4.5.0
`crun --version`:
Additional environment details:
Tried on both Fedora 38 (btrfs) and Debian 11 (ext4) in VMs. CRIU installed from the respective package managers. Outputs are from the Fedora machine. Both crun builds were using wasmedge as the wasm runtime, but I'll check if the issue is also present with other wasm runtimes like wasmtime and wasmer.