Stuck at "Cleaning... X files" for long periods of time #2444
Comments
Could you give an example project where this happens, e.g. CMake, Meson, etc.?
@scivision, this happens when building GIMP. I investigated a bit and found that it's waiting on this ppoll call:
This is occasionally also hitting GStreamer (which also uses meson). Attaching a debugger, I could confirm ninja was waiting inside ppoll() for events on a pipe (the only pipe open in that process, arguably the one given to meson, fd 4, listening to The weird thing is that doing When I detached the debugger, ninja unstuck itself. There is still the possibility that meson shared that pipe with another process I didn't catch, and that process terminated somewhere between the time I attached the debugger and the time I inspected for processes that had that same pipe open. Otherwise this would hint at a kernel bug, because you would expect ppoll() to return and notify
I managed to reproduce the bug a second time, and this time I caught who was keeping the write-side of the pipe open.
sccache is a compiler wrapper for cached compilation artifacts and remote compilation. It replaces your compiler with a wrapper, which is likely how it got executed by meson, and that wrapper spawns a sccache server if none is running. Quoting its README:
Also:
This explains how ninja got unstuck after a rather long time: the sccache server exited due to inactivity. What I haven't figured out yet is how sccache ends up holding on to the pipe. Generally when you daemonize a subprocess you want to close all file descriptors you don't need, and you can see in the
I had a look at how ninja creates subprocesses on POSIX systems. It has two different modes, depending on whether the console is shared with the subprocess or not.
If the console is not to be shared with the subprocess, a pipe is created and given to the subprocess as both stdout and stderr. Care is also taken to close the original file descriptors of both ends of the pipe. This all makes sense.
However, if the console is to be shared with the subprocess, the pipe is still created and intentionally left open for the subprocess, with the intention of using it to detect when the subprocess has exited (!).
// In the console case, output_pipe is still inherited by the child and
// closed when the subprocess finishes, which then notifies ninja.
Here is a table of the file descriptors of the parent process before the posix_spawn() and of the subprocesses in the two different modes:
Relying on EOF from a pipe given to a subprocess in a non-standard fd without its knowledge is not a very orthodox way of detecting when a subprocess has exited, and it can lead to strange situations like the one here with sccache. The same problem can occur if any subprocess (or subprocess of the subprocess etc.) tries to spawn any daemon and its code doesn't take the extra care to use something like Normally, instead of this you would rely on something like the SIGCHLD signal (which would still make ppoll() return), or pidfd if restricted to recent Linux.
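The hang described above can be reproduced in isolation. The following is a minimal sketch, not ninja's code: it uses Python's thin wrappers over pipe(), fork() and poll(), and the function name `eof_delay_demo` and its parameter are made up for illustration. A direct child (standing in for the compiler wrapper) exits immediately, but a grandchild (standing in for a daemon such as the sccache server) inherited the write end, so the parent sees no EOF until the grandchild exits too.

```python
import os
import select
import time


def eof_delay_demo(daemon_lifetime=1.0):
    """Return (seconds until direct child was reaped,
               seconds until EOF appeared on the pipe,
               bytes read at EOF)."""
    read_fd, write_fd = os.pipe()
    start = time.monotonic()

    pid = os.fork()
    if pid == 0:
        # Direct child: stands in for the compiler wrapper.
        os.close(read_fd)
        if os.fork() == 0:
            # Grandchild: stands in for a daemon that never closes the
            # inherited write end of the pipe.
            time.sleep(daemon_lifetime)
            os._exit(0)
        os._exit(0)  # the wrapper exits right away

    os.close(write_fd)   # the parent keeps only the read end
    os.waitpid(pid, 0)   # the direct child is already gone at this point
    child_reaped = time.monotonic() - start

    # Wait for EOF the way a poll()-based loop would. This blocks until
    # every copy of the write end is closed, i.e. until the grandchild exits.
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    poller.poll()
    eof_seen = time.monotonic() - start

    data = os.read(read_fd, 1)  # b"" signals EOF
    os.close(read_fd)
    return child_reaped, eof_seen, data
```

Calling `eof_delay_demo(1.0)` should report that the direct child was reaped almost immediately while EOF only showed up about a second later, which mirrors ninja sitting idle while a zombie waits to be reaped.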
Otherwise, when the compiler wrapper spawns the sccache server, the server may end up with unintended file descriptors, which can lead to unexpected problems.
This is particularly problematic if any of the file descriptors that accidentally end up in the server process is a pipe, as the pipe will only be closed when all the processes holding that file descriptor close it or exit.
This was causing sccache to hang ninja, as ninja uses the EOF of a pipe given to the subprocess to detect when that subprocess has exited: ninja-build/ninja#2444 (comment)
This patch adds a dependency on the [close_fds](https://crates.io/crates/close_fds) crate, which automatically chooses an appropriate mechanism to close a range of file descriptors. On Linux 5.9+ that mechanism will be libc::close_range().
Fixes mozilla#2313
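The effect of closing inherited descriptors before daemonizing can be sketched as follows. This is an illustration in Python rather than the Rust patch itself: `daemon_holds_pipe` is a hypothetical helper, and the fixed upper bound of 1024 passed to `os.closerange()` is a simplifying assumption (a real daemon would cover the whole descriptor range, for example via close_range()).

```python
import os
import select
import time


def daemon_holds_pipe(close_inherited_fds, daemon_lifetime=1.0, timeout=0.3):
    """Return True if no EOF is seen on the pipe within `timeout` seconds."""
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        if os.fork() == 0:
            # Grandchild: the would-be server process.
            if close_inherited_fds:
                # Point stdio at /dev/null, then drop every other fd.
                devnull = os.open(os.devnull, os.O_RDWR)
                for fd in (0, 1, 2):
                    os.dup2(devnull, fd)
                os.closerange(3, 1024)  # assumed bound, see lead-in
            time.sleep(daemon_lifetime)
            os._exit(0)
        os._exit(0)  # the wrapper exits immediately

    os.close(write_fd)
    os.waitpid(pid, 0)

    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    events = poller.poll(int(timeout * 1000))  # empty list on timeout
    os.close(read_fd)
    return not events
```

With `close_inherited_fds=False` the parent times out while the daemon is still alive; with `True` the write end disappears as soon as the daemon has cleaned up, so EOF arrives promptly.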
For background, see ninja-build#2444 (comment).
In short, when running subprocesses that share the terminal, ninja intentionally leaves a pipe open before exec() so that it can use EOF from that pipe to detect when the subprocess has exited.
That mechanism is problematic: if the subprocess ends up spawning background processes (e.g. sccache), those would also inherit the pipe by default. In that case, ninja may not detect process termination until all background processes have quit.
This patch makes it so that, instead of propagating the pipe file descriptor to the subprocess without its knowledge, ninja keeps both ends of the pipe to itself, and uses a SIGCHLD handler to close the write end of the pipe when the subprocess has truly exited.
During testing I found Subprocess::Finish() lacked EINTR retrying, which made ninja crash prematurely. This patch also fixes that.
Fixes ninja-build#2444
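The approach in this commit message can be sketched for a single subprocess as follows. This is an assumption-laden illustration in Python, not the actual ninja patch: `wait_via_sigchld` is a hypothetical name, the child closing both pipe ends stands in for not passing the pipe to the subprocess, and a real implementation tracking several subprocesses would need to work out which child exited before closing anything.

```python
import os
import select
import signal
import time


def wait_via_sigchld(daemon_lifetime=1.0, timeout=5.0):
    """Return (exit_detected, seconds until the parent noticed the exit)."""
    read_fd, write_fd = os.pipe()  # both ends stay in the parent
    pending = [write_fd]

    def on_sigchld(signum, frame):
        # Close the write end once; that produces EOF on read_fd.
        if pending:
            os.close(pending.pop())

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        start = time.monotonic()
        pid = os.fork()
        if pid == 0:
            # The child gets neither end of the pipe.
            os.close(read_fd)
            os.close(write_fd)
            if os.fork() == 0:
                time.sleep(daemon_lifetime)  # lingering background process
                os._exit(0)
            os._exit(0)

        poller = select.poll()
        poller.register(read_fd, select.POLLIN)
        # If SIGCHLD interrupts poll(), Python runs the handler and retries,
        # at which point the closed write end shows up as a hang-up event.
        events = poller.poll(int(timeout * 1000))
        elapsed = time.monotonic() - start
        os.waitpid(pid, 0)  # reap the direct child
        return bool(events), elapsed
    finally:
        # Restore the earlier handler (fall back to the default if unknown).
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
        os.close(read_fd)
        if pending:
            os.close(pending.pop())
```

Here the background process never holds the pipe, so the parent is notified as soon as the direct child exits, regardless of how long the background process lives.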
Ninja version: 1.12.0
OS: Arch Linux
Whenever I run ninja to build a project, it gets stuck at "Cleaning... X files" for a long period of time (sometimes up to 5 minutes), even when there are no files to clean. While it is stuck, there is 0 CPU usage, so it's not doing anything.
htop shows that a spawned process is in a zombie state waiting to be reaped:
Here is a related meson report: mesonbuild/meson#10645
I appreciate any help debugging this further.
Thanks a lot.