
Stuck at "Cleaning... X files" for long periods of time #2444

Open
user062 opened this issue May 7, 2024 · 5 comments · May be fixed by #2550

Comments

@user062

user062 commented May 7, 2024

Ninja version: 1.12.0
OS: Arch Linux

Whenever I run ninja to build a project, it gets stuck at Cleaning... X files for a long period of time (sometimes up to 5 minutes) even when there are no files to clean.
While it is stuck, there is 0 CPU usage, so it's not doing anything.

htop shows that the spawned process is in a zombie state waiting to be reaped:
(screenshot: ninja_meson_problem)

Here is a related meson report: mesonbuild/meson#10645

I appreciate any help debugging this further.

Thanks a lot.

@scivision
Contributor

Could you give an example project where this happens, e.g. with CMake or Meson?

@user062
Author

user062 commented May 17, 2024

@scivision, this happens when building GIMP.

I investigated a bit. I found that it's waiting on this ppoll call:
subprocess-posix.cc:267

@ntrrgc

ntrrgc commented Jan 8, 2025

This is occasionally also hitting GStreamer (also using meson).

Attaching a debugger, I could confirm ninja was waiting inside ppoll() for events on a pipe (the only pipe open in that process, presumably the one given to meson: fd 4, listening for POLLIN | POLLPRI).

The weird thing is that doing lsof -n -P | grep <pipe inode> only returned the ninja process, meaning that the other end of the pipe had already been closed. meson was a zombie (<defunct>) process at that point.

When I detached the debugger ninja unstuck itself.

There is still the possibility that meson shared that pipe with another process I didn't catch, and that that process terminated between the time I attached the debugger and the time I inspected which processes had the same pipe open. Otherwise this would hint at a kernel bug, because you would expect ppoll() to return and notify POLLHUP once the other end of the pipe is closed and all data has been read (and if not all data has been read, ppoll() should still return with a POLLIN event while data is available).

@ntrrgc

ntrrgc commented Jan 8, 2025

I managed to reproduce the bug a second time, and this time I caught which process was keeping the write side of the pipe open.

$ ls -l /proc/801351/fd
total 0
lrwx------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 0 -> /dev/pts/1
lrwx------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 1 -> /dev/pts/1
lrwx------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 2 -> /dev/pts/1
lr-x------ 1 ntrrgc ntrrgc 64 Jan  8 14:58 3 -> 'pipe:[2292603]'
$ lsof -n -P |grep 2292603
ninja     801351                           ntrrgc    3r     FIFO               0,15       0t0   2292603 pipe
sccache   801474                           ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801475 sccache            ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801476 tokio-run          ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801477 tokio-run          ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
sccache   801474 801478 tokio-run          ntrrgc    4w     FIFO               0,15       0t0   2292603 pipe
...

sccache is a compiler wrapper for cached compilation artifacts and remote compilation. It replaces your compiler with a wrapper, which is likely how it got executed by meson, and that wrapper spawns an sccache server if none is running. Quoting its README:

The sccache command will spawn a server process if one is not already running

Also:

You can run sccache --stop-server to terminate the server. It will also terminate after (by default) 10 minutes of inactivity.

Which explains why ninja got unstuck after a rather long time: the sccache server exited due to inactivity.

What I haven't figured out yet is how sccache ends up holding on to the pipe. Generally, when you daemonize a subprocess you want to close all file descriptors you don't need, and you can see in the utils::daemonize() function of sccache that it does try to at least close stdin, stdout and stderr and replace them with /dev/null. However, that by itself doesn't take care of higher file descriptors, and we can see in the previous output that the pipe existed on fd=4.

@ntrrgc

ntrrgc commented Jan 8, 2025

I had a look at how ninja creates subprocesses in POSIX systems.

It has two different modes, depending on whether the console is shared with the subprocess or not.

If the console is not shared with the subprocess, a pipe is created and given to the subprocess as both stdout and stderr. Care is also taken to close the original file descriptors of both ends of the pipe. This all makes sense.

However, if the console is shared with the subprocess, the pipe is still created and intentionally left open in the subprocess, with the intention of using it to detect when the subprocess has exited (!).

// In the console case, output_pipe is still inherited by the child and
// closed when the subprocess finishes, which then notifies ninja.

Here is a table of the file descriptors, in the parent process before the posix_spawn() and in the subprocess in the two different modes:

| fd | parent process (ninja) | subprocess (no console) | subprocess (shared console) |
|----|------------------------|-------------------------|-----------------------------|
| 0 | stdin | /dev/null | stdin |
| 1 | stdout | write side of pipe | stdout |
| 2 | stderr | write side of pipe | stderr |
| 3 | read side of pipe (with CLOEXEC) | closed | closed |
| 4 | write side of pipe (closed after spawn) | closed | write side of pipe |

Relying on EOF from a pipe given to a subprocess on a non-standard fd, without its knowledge, is not a very orthodox way of detecting that a subprocess has exited, and it can lead to strange situations like the one here with sccache.

The same problem can occur if any subprocess (or a subprocess of the subprocess, etc.) spawns a daemon whose code doesn't take the extra care to use something like close_range() to close all file descriptors it doesn't know about. Conversely, if a foreground subprocess closed those inherited descriptors for whatever reason, ninja would wrongly be signalled as if the process had finished.

Normally, instead of this you would rely on something like the SIGCHLD signal (which would still make ppoll() return), or on pidfd if you can restrict yourself to recent Linux.

ntrrgc added a commit to ntrrgc/sccache that referenced this issue Jan 8, 2025
Otherwise, when the compiler wrapper spawns the sccache server, the
server may end up with unintended file descriptors, which can lead to
unexpected problems.

This is particularly problematic if any of those file descriptors that
accidentally end up in the server process is a pipe, as the pipe will
only be closed when all the processes with that file descriptor close it
or exit.

This was causing sccache to hang ninja, as ninja uses the EOF of a pipe
given to the subprocess to detect when that subprocess has exited:
ninja-build/ninja#2444 (comment)

This patch adds a dependency on the
[close_fds](https://crates.io/crates/close_fds) crate, which
automatically chooses an appropriate mechanism to close a range of file
descriptors. On Linux 5.9+ that mechanism will be libc::close_range().

Fixes mozilla#2313
ntrrgc added a commit to ntrrgc/sccache that referenced this issue Jan 9, 2025
ntrrgc added a commit to ntrrgc/ninja that referenced this issue Jan 10, 2025
For background, see
ninja-build#2444 (comment).

In short, when running subprocesses that share the terminal, ninja
intentionally leaves a pipe open before exec() so that it can use EOF
from that pipe to detect when the subprocess has exited.

That mechanism is problematic: if the subprocess ends up spawning
background processes (e.g. sccache), those also inherit the pipe by
default. In that case, ninja may not detect process termination until
all background processes have exited.

This patch makes it so that, instead of propagating the pipe file
descriptor to the subprocess without its knowledge, ninja keeps both
ends of the pipe to itself, and uses a SIGCHLD handler to close the
write end of the pipe when the subprocess has truly exited.

During testing I found Subprocess::Finish() lacked EINTR retrying, which
made ninja crash prematurely. This patch also fixes that.

Fixes ninja-build#2444
@ntrrgc ntrrgc linked a pull request Jan 10, 2025 that will close this issue
ntrrgc added a commit to ntrrgc/ninja that referenced this issue Jan 10, 2025