Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate supbrocess' exit codes to the ninja exit code #2540

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Felixoid
Copy link

@Felixoid Felixoid commented Dec 3, 2024

Now, ninjas could exit with codes 0, 1, and 2.

With these codes, it's nearly impossible to get what went wrong in the subcommands.

This patch propagates the subcommand's exit code to the ninja's exit code.

It fixes #1123, fixes #1507, and technically substitutes the #1805

@Felixoid Felixoid force-pushed the master branch 2 times, most recently from 64c6798 to 91bffbe Compare December 3, 2024 15:09
@Felixoid
Copy link
Author

Felixoid commented Dec 3, 2024

The reason why do we need it:

In ClickHouse, we use prlimit to identify when some binary becomes too big, but it fails in the entirely same way as a standard failing build.

And here how the prlimit is able to fail in different cases:

$ prlimit --data=20000000 --cpu=2 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers LIMIT 100000000) GROUP BY number' ; echo $?
Code: 173. DB::ErrnoException: Allocator: Cannot malloc 1.00 MiB.: , errno: 12, strerror: Cannot allocate memory. (CANNOT_ALLOCATE_MEMORY)
173
$ prlimit --data=15000000 --cpu=2 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers LIMIT 100000000) GROUP BY number' ; echo $?
Segmentation fault (core dumped)
139
$ prlimit --cpu=2 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers LIMIT 100000000) GROUP BY number' ; echo $?
Killed
137

@Felixoid Felixoid marked this pull request as draft December 3, 2024 16:36
@Felixoid Felixoid marked this pull request as ready for review December 3, 2024 18:08
@Felixoid
Copy link
Author

Felixoid commented Dec 3, 2024

After the last push, it works like this:

> ./ninja test-exit; echo $?
[1/1] exit 155
FAILED: test-exit
exit 155
ninja: build stopped: subcommand failed.
155
> ./ninja test-exit; echo $?
[1/1] exit 1
FAILED: test-exit
exit 1
ninja: build stopped: subcommand failed.
1
> ./ninja test-exit; echo $?
[0/1] exit 2
ninja: build stopped: interrupted by user.
130

@Felixoid
Copy link
Author

Felixoid commented Dec 3, 2024

The latest commit adds the exit code to the FAILED log line:

> ./ninja prlimit-kill prlimit-allocate prlimit-sigv ; echo $?
[1/3] prlimit --data=1 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers GROUP BY number LIMIT 100000000000) GROUP BY number'
FAILED[code 139] prlimit-sigv
prlimit --data=1 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers GROUP BY number LIMIT 100000000000) GROUP BY number'
[2/3] prlimit --data=9999999 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers GROUP BY number LIMIT 100000000000) GROUP BY number'
FAILED[code 139] prlimit-allocate
prlimit --data=9999999 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers GROUP BY number LIMIT 100000000000) GROUP BY number'
[3/3] prlimit --cpu=1 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers GROUP BY number LIMIT 100000000000) GROUP BY number'
FAILED[code 137] prlimit-kill
prlimit --cpu=1 clickhouse-local -q 'SELECT * FROM (SELECT * FROM system.numbers GROUP BY number LIMIT 100000000000) GROUP BY number'
ninja: build stopped: subcommand failed.
137

@Felixoid
Copy link
Author

Felixoid commented Dec 4, 2024

Dear @jhasse and @digit-google, could you please leave a comment on whether it makes sense or if there is something left to improve before the merge?

@digit-google
Copy link
Contributor

Thank you for your contribution. I am very busy this week, but I'll take a look the next one. Sorry about that.

@digit-google
Copy link
Contributor

Thank you very much for you PR. This feature will be useful.

I think that keeping a dedicated type for the exit status will help keeping the code easier to read and understand. Besides, native exit codes on Win32 are unsigned long, even though the exit() function invoked from Ninja still takes an int. I suggest to achieve this by:

  1. Using C++11 enum sizing with platform-specific types, as in:
// from exit_status.h

// The underlying type of the ExitStatus enum, used to represent a platform-specific
// process exit code.
#ifdef _WIN32
#define EXIT_STATUS_TYPE unsigned long
#else  // !_WIN32
#define EXIT_STATUS_TYPE int
#endif  // !_WIN32

enum ExitStatus : EXIT_STATUS_TYPE {
  ExitSuccess = 0,
  ExitFailure,
  ExitInterrupted,
};
  1. Modify Builder::Build() to return an ExitStatus value instead of a boolean.

Also, on Posix, ensure that an interrupted build returns static_cast<ExitStatus>(130) directly, to get rid of the hack on line 1580 of ninja.cc to do that.

  1. Only cast the ExitStatus to an int when needed, i.e. when calling exit() from ninja.cc. This means NinjaMain::RunBuild() would also return an ExitStatus value instead of an int.

Apart from that, can I ask you to rebase your commits to remove temporary fixes? Nobody really needs this information but you, and it makes reviews, and later looking at the git history, difficult. Ideally tests should be introduced with the modifications they are related to. In this specific case, I would recommend one commit to propagate the status (+tests), then another commit to change the printed status line in case of failure to include the code.

Finally add a regression test in output_test.py for each commit's changes. I.e. one that verifies the value returned by Ninja when invoking a command that fails voluntarily, then another one that checks the text printed by Ninja in case of failure to verify it includes the right code.

There is a tiny chance that the status line change is going to break existing parsing scripts. I recommend you try to keep the original FAILED: sub-string in the output in case of failure to minimize that. I.e. print FAILED: [code=<value>] instead of FAILED[code=<value>]:. @jhasse what do you think about this specific problem?

@Felixoid
Copy link
Author

Felixoid commented Dec 9, 2024

Hi @digit-google, thanks for the review!

I think the only unclear thing to me is how to use ExitStatus for other codes than are defined there. I think, you mean using EXIT_STATUS_TYPE in all the places? Than, looks good to me.

I'll apply all the fixes

@digit-google
Copy link
Contributor

Can you clarify what you mean. It should be straightforward to just use the ExitStatus type for return values and local/member variables types. Use a static cast when you need to convert an int or DWORD into an ExitStatus in the subprocess Finish() functions, and that should be enough. Wdyt?

@Felixoid
Copy link
Author

Felixoid commented Dec 9, 2024

Just in case I am not a real C++ dev 👉👈, I could miss the point.
The idea is to exit Ninja with one of the exit codes of the subprocesses. So, return static_cast<ExitStatus>(42); should still work, right?

@digit-google
Copy link
Contributor

It's the other way around, lines 1804-1807 of ninja.cc look like this:

    int result = ninja.RunBuild(argc, argv, status);
    if (g_metrics)
      ninja.DumpMetrics();
    exit(result);

What I am suggesting is having NinjaMain::RunBuild() return an ExitStatus value, then do:

    ExitStatus status = ninja.RunBuild(argc, argv, status);
    if (g_metrics)
      ninja.DumpMetrics();
    exit(static_cast<int>(status));

@Felixoid
Copy link
Author

Felixoid commented Dec 9, 2024

but Finish() should as well return the same code it received from the subprocess, so it will be passed from there to ninja.RunBuild?

-int Subprocess::Finish() {
+ExitStatus Subprocess::Finish() {
   assert(pid_ != -1);
   int status;
   if (waitpid(pid_, &status, 0) < 0)
@@ -166,7 +168,7 @@ int Subprocess::Finish() {

   if (WIFEXITED(status)) {
     // propagate the status transparently
-    return WEXITSTATUS(status);
+    return static_cast<ExitStatus>(WEXITSTATUS(status));
   }
   if (WIFSIGNALED(status)) {
     // Overwrite interrupts to exit code 2
@@ -175,7 +177,7 @@ int Subprocess::Finish() {
       return ExitInterrupted;
   }
   // At this point, we exit with any other signal+128
-  return status | 0x80;
+  return static_cast<ExitStatus>(status | 0x80);
 }

@Felixoid Felixoid force-pushed the master branch 7 times, most recently from 24bca44 to 5579fc5 Compare December 9, 2024 23:36
@Felixoid
Copy link
Author

Felixoid commented Dec 9, 2024

I think all points are addressed:

  • ExitStatus is used everywhere to pass the exit codes
  • output_test.py has tests for both logging and exiting

Besides, I updated all the tests depending on Builder::Build, so they are also aware of the ExitStatus return.

src/build.h Outdated Show resolved Hide resolved
src/exit_status.h Outdated Show resolved Hide resolved
misc/output_test.py Outdated Show resolved Hide resolved
misc/output_test.py Outdated Show resolved Hide resolved
@digit-google
Copy link
Contributor

Thanks very much for the latest commits. I added a few review comments, please address them, but this is looking very good so far. Now it's up to @jhasse to add his feedback and approve the PR.

misc/output_test.py Outdated Show resolved Hide resolved
@Felixoid Felixoid force-pushed the master branch 2 times, most recently from c8dfe33 to cc694a7 Compare December 10, 2024 14:48
@digit-google
Copy link
Contributor

Thank you very much, latest version seems perfect to me.

Copy link
Collaborator

@jhasse jhasse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

From my experience you're entering a world of pain as soon as you want to do anything with exit codes outside of the range 0 to 254. Therefore I'm not sure if Ninja should support it, even on Windows.

I will think about it.

misc/output_test.py Outdated Show resolved Hide resolved
misc/output_test.py Outdated Show resolved Hide resolved
misc/output_test.py Outdated Show resolved Hide resolved
misc/output_test.py Outdated Show resolved Hide resolved
misc/output_test.py Outdated Show resolved Hide resolved
src/subprocess-posix.cc Outdated Show resolved Hide resolved
src/subprocess-posix.cc Outdated Show resolved Hide resolved
src/subprocess-posix.cc Outdated Show resolved Hide resolved
src/subprocess_test.cc Outdated Show resolved Hide resolved
src/subprocess_test.cc Outdated Show resolved Hide resolved
@Felixoid
Copy link
Author

Addressed most of the questions and made review-related changes as fixups to track the progress.

@Felixoid Felixoid requested a review from jhasse December 11, 2024 10:04
@Felixoid
Copy link
Author

Felixoid commented Dec 13, 2024

Can I kindly get another review round? If the things are addressed, I'd rebase everything into proper commits.

@Felixoid
Copy link
Author

I've rebased all the changes, let me know if something here is still blocker to proceed 🙏

Copy link
Collaborator

@jhasse jhasse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nitpicks, besides that it LGTM :)

misc/output_test.py Outdated Show resolved Hide resolved
misc/output_test.py Show resolved Hide resolved
src/build.cc Outdated Show resolved Hide resolved
src/build.cc Outdated Show resolved Hide resolved
src/subprocess_test.cc Outdated Show resolved Hide resolved
@Felixoid
Copy link
Author

I reverted all things in output_test.py unrelated to the test itself, and will implement automatic checks for them separately.

The rest of the comments are addressed, including _WIN32

@Felixoid Felixoid requested a review from jhasse December 28, 2024 12:31
Make all the methods return ExitStatus:
  Subprocess::Finish->
  CommandRunner::Result.status->
  Builder::Build->
  NinjaMain::RunBuild

Ninja now return different exit codes for Win32 and Posix systems:
  - Win32 -> 2
  - Posix -> 130 (128+2)
@jhasse jhasse added this to the 1.13.0 milestone Dec 30, 2024
@jhasse
Copy link
Collaborator

jhasse commented Dec 30, 2024

Thanks! No more objections from my side, will wait a few days to see if anyone else has comments.

@digit-google
Copy link
Contributor

Thanks and happy new year :) The latest commits look fine to me as well. Great work.

@Felixoid
Copy link
Author

Felixoid commented Jan 6, 2025

Thanks! Happy New Year, too!

So, does it look like good to go?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants