-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --incompatible_enforce_starlark_utf8
#24944
base: master
Are you sure you want to change the base?
Conversation
@tetromino @tjgq I asked both of your for a review as I would like to make sure we are all on the same page regarding the path forward. I'm not necessarily suggesting that we should flip this to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this in principle. We'd need to run a benchmark, but if there is no performance impact on the loading phase for a large build, I might suggest setting the flag to "warning" immediately.
I don't agree with the name of the flag, because Bazel accepts multiple kinds of data:
- Starlark files on disk (for example, BUILD, .bzl, .scl, MODULE.bazel, WORKSPACE)
- Strings from command-line flags and stdin (for example, list of build targets or the text of a bazel query program)
- rpc messages from remote services (remote cache, remote execution, plus lots of google-internal stuff)
- and I am probably forgetting some things
Since we are only enforcing encoding for (1) - and realistically, enforcing (3) would have to be done on a different schedule - I'd suggest renaming to --incompatible_enforce_starlark_utf8
or similar
Ran a benchmark. No measurable impact. Let's set this to "warning" immediately :) |
f53740b
to
9c25457
Compare
I added another commit that sets the default to |
There are proprietary remote services inside Google, owned by other teams, which Blaze (Google-internal flavor of Bazel) connects to, and I do not want to give users the impression that we are enforcing anything about those services' encoding. So I think we need a name which suggests that this is an encoding only for Starlark things. As for more sources of user input - we should probably enforce the same encoding rules for cquery's |
9c25457
to
7de83ec
Compare
Makes sense. I renamed the flag and dropped the option value checking for now (it resulted in test failures that may require larger cleanup due to tests running with a Latin-1 locale by default). |
There are still a number of failures that I can't explain. I will investigate this more. |
--incompatible_enforce_utf8
--incompatible_enforce_starlark_utf8
@tetromino Failures are fixed, I forgot to update the default in semantics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
@bazel-io fork 8.1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just before merging this, I realized that that we are only enforcing UTF-8 encoding for .bzl files - not for BUILD files, MODULE.bazel, WORKSPACE, etc. (which are all also starlark files, but loaded through different code paths).
I assume that was not by design, and you do want to enforce utf-8 for them also?
That's certainly not intentional. Could you explain why and what the other code paths are? |
All of these call ParserInput.fromLatin1, which is how I was able to identify them. The conclusion is that "fromLatin1" is wrongly named, and that encoding verification should take place in ParserInput. |
Thanks for catching this. Since An alternative would be to return a new type from |
From the perspective of the interaction between the Starlark engine and the application in which it is embedded, there are 3 separate questions to consider:
For Bazel, for (1) is utf-8 (historically latin-1), (2) is a 3-way choice between nop, warning, or fail, and (3) is a pseudo-String in reality containing a byte array in the same encoding as (1). Meanwhile, non-Bazel users of Starlark (such as Copybara) have UTF-8 for (1), replacement characters for (2), and ordinary UTF-16 java String values for (3). For representing this in code, I think that (2) should be handled outside the Starlark interpreter. There is too much skew between how Skyframe handles errors/warnings and how Starlark does (it doesn't depend on an event bus and I don't want it to). So in this PR, let's just add a build/lib/skyframe/SkyframeUtil.java with a helper method that produces a ParserInput and warns/fails as appropriate depending on the For (1) and (3), I want to rename ParserInput.fromLatin1 and add comments explaining the situation, as a separate follow-up PR. |
If enabled (or set to 'error'), fail if Starlark files are not UTF-8 encoded. If set to 'warning', emits a warning instead. Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible.
2c9c280
to
14720d6
Compare
I added a central helper function, please let me know whether that's what how you wanted it to look like. I can add more tests if needed after that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: rename to StarlarkUtil or similar (the name SkyframeUtil suggests that it's the one and only location for skyframe's helper functions, but there are already many other *Util and *Utils classes in build/lib/skyframe).
reporter.handle( | ||
Event.warn( | ||
Location.fromFile(file), | ||
"not a valid UTF-8 encoded file; this can lead to inconsistent behavior and" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: put the error message in a constant?
semantics.get(BuildLanguageOptions.INCOMPATIBLE_ENFORCE_STARLARK_UTF8), | ||
handler); | ||
if (input.isEmpty()) { | ||
return new CompiledBuildFile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CompiledBuildFile doesn't make sense here: we failed before even attempting to compile. There is no trace of an AST.
Instead, I'd throw a PackageFunctionException, something like
throw PackageFunctionException.builder()
.setType(PackageFunctionException.Type.BUILD_FILE_CONTAINS_ERRORS)
.setPackageIdentifier(packageId)
.setTransience(Transience.PERSISTENT)
.setException(...)
.setMessage("error reading " + inputFile.toString())
.setPackageLoadingCode(PackageLoading.Code.STARLARK_EVAL_ERROR)
.build();
} | ||
} | ||
case ERROR -> { | ||
if (!Utf8.isWellFormed(bytes)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning an empty optional, I'd recommend throwing an exception.
This would allow callers to easily put your exception's message into their own exceptions without needing to stream and stringify events (or they can just throw their own exception with your exception as the cause).
If enabled (or set to
error
), fail if Starlark files are not UTF-8 encoded. If set towarning
(the default), emits a warning instead.Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible.
Work towards #374