Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --incompatible_enforce_starlark_utf8 #24944

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Jan 16, 2025

If enabled (or set to error), fail if Starlark files are not UTF-8 encoded. If set to warning (the default), emits a warning instead.

Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible.

Work towards #374

@fmeum fmeum requested review from tetromino and tjgq January 16, 2025 12:35
@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Jan 16, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 16, 2025

@tetromino @tjgq I asked both of your for a review as I would like to make sure we are all on the same page regarding the path forward. I'm not necessarily suggesting that we should flip this to error soon, but even at warning level this at least gives us a tool to measure the impact of an eventual flip.

Copy link
Contributor

@tetromino tetromino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this in principle. We'd need to run a benchmark, but if there is no performance impact on the loading phase for a large build, I might suggest setting the flag to "warning" immediately.

I don't agree with the name of the flag, because Bazel accepts multiple kinds of data:

  1. Starlark files on disk (for example, BUILD, .bzl, .scl, MODULE.bazel, WORKSPACE)
  2. Strings from command-line flags and stdin (for example, list of build targets or the text of a bazel query program)
  3. rpc messages from remote services (remote cache, remote execution, plus lots of google-internal stuff)
  4. and I am probably forgetting some things

Since we are only enforcing encoding for (1) - and realistically, enforcing (3) would have to be done on a different schedule - I'd suggest renaming to --incompatible_enforce_starlark_utf8 or similar

@tetromino
Copy link
Contributor

We'd need to run a benchmark, but if there is no performance impact on the loading phase for a large build, I might suggest setting the flag to "warning" immediately.

Ran a benchmark. No measurable impact. Let's set this to "warning" immediately :)

@fmeum fmeum force-pushed the 437-incompatible-enforce-utf8 branch from f53740b to 9c25457 Compare January 16, 2025 17:59
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 16, 2025

I added another commit that sets the default to warning and also checks option values. I can still change the name, but I can also expand it to cover more sources of user inputs.

@tetromino
Copy link
Contributor

I added another commit that sets the default to warning and also checks option values. I can still change the name, but I can also expand it to cover more sources of user inputs.

There are proprietary remote services inside Google, owned by other teams, which Blaze (Google-internal flavor of Bazel) connects to, and I do not want to give users the impression that we are enforcing anything about those services' encoding.

So I think we need a name which suggests that this is an encoding only for Starlark things.

As for more sources of user input - we should probably enforce the same encoding rules for cquery's --starlark:expr and --starlark:file (can be done in a follow-up pr)

@fmeum fmeum force-pushed the 437-incompatible-enforce-utf8 branch from 9c25457 to 7de83ec Compare January 16, 2025 20:58
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 16, 2025

Makes sense. I renamed the flag and dropped the option value checking for now (it resulted in test failures that may require larger cleanup due to tests running with a Latin-1 locale by default).

@fmeum fmeum requested a review from tetromino January 16, 2025 20:59
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 16, 2025

There are still a number of failures that I can't explain. I will investigate this more.

@iancha1992 iancha1992 added the team-Loading-API BUILD file and macro processing: labels, package(), visibility, glob label Jan 16, 2025
@fmeum fmeum changed the title Add --incompatible_enforce_utf8 Add --incompatible_enforce_starlark_utf8 Jan 17, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 17, 2025

@tetromino Failures are fixed, I forgot to update the default in semantics.

Copy link
Contributor

@tetromino tetromino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@fmeum
Copy link
Collaborator Author

fmeum commented Jan 17, 2025

@bazel-io fork 8.1.0

Copy link
Contributor

@tetromino tetromino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just before merging this, I realized that that we are only enforcing UTF-8 encoding for .bzl files - not for BUILD files, MODULE.bazel, WORKSPACE, etc. (which are all also starlark files, but loaded through different code paths).

I assume that was not by design, and you do want to enforce utf-8 for them also?

@fmeum
Copy link
Collaborator Author

fmeum commented Jan 21, 2025

Just before merging this, I realized that that we are only enforcing UTF-8 encoding for .bzl files - not for BUILD files, MODULE.bazel, WORKSPACE, etc. (which are all also starlark files, but loaded through different code paths).

That's certainly not intentional. Could you explain why and what the other code paths are?

@tetromino
Copy link
Contributor

That's certainly not intentional. Could you explain why and what the other code paths are?

  • .bzl and .scl files are compiled by BzlCompileFunction.computeInline - that is what you already changed in this PR
  • BUILD files are compiled by PackageFunction.compileBuildFile
  • MODULE.bazel files are compiled by CompiledModuleFile.parseAndCompile
  • REPO.bazel by RepoFileFunction.readAndParseRepoFile
  • resolved files by ResolvedFileFunction.compute
  • VENDOR.bazel by VendorFileFunction.readAndParseVendorFile
  • WORKSPACE by WorkspaceFileFunciton.parseWorkspaceFile

All of these call ParserInput.fromLatin1, which is how I was able to identify them.

The conclusion is that "fromLatin1" is wrongly named, and that encoding verification should take place in ParserInput.

@fmeum
Copy link
Collaborator Author

fmeum commented Jan 21, 2025

Thanks for catching this.

Since ParserInput isn't configurable in any way, what do you think of making expressing the enforcement check as a FileOptions passed to Lexer (and persisting in ParserInput whether a given instance has been obtained from fromLatin1)?

An alternative would be to return a new type from fromLatin1 that includes both ParserInput and the result of the UTF-8 check.

@tetromino
Copy link
Contributor

tetromino commented Jan 21, 2025

From the perspective of the interaction between the Starlark engine and the application in which it is embedded, there are 3 separate questions to consider:

  1. what is the encoding of the input file fed into the Starlark parser
  2. what (if anything) should be done if the encoding is invalid
  3. what is the encoding of string values in the AST output by the parser

For Bazel, for (1) is utf-8 (historically latin-1), (2) is a 3-way choice between nop, warning, or fail, and (3) is a pseudo-String in reality containing a byte array in the same encoding as (1).

Meanwhile, non-Bazel users of Starlark (such as Copybara) have UTF-8 for (1), replacement characters for (2), and ordinary UTF-16 java String values for (3).


For representing this in code, I think that (2) should be handled outside the Starlark interpreter. There is too much skew between how Skyframe handles errors/warnings and how Starlark does (it doesn't depend on an event bus and I don't want it to).

So in this PR, let's just add a build/lib/skyframe/SkyframeUtil.java with a helper method that produces a ParserInput and warns/fails as appropriate depending on the --incompatible_enforce_starlark_utf8 flag. And use that method everywhere (except tests) in Bazel where we currently use fromLatin1.

For (1) and (3), I want to rename ParserInput.fromLatin1 and add comments explaining the situation, as a separate follow-up PR.

fmeum and others added 5 commits January 22, 2025 08:46
If enabled (or set to 'error'), fail if Starlark files are not UTF-8 encoded. If set to 'warning', emits a warning instead.

Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible.
@fmeum fmeum force-pushed the 437-incompatible-enforce-utf8 branch from 2c9c280 to 14720d6 Compare January 22, 2025 08:50
@fmeum fmeum requested review from tetromino and removed request for Wyverald, brandjon, lberki, meteorcloudy and tjgq January 22, 2025 08:51
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 22, 2025

I added a central helper function, please let me know whether that's what how you wanted it to look like. I can add more tests if needed after that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rename to StarlarkUtil or similar (the name SkyframeUtil suggests that it's the one and only location for skyframe's helper functions, but there are already many other *Util and *Utils classes in build/lib/skyframe).

reporter.handle(
Event.warn(
Location.fromFile(file),
"not a valid UTF-8 encoded file; this can lead to inconsistent behavior and"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: put the error message in a constant?

semantics.get(BuildLanguageOptions.INCOMPATIBLE_ENFORCE_STARLARK_UTF8),
handler);
if (input.isEmpty()) {
return new CompiledBuildFile(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompiledBuildFile doesn't make sense here: we failed before even attempting to compile. There is no trace of an AST.

Instead, I'd throw a PackageFunctionException, something like

    throw PackageFunctionException.builder()
        .setType(PackageFunctionException.Type.BUILD_FILE_CONTAINS_ERRORS)
        .setPackageIdentifier(packageId)
        .setTransience(Transience.PERSISTENT)
        .setException(...)
        .setMessage("error reading " + inputFile.toString())
        .setPackageLoadingCode(PackageLoading.Code.STARLARK_EVAL_ERROR)
        .build();

}
}
case ERROR -> {
if (!Utf8.isWellFormed(bytes)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning an empty optional, I'd recommend throwing an exception.

This would allow callers to easily put your exception's message into their own exceptions without needing to stream and stringify events (or they can just throw their own exception with your exception as the cause).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Loading-API BUILD file and macro processing: labels, package(), visibility, glob
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants