Add `--incompatible_enforce_starlark_utf8` #24944

fmeum · 2025-01-16T12:34:54Z

If enabled (or set to error), Bazel fails if Starlark files are not UTF-8 encoded. If set to warning (the default), it emits a warning instead.

Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't change that; it only makes encoding failures more visible.

Work towards #374

fmeum · 2025-01-16T12:37:21Z

@tetromino @tjgq I asked both of you for a review as I would like to make sure we are all on the same page regarding the path forward. I'm not necessarily suggesting that we should flip this to error soon, but even at warning level this at least gives us a tool to measure the impact of an eventual flip.

tetromino

I agree with this in principle. We'd need to run a benchmark, but if there is no performance impact on the loading phase for a large build, I might suggest setting the flag to "warning" immediately.

I don't agree with the name of the flag, because Bazel accepts multiple kinds of data:

1. Starlark files on disk (for example, BUILD, .bzl, .scl, MODULE.bazel, WORKSPACE)
2. Strings from command-line flags and stdin (for example, list of build targets or the text of a bazel query program)
3. rpc messages from remote services (remote cache, remote execution, plus lots of google-internal stuff)
4. and I am probably forgetting some things

Since we are only enforcing encoding for (1) - and realistically, enforcing (3) would have to be done on a different schedule - I'd suggest renaming to --incompatible_enforce_starlark_utf8 or similar

tetromino · 2025-01-16T16:25:07Z

> We'd need to run a benchmark, but if there is no performance impact on the loading phase for a large build, I might suggest setting the flag to "warning" immediately.

Ran a benchmark. No measurable impact. Let's set this to "warning" immediately :)

fmeum · 2025-01-16T18:00:26Z

I added another commit that sets the default to warning and also checks option values. I can still change the name, but I can also expand it to cover more sources of user inputs.

tetromino · 2025-01-16T18:38:13Z

> I added another commit that sets the default to warning and also checks option values. I can still change the name, but I can also expand it to cover more sources of user inputs.

There are proprietary remote services inside Google, owned by other teams, which Blaze (Google-internal flavor of Bazel) connects to, and I do not want to give users the impression that we are enforcing anything about those services' encoding.

So I think we need a name which suggests that this is an encoding only for Starlark things.

As for more sources of user input - we should probably enforce the same encoding rules for cquery's --starlark:expr and --starlark:file (can be done in a follow-up pr)

fmeum · 2025-01-16T20:59:10Z

Makes sense. I renamed the flag and dropped the option value checking for now (it resulted in test failures that may require larger cleanup due to tests running with a Latin-1 locale by default).

fmeum · 2025-01-16T21:42:08Z

There are still a number of failures that I can't explain. I will investigate this more.

fmeum · 2025-01-17T09:02:36Z

@tetromino Failures are fixed, I forgot to update the default in semantics.

tetromino

Looks good, thanks!

fmeum · 2025-01-17T18:41:10Z

@bazel-io fork 8.1.0

tetromino

Just before merging this, I realized that we are only enforcing UTF-8 encoding for .bzl files - not for BUILD files, MODULE.bazel, WORKSPACE, etc. (which are all also Starlark files, but loaded through different code paths).

I assume that was not by design, and you do want to enforce utf-8 for them also?

fmeum · 2025-01-21T16:09:54Z

> Just before merging this, I realized that we are only enforcing UTF-8 encoding for .bzl files - not for BUILD files, MODULE.bazel, WORKSPACE, etc. (which are all also Starlark files, but loaded through different code paths).

That's certainly not intentional. Could you explain why and what the other code paths are?

tetromino · 2025-01-21T18:13:56Z

> That's certainly not intentional. Could you explain why and what the other code paths are?

.bzl and .scl files are compiled by BzlCompileFunction.computeInline - that is what you already changed in this PR
BUILD files are compiled by PackageFunction.compileBuildFile
MODULE.bazel files are compiled by CompiledModuleFile.parseAndCompile
REPO.bazel by RepoFileFunction.readAndParseRepoFile
resolved files by ResolvedFileFunction.compute
VENDOR.bazel by VendorFileFunction.readAndParseVendorFile
WORKSPACE by WorkspaceFileFunction.parseWorkspaceFile

All of these call ParserInput.fromLatin1, which is how I was able to identify them.

The conclusion is that "fromLatin1" is wrongly named, and that encoding verification should take place in ParserInput.

fmeum · 2025-01-21T18:39:09Z

Thanks for catching this.

Since ParserInput isn't configurable in any way, what do you think of expressing the enforcement check as a FileOptions passed to Lexer (and persisting in ParserInput whether a given instance has been obtained from fromLatin1)?

An alternative would be to return a new type from fromLatin1 that includes both ParserInput and the result of the UTF-8 check.

tetromino · 2025-01-21T20:28:18Z

From the perspective of the interaction between the Starlark engine and the application in which it is embedded, there are 3 separate questions to consider:

1. what is the encoding of the input file fed into the Starlark parser
2. what (if anything) should be done if the encoding is invalid
3. what is the encoding of string values in the AST output by the parser

For Bazel, (1) is UTF-8 (historically Latin-1), (2) is a 3-way choice between nop, warning, or fail, and (3) is a pseudo-String that in reality contains a byte array in the same encoding as (1).

Meanwhile, non-Bazel users of Starlark (such as Copybara) have UTF-8 for (1), replacement characters for (2), and ordinary UTF-16 java String values for (3).
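To make the difference between the two models in (1) and (3) concrete, here is a minimal sketch using only the JDK. The class and method names are hypothetical and do not correspond to anything in Bazel or Copybara; the sketch only illustrates what "a pseudo-String containing a byte array" and "replacement characters" mean in practice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not Bazel or Copybara code.
final class EncodingModelsSketch {
  /** Bazel-style: every input byte becomes exactly one char in the range 0..255. */
  static String toPseudoString(byte[] fileBytes) {
    return new String(fileBytes, StandardCharsets.ISO_8859_1);
  }

  /** Recovers the original bytes from a pseudo-String; lossless for any byte sequence. */
  static byte[] fromPseudoString(String pseudo) {
    return pseudo.getBytes(StandardCharsets.ISO_8859_1);
  }

  /** Copybara-style: decode as UTF-8; malformed input is replaced with U+FFFD. */
  static String toUnicodeString(byte[] fileBytes) {
    return new String(fileBytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] validUtf8 = {(byte) 0xC3, (byte) 0xA4}; // UTF-8 encoding of U+00E4
    byte[] latin1Only = {(byte) 0xE4};             // U+00E4 in Latin-1, malformed as UTF-8

    // The pseudo-String has one char per byte, the Unicode string one char per code point.
    System.out.println(toPseudoString(validUtf8).length());
    System.out.println(toUnicodeString(validUtf8).length());

    // The pseudo-String round-trips even for bytes that are not valid UTF-8.
    System.out.println(Arrays.equals(latin1Only, fromPseudoString(toPseudoString(latin1Only))));

    // The UTF-8 decode silently substitutes a replacement character instead.
    System.out.println(toUnicodeString(latin1Only).contains("\uFFFD"));
  }
}
```

The Latin-1 decode never fails and never loses information, which is why invalid UTF-8 can pass through unnoticed unless something checks for it explicitly.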

For representing this in code, I think that (2) should be handled outside the Starlark interpreter. There is too much skew between how Skyframe handles errors/warnings and how Starlark does (it doesn't depend on an event bus and I don't want it to).

So in this PR, let's just add a build/lib/skyframe/SkyframeUtil.java with a helper method that produces a ParserInput and warns/fails as appropriate depending on the --incompatible_enforce_starlark_utf8 flag. And use that method everywhere (except tests) in Bazel where we currently use fromLatin1.
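A rough sketch of what such a helper could look like is shown below. Everything in it is an assumption made for illustration: the `Mode` enum, the method names, and the `Consumer<String>` standing in for Bazel's event handler are hypothetical, and the JDK's strict `CharsetDecoder` is used for validation instead of the Guava `Utf8.isWellFormed` call that appears in the PR diff, so that the example stays self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical sketch; names and types do not match the actual Bazel helper.
final class Utf8EnforcementSketch {
  /** Stand-in for the three possible values of the flag. */
  enum Mode { OFF, WARNING, ERROR }

  static final class InvalidUtf8Exception extends Exception {
    InvalidUtf8Exception(String message) {
      super(message);
    }
  }

  /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
  static boolean isWellFormedUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  /**
   * Checks the encoding according to the mode, then returns the content as a
   * Latin-1 decoded pseudo-String (one char per byte), as described above.
   */
  static String readSource(String path, byte[] bytes, Mode mode, Consumer<String> warnings)
      throws InvalidUtf8Exception {
    if (mode != Mode.OFF && !isWellFormedUtf8(bytes)) {
      String message = path + ": not a valid UTF-8 encoded file";
      if (mode == Mode.ERROR) {
        throw new InvalidUtf8Exception(message);
      }
      warnings.accept(message); // WARNING: report and keep going
    }
    return new String(bytes, StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) throws InvalidUtf8Exception {
    byte[] latin1Bytes = {'x', ' ', '=', ' ', '"', (byte) 0xE4, '"'};
    // In WARNING mode the file is still read, and the problem is reported.
    readSource("pkg/defs.bzl", latin1Bytes, Mode.WARNING, System.err::println);
  }
}
```

Keeping the check in one place means each call site that currently uses fromLatin1 only has to pass the flag value and a way to report warnings.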

For (1) and (3), I want to rename ParserInput.fromLatin1 and add comments explaining the situation, as a separate follow-up PR.

If enabled (or set to 'error'), fail if Starlark files are not UTF-8 encoded. If set to 'warning', emits a warning instead. Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible.

fmeum · 2025-01-22T08:55:00Z

I added a central helper function, please let me know whether that's how you wanted it to look. I can add more tests if needed after that.

tetromino · 2025-01-22T21:54:41Z

src/main/java/com/google/devtools/build/lib/skyframe/SkyframeUtil.java

Nit: rename to StarlarkUtil or similar (the name SkyframeUtil suggests that it's the one and only location for skyframe's helper functions, but there are already many other *Util and *Utils classes in build/lib/skyframe).

tetromino · 2025-01-22T21:58:17Z

src/main/java/com/google/devtools/build/lib/skyframe/SkyframeUtil.java

+          reporter.handle(
+              Event.warn(
+                  Location.fromFile(file),
+                  "not a valid UTF-8 encoded file; this can lead to inconsistent behavior and"


Nit: put the error message in a constant?

tetromino · 2025-01-22T22:17:14Z

src/main/java/com/google/devtools/build/lib/skyframe/PackageFunction.java

+            semantics.get(BuildLanguageOptions.INCOMPATIBLE_ENFORCE_STARLARK_UTF8),
+            handler);
+    if (input.isEmpty()) {
+      return new CompiledBuildFile(


CompiledBuildFile doesn't make sense here: we failed before even attempting to compile. There is no trace of an AST.

Instead, I'd throw a PackageFunctionException, something like

throw PackageFunctionException.builder()
    .setType(PackageFunctionException.Type.BUILD_FILE_CONTAINS_ERRORS)
    .setPackageIdentifier(packageId)
    .setTransience(Transience.PERSISTENT)
    .setException(...)
    .setMessage("error reading " + inputFile.toString())
    .setPackageLoadingCode(PackageLoading.Code.STARLARK_EVAL_ERROR)
    .build();

tetromino · 2025-01-22T22:20:47Z

src/main/java/com/google/devtools/build/lib/skyframe/SkyframeUtil.java

+        }
+      }
+      case ERROR -> {
+        if (!Utf8.isWellFormed(bytes)) {


Instead of returning an empty optional, I'd recommend throwing an exception.

This would allow callers to easily put your exception's message into their own exceptions without needing to stream and stringify events (or they can just throw their own exception with your exception as the cause).
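The benefit for callers can be sketched as follows. All names here are hypothetical stand-ins (they are not the Bazel classes discussed in this review); the point is only that a thrown exception carries its message and can be chained as a cause, whereas an empty optional carries nothing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a caller wrapping the helper's exception.
final class ExceptionChainingSketch {
  /** Stand-in for the exception thrown by the shared encoding helper. */
  static final class InvalidEncodingException extends Exception {
    InvalidEncodingException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Stand-in for a caller-specific exception such as a package loading failure. */
  static final class PackageLoadException extends Exception {
    PackageLoadException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Stand-in helper: throws instead of returning an empty Optional. */
  static String decodeStrict(byte[] bytes) throws InvalidEncodingException {
    try {
      return StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes))
          .toString();
    } catch (CharacterCodingException e) {
      throw new InvalidEncodingException("not a valid UTF-8 encoded file", e);
    }
  }

  /** The caller reuses the helper's message and keeps it as the cause. */
  static String loadBuildFile(String path, byte[] bytes) throws PackageLoadException {
    try {
      return decodeStrict(bytes);
    } catch (InvalidEncodingException e) {
      throw new PackageLoadException("error reading " + path + ": " + e.getMessage(), e);
    }
  }
}
```

With this shape, no caller needs to inspect reported events to find out why the input was rejected.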

fmeum requested review from tetromino and tjgq January 16, 2025 12:35

github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Jan 16, 2025

tetromino requested changes Jan 16, 2025

fmeum force-pushed the 437-incompatible-enforce-utf8 branch from f53740b to 9c25457 Compare January 16, 2025 17:59

fmeum force-pushed the 437-incompatible-enforce-utf8 branch from 9c25457 to 7de83ec Compare January 16, 2025 20:58

fmeum requested a review from tetromino January 16, 2025 20:59

iancha1992 added the team-Loading-API BUILD file and macro processing: labels, package(), visibility, glob label Jan 16, 2025

fmeum changed the title ~~Add --incompatible_enforce_utf8~~ Add --incompatible_enforce_starlark_utf8 Jan 17, 2025

tetromino approved these changes Jan 17, 2025

bazel-io mentioned this pull request Jan 17, 2025

[8.1.0] Add --incompatible_enforce_starlark_utf8 #24956

Open

tetromino requested changes Jan 21, 2025

fmeum and others added 5 commits January 22, 2025 08:46

Rename to --incompatible_enforce_starlark_utf8

26ddb05

Fix tests

1cc5720

Minor formatting fixes and wording changes

9a4c9b5

Use helper function

14720d6

fmeum force-pushed the 437-incompatible-enforce-utf8 branch from 2c9c280 to 14720d6 Compare January 22, 2025 08:50

fmeum requested review from brandjon, lberki, Wyverald and meteorcloudy as code owners January 22, 2025 08:50

fmeum requested review from tetromino and removed request for Wyverald, brandjon, lberki, meteorcloudy and tjgq January 22, 2025 08:51

tetromino requested changes Jan 22, 2025
