Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quoting_style: Add support for non-UTF-8 bytes #6882

Merged
merged 5 commits into from
Dec 21, 2024

Conversation

jtracey
Copy link
Contributor

@jtracey jtracey commented Nov 23, 2024

This adds support for non-UTF-8 bytes in the quoting_style library on Unix platforms. This is necessary for proper support of non-unicode inputs in a few utilities, including wc, ls, and printf (as of this PR, wc should be good, ls is in a much better state but will need some work to close the final gaps, and printf needs @andrewliebenow's #6812, which might conflict this this, but if so, should be a quick fix).

The first commit bumps the MSRV, because we need access to Utf8Chunks, since we need to operate on strings and non-unicode bytes in the same OsString (namely, we need to be able to tell if something is invalid unicode, or valid unicode but a control character, and apply the appropriate escaping). Avoiding that would require implementing or using another UTF-8 parser.

The third commit fixes a preexisting bug that was in some sense independent of this patch set (multi-byte control characters weren't being handled properly), but it touches the same code so I'm including it.

Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@sylvestre
Copy link
Contributor

is the increase of MSRV really necessary?

@jtracey
Copy link
Contributor Author

jtracey commented Nov 29, 2024

Assuming we don't want to use/implement a non-std UTF-8 parser, yes (see the second paragraph of the top post).

@sylvestre sylvestre force-pushed the quoting_style_bytes branch from b002ff2 to b34dd3b Compare December 2, 2024 09:00
@sylvestre
Copy link
Contributor

@cakebaker ok with the MSRV bump ?

@cakebaker
Copy link
Contributor

@sylvestre yes, that's fine for me.

@RenjiSann
Copy link
Contributor

I'm linking this PR to the issue I opened about it #6817
I support the bump of MSRV, as it grants us access to siice::utf8_chunks which is a mandatory piece we don't want to bother re-implementing ourselves

Lastly, you may want to take a look at my implementation here (which I refrained to open a PR for because of the MSRV bump) to see if we have some matching implementation details. I haven't checked your changes (i will when I get the time), but I think I remember my implementation was working, with 250 line changes to quoting_style.rs instead of your 650.

@jtracey
Copy link
Contributor Author

jtracey commented Dec 18, 2024

@RenjiSann

I'm linking this PR to the issue I opened about it #6817 I support the bump of MSRV, as it grants us access to siice::utf8_chunks which is a mandatory piece we don't want to bother re-implementing ourselves

Lastly, you may want to take a look at my implementation here (which I refrained to open a PR for because of the MSRV bump) to see if we have some matching implementation details.

Whoops, sorry! I checked for issues and PRs, not sure how I missed #6817; this ended up taking me a while, so I maybe started working on this before it was filed. In any case, I definitely should have found it before filing a PR.

I haven't checked your changes (i will when I get the time), but I think I remember my implementation was working, with 250 line changes to quoting_style.rs instead of your 650.

So it looks to me that your commit is analogous to my second commit in this patch series, 2f0072e. That commit is +401/-106, where over 200 of those new lines are additional unit tests, and another chunk is new comments on existing code, so comparable to yours in size. It looks like your commit doesn't implement returning literal bytes yet, basically getting things to the point where my ls usage is at with this PR.

I ran my unit test vectors with your branch, and as expected, your commit doesn't yet pass the "literal", "literal-show", "shell-show", or "shell-always-show" tests with non-unicode vectors, and fails the test vectors I add in e894e57. Other than that though, the tests do pass, so that's good news for the expected values I wrote. :)

As for the implementation, the strategies are pretty similar, if different in exact implementation: iterate over the UTF-8 chunks, keep the existing behavior if it's valid, do something smarter if it's not. No obvious sources of performance difference between the two versions, though I didn't run any benchmarks (your version does have dynamic dispatch in a potential hot path for long non-unicode names, but neither of us were especially careful about avoiding copies, and that path has a bunch of string construction anyway, so very unlikely to be noticeable).

Sorry again for not seeing the issue, but at least now we have two people proposing a similar approach!

@sylvestre
Copy link
Contributor

ok, too bad :(
so, could you please work together to find the best solution ? :)

Copy link
Contributor

@RenjiSann RenjiSann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your implementation is achieved while mine is only an unfinished draft. Aside from the few comments, everything looks good to me 👍

src/uucore/src/lib/features/quoting_style.rs Outdated Show resolved Hide resolved
src/uucore/src/lib/features/quoting_style.rs Outdated Show resolved Hide resolved
src/uucore/src/lib/features/quoting_style.rs Outdated Show resolved Hide resolved
src/uucore/src/lib/features/quoting_style.rs Outdated Show resolved Hide resolved
src/uu/ls/src/ls.rs Outdated Show resolved Hide resolved
src/uu/wc/src/wc.rs Outdated Show resolved Hide resolved
@jtracey jtracey force-pushed the quoting_style_bytes branch from b34dd3b to d91aac4 Compare December 18, 2024 19:04
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

This new functionality is implemented, but not yet exposed here.
This exposes the non-UTF-8 functionality to callers. Support in `argument`,
`spec`, and `wc` are implemented, as their usage is simple. A wrapper only
returning valid unicode is used in `ls`, since proper handling of OsStrings
there is more involved (outputs that escape non-unicode work now though).
@jtracey jtracey force-pushed the quoting_style_bytes branch from d91aac4 to 43229ae Compare December 18, 2024 20:28
@jtracey
Copy link
Contributor Author

jtracey commented Dec 18, 2024

(first force push is just a rebase on main, second is the suggested changes)

This adds the `os_str_as_bytes_lossy` function, for when we want infallible
conversion across platforms, and improves the doc comments of similar functions
to be more accurate and better formatted.
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

@jtracey
Copy link
Contributor Author

jtracey commented Dec 19, 2024

Assuming everyone is fine with the new commit, should be ready to go now.

@RenjiSann
Copy link
Contributor

I'm fine with all the changes 👌. I like the new docstrings of helper functions

@RenjiSann
Copy link
Contributor

ok, too bad :( so, could you please work together to find the best solution ? :)

@sylvestre It's good for both of us 👌

@sylvestre sylvestre merged commit bb2fb66 into uutils:main Dec 21, 2024
62 checks passed
@sylvestre
Copy link
Contributor

well done

@sylvestre
Copy link
Contributor

oh, i pressed "merge" too quickly
would it be possible to add some tests in test_ls.rs & test_wc.rs ?
thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants