Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setters can cause failure to roundtrip #651

Closed
TimothyGu opened this issue Sep 24, 2021 · 15 comments · Fixed by #728
Closed

Setters can cause failure to roundtrip #651

TimothyGu opened this issue Sep 24, 2021 · 15 comments · Fixed by #728

Comments

@TimothyGu
Copy link
Member

TimothyGu commented Sep 24, 2021

Generally, we assume setters will keep the URL valid and roundtrippable. However, this turns out not to be true in a specific case:

u = new URL('data:text/html,hello #hash');
u.hash = '';
console.log(JSON.stringify(u.href)); // "data:text/html,hello "

u2 = new URL(u); // constructor strips leading/trailing spaces as a first step
console.log(JSON.stringify(u2.href)); // "data:text/html,hello" without the trailing space

This is only an issue because we allow spaces to appear verbatim in cannot-be-a-base URL paths.

(first discovered in https://crbug.com/291747)

@TimothyGu
Copy link
Member Author

TimothyGu commented Sep 24, 2021

A few options here:

  1. Concede that setters may make URLs non-roundtrippable, and be okay with that.
  2. In a setter, if the particular component is completely removed, check if the serialization ends with a space, and trim the path if needed.
  3. Don't allow trailing spaces in cannot-be-a-base URL paths in the initial parse, so parsing data:text/html,hello #hash would result in data:text/html,hello#hash.
  4. Disallow hash and search setters on cannot-be-a-base URLs, like how path is already a no-op.

Option 1 is probably the most web-compatible for the most content, but leaves a weird edge case and a sour taste in my mouth. Option 2 can be a bit surprising to the user (the pathname can change even when only hash is being set), but probably also web-compatible too. Option 3 is the most theoretically pleasing, but I would not be surprised if some web content depends on data:text/html,hello ? world being rendered as hello ? world. (After the change, it would be rendered instead as hello? world.) Option 4 is perhaps too radical to fix such an edge case.

I'm leaning towards option 2.

@annevk
Copy link
Member

annevk commented Sep 27, 2021

  1. We could also say that in such cases the mapping of empty string to null does not happen. I.e., we would preserve # or ? in the serialized URL.
  2. We percent-encode the final code point of the path or query as needed.

I think I like these a bit better than 2. 4 would be problematic for fragment navigation, which is definitely a thing in data: URLs. 3 seems a little surprising. Agreed on 1.

@TimothyGu
Copy link
Member Author

I think I prefer 2 over 5. With 2, users have more options: if they want to keep the space and the delimiter, then they can explicitly opt into it by setting .hash = '#'. With 5, there's no way to remove the # through setters alone.

6 seems a little surprising – we never only percent-encode one character. Chrome is also already running into some trouble trying to percent-encode more characters in non-special URLs (https://crbug.com/1252739), so ideally I'd like to keep the path the way it is.

@karwa
Copy link
Contributor

karwa commented Sep 28, 2021

  1. Start percent-encoding spaces in the paths of cannot-be-a-base URLs

I can't think of anywhere else where naked spaces are allowed (not in the scheme, query, fragment, userinfo, not in domains, not in opaque hostnames, etc) and RFC-3986 also doesn't allow them. Are there use-cases which absolutely depend on non-encoded spaces being preserved verbatim in CBAB URL paths?

@TimothyGu
Copy link
Member Author

TimothyGu commented Sep 28, 2021

@karwa The ability to percent-encode unusual characters in CBAB paths would be the ideal case. Sadly, I doubt that will be web-compatible. In fact, I think it'll be difficult just to encode spaces in hierarchical non-special URL paths (in Chrome and Firefox that is – Safari already does that).

@annevk
Copy link
Member

annevk commented Oct 21, 2021

@achristensen07 @valenting thoughts? I could live with 2.

@valenting
Copy link
Collaborator

I think that no.2 is the least likely to cause breakage, so that's a ➕ for me.

@alwinb
Copy link
Contributor

alwinb commented Nov 22, 2021

Generally, we assume setters will keep the URL valid

Not particularly relevant to the discussion here, but be aware that setters do not preserve validity.
You can get a \ into a special URL's path or a # into a fragment, to name a few. See #379.

@TimothyGu
Copy link
Member Author

@alwinb I think we are using different definitions of "valid" here. When I say "setters will keep the URL valid" I don't mean "valid URL string" as defined in the spec, but rather "roundtrippable". If you have other examples of where setters make the URL non-roundtrippable, I'd certainly like to hear it.

@karwa
Copy link
Contributor

karwa commented Nov 22, 2021

Setters being round-trippable is nice. I'd like to add that I think setters should preserve the structure of the URL component, so that urlA.component = urlB.component should strive to not change the component's meaning.

One issue I've been wondering about bringing up is that non-special paths can have unescaped backslashes, but if you go through the setter, the URL parser will then interpret those backslashes as path separators and normalise them to forwardslashes. This violates the urlA.component = urlB.component idea if the path is set on a special URL (urlA is special, urlB is not special). An example from my library:

// Backslashes are not percent-encoded in non-special URLs,
// so copying the path to a special URL can cause it to be interpreted differently.

var specialURL = WebURL("https://example.com/")!
specialURL.serialized()  // "https://example.com/"
specialURL.path,         // "/"
specialURL.pathComponents.count // 1

let nonSpecialURL = WebURL(#"foo://example.com/baz\qux"#)!
nonSpecialURL.serialized()  // "foo://example.com/baz\qux"
nonSpecialURL.path          // "/baz\qux"
nonSpecialURL.pathComponents.count. // 1

specialURL.path = nonSpecialURL.path

specialURL.serialized()  // "https://example.com/baz/qux" (!)
specialURL.path          // "/baz/qux" (!)
specialURL.pathComponents.count  // 2 (!)

We could get around this if the URL parser percent-encoded each component in list-style paths with a "path component" encode-set, which would consist of the path set + backslash. Currently the path set does not include path separators themselves.

@TimothyGu
Copy link
Member Author

TimothyGu commented Nov 23, 2021

@karwa Escaping \ in non-special paths sounds like a reasonable idea, but I'm not sure about the web compatibility. Browsers are consistent in not escaping \, even though RFC 3986 disallows \ in paths and several non-browser implementations (e.g., Go) do escape it.

Either way it should probably be in a separate ticket. (Edit: created #675 to track this.)

@achristensen07
Copy link
Collaborator

I like 2, which is effectively what Safari already does and we haven't seen any compatibility issues from it.

@TimothyGu
Copy link
Member Author

@achristensen07 Oh interesting! Didn't realize Safari already fixed this in some way. Sounds like we have alignment on 2 then.

annevk added a commit that referenced this issue Dec 16, 2022
As opaque paths can end in U+0020, those trailing U+0020 code points need to be removed from the path when both query and fragment become null.

Tests: ...

Fixes #651.
@annevk
Copy link
Member

annevk commented Dec 16, 2022

I got around to creating tests for this as well as a PR: #728. Safari does indeed seem to do this correctly (although there are some unrelated pathname setter issues I found while testing).

I noted Gecko down as supportive per @valenting's comment above.

@annevk
Copy link
Member

annevk commented Dec 22, 2022

Those pathname setter issues were not quite unrelated as @domenic noticed during review (thanks!). I mistakenly thought url.pathname = "blah " would already result in "blah" (trailing space removed), but that is not the case.

It turns out that you cannot use pathname with opaque paths. And when the path is not opaque percent-encoding of U+0020 happens, though that is not what WebKit currently implements. (And other browsers are generally poor for non-special URLs. Hopefully that'll improve this year!)

Tests will be updated to reflect all this.

annevk added a commit that referenced this issue Jan 2, 2023
As opaque paths can end in U+0020, those trailing U+0020 code points need to be removed from the path when both query and fragment become null.

Tests: web-platform-tests/wpt#37556.

Fixes #651.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

6 participants