Consider switching to an Order Sorted Algebra model? #13

Closed
alwinb opened this issue Oct 7, 2021 · 10 comments

@alwinb
Owner

alwinb commented Oct 7, 2021

Currently URLs are modelled as sequences of tokens, where the token-values are percent-encoded.

In reurl, I have added a bit of information on the class that wraps around the URL: a flag that indicates if the token values are encoded or not.

I have been wondering for a while now whether it makes sense to add something similar to the specification. Not (just) for storing info about the percent-encoding, but as a way of distinguishing file, web, generic-hierarchical and generic-non-hierarchical URLs.

The disadvantage is that it makes things more complex. But the advantage may be that it makes things more clear, instead :)

This is motivated by one of my ideas for an API that supports relative URLs: one based on these four 'sorts'.
In the vast majority of situations, it will be possible to coerce between these sorts implicitly, by only changing the wrapper.

Any ideas about this, @zamfofex, @TimothyGu perhaps?


A sketch for an API  

It doesn't make much sense to allow relative non-hierarchical URLs, nor coercions between these and the other sorts.
Non-hierarchical URLs can be represented by a single class, say OpaqueURL.

For the other three there can be classes:

  • GenericReference
  • FileReference
  • WebReference

Then the class that is used will select the parser-mode. Unless you need parsing support for backslashes, or drive-letters, or wish to use domains instead of an opaque-host, you can just use a GenericReference.

All of them could be coerced (some possibly failing) to the existing URL class.
(The other option is to use one 'Href' class and store the sort-info within, and expose a number of coercion methods on the class).

On a theoretical level, this corresponds to switching from a single-sorted algebra of URLs, to an order-sorted model. OSA (Order Sorted Algebra) is perfect for this, and really beautiful too :)

I don't think it would be too hard to figure out the coercions.
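
To make the idea concrete, here is a minimal sketch of the three reference sorts as thin wrappers around a shared token sequence. Everything in it is an assumption for illustration: the class bodies, the `coerceTo` method and the rule that only file references may carry a drive token are not taken from reurl or the spec.

```javascript
// Hypothetical sketch; these classes do not exist in reurl or the spec.
class Reference {
  constructor(tokens) {
    this.tokens = tokens; // e.g. [['dir', 'foo'], ['file', 'bar']]
  }

  get mode() {
    return this.constructor.mode;
  }

  // Coercion only swaps the wrapper; the token sequence itself is reused.
  coerceTo(Target) {
    // Assumed rule: a drive token is only meaningful on a file reference.
    const hasDrive = this.tokens.some(([type]) => type === 'drive');
    if (hasDrive && Target !== FileReference)
      throw new TypeError(`cannot coerce a reference with a drive to ${Target.mode}`);
    return new Target(this.tokens);
  }
}

class GenericReference extends Reference { static mode = 'generic'; }
class FileReference extends Reference { static mode = 'file'; }
class WebReference extends Reference { static mode = 'web'; }

const generic = new GenericReference([['dir', 'foo'], ['file', 'bar']]);
const web = generic.coerceTo(WebReference); // succeeds, same tokens, new wrapper
console.log(web.mode); // 'web'
```

A failing coercion in this sketch would be `new FileReference([['drive', 'c|']]).coerceTo(WebReference)`, which throws a `TypeError`.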

@alwinb alwinb changed the title Consider adding an URL-type header? Consider switching to an Order Sorted Algebra model? Oct 7, 2021
@ghost

ghost commented Oct 7, 2021

I don’t think I fully understand what you are describing. How would e.g. a GenericReference differ semantically from a relative URL that was parsed under generic mode? Idem for WebReference and FileReference to web mode and file mode.

Is the idea that the operations available for those would differ? Such as e.g. it being impossible (to try) to add a drive letter to a non‐file relative URL?

Also, that reminds me, I think I remember (and I might be wrong) that it is possible to get a backslash in the path of a web or file URL by resolving a relative URL that was parsed in generic mode (and that has a backslash in its path) against them, which causes some expected properties to not hold. E.g. resolving foo/bar\baz against https://example/ would produce a https://example/foo/bar\baz, which does not round trip serializing and reparsing.
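
The round-trip failure can be checked with the WHATWG `URL` class that ships with Node.js and browsers, independently of any relative-URL library. In this small sketch the input string is just the serialization mentioned above, written out by hand:

```javascript
// The string a resolver would print for foo/bar\baz resolved against https://example/
const printed = 'https://example/foo/bar\\baz';

// Reparsing it as a web (special) URL treats the backslash as a path separator.
const reparsed = new URL(printed);

console.log(reparsed.pathname);         // '/foo/bar/baz'
console.log(reparsed.href === printed); // false, so the string does not round trip
```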

I wonder if this is related in some way. I think it’d make sense if the parsing mode used was kept in the URL record, so that resolution could only be made between two URLs parsed with the same mode.

That might make it a bit difficult to handle relative URLs transparently, though, since (the mode of) the base URL would have to be known when parsing before the relative URL can be handled programmatically, which I think is not desirable.

@alwinb
Owner Author

alwinb commented Oct 7, 2021

I wonder if this is related in some way. I think it’d make sense if the parsing mode used was kept in the URL record, so that resolution could only be made between two URLs parsed with the same mode.

This :) Thanks for cleaning up my mind :)
The only thing I want to add to that, is to allow automatic coercions where possible, to avoid the following as much as possible.

That might make it a bit difficult to handle relative URLs transparently, though, since (the mode of) the base URL would have to be known when parsing before the relative URL can be handled programmatically, which I think is not desirable.

@alwinb
Owner Author

alwinb commented Oct 7, 2021

Also, that reminds me, I think I remember (and I might be wrong) that it is possible to get a backslash in the path of a web or file URL by resolving a relative URL that was parsed in generic mode (and that has a backslash in its path) against them.

That's a good thing to be aware of.
This does make it possible to end up with a web-URL that has a path component that contains a backslash. But it will be percent encoded upon printing, so this is still safe.

With reurl:

> rel = new Url ('foo\\bar', { parser:'non-spec' })
Url { file: 'foo\\bar' }
> url = new Url ('http://foo') .goto (rel)
Url { scheme: 'http', host: 'foo', file: 'foo\\bar', root: '/' }
> String (url)
'http://foo/foo%5Cbar'

More edits

This is problematic though:

> rel = new Url ('c|/foo\\bar', { parser:'file' })
Url { drive: 'c|', root: '/', dirs: [ 'foo' ], file: 'bar' }
> r = new Url ('http://host/') .goto (rel)
Url {
  scheme: 'http',
  host: 'host',
  drive: 'c|',
  root: '/',
  dirs: [ 'foo' ],
  file: 'bar'
}

@alwinb
Owner Author

alwinb commented Oct 24, 2021

Actually, the latter example should be considered a bug in reurl and should throw an error. A more interesting one:

> rel = new Url ('c|/foo\\bar', { parser:'file' })
Url { drive: 'c|', root: '/', dirs: [ 'foo' ], file: 'bar' }
> r = new Url ('//host/', { parser:'http' }) .goto (rel)

I think it should throw, too. However with rel = new Url ('/foo\\bar', { parser:'file' }) I currently think recovering is fine.

@alwinb
Owner Author

alwinb commented Oct 24, 2021

I want to solve this issue by storing the 'mode' on the reurl object / and in my spec, and have the mode of url1 goto url2 be

  • the mode of url1 — if ord url1 < ord url2 or ord url1 == ord url2 == dir
  • the mode of url2 — otherwise

Intuitively; the mode of the 'source' of the first token in the resulting URL.

I quite like this idea so far.
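
The rule above can be sketched as a small function. The token ordering in `ORDER` and the `{ first, mode }` shape of a URL are assumptions made for this illustration; only the comparison itself follows the two bullet points.

```javascript
// Assumed ordering of token types, standing in for the spec's `ord`.
const ORDER = ['scheme', 'auth', 'drive', 'root', 'dir', 'file', 'query', 'hash'];

// A URL is reduced here to the type of its first token plus its parser mode.
const ord = (url) => ORDER.indexOf(url.first);

// Mode of `url1 goto url2`: the mode of whichever URL supplies the first token.
function gotoMode(url1, url2) {
  const o1 = ord(url1);
  const o2 = ord(url2);
  const dir = ORDER.indexOf('dir');
  return (o1 < o2 || (o1 === o2 && o1 === dir)) ? url1.mode : url2.mode;
}

// //host/ (web) goto c|/foo\bar (file): the result starts with the authority, so web.
console.log(gotoMode({ first: 'auth', mode: 'web' }, { first: 'drive', mode: 'file' })); // 'web'

// foo/ (generic) goto http://host/ (web): the result starts with the scheme of url2.
console.log(gotoMode({ first: 'dir', mode: 'generic' }, { first: 'scheme', mode: 'web' })); // 'web'
```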

@ghost

ghost commented Nov 14, 2021

[click to expand] tangent about percent encoding

But it will be percent encoded upon printing, so this is still safe.

It’s been a while since I fully read your spec for the first time, so I’m probably completely misremembering, but at some point I thought percent encoding was a separate step.

If it is not (and I don’t think it actually ever was), I really feel like that would make a lot of sense! So you’d have URLs that can contain non‐ASCII characters in a way that would preserve them when printing/serializing, then you’d have a separate step that converts the characters in each token to be ASCII‐only (using percent encoding and punycode as appropriate). And maybe also the converse operation.

The reason I think this makes a lot of sense is that it would allow people to control the behavior of how converting to and from ASCII occurs. You could have a per token conversion (that chooses between punycode and percent encoding), and also allow for people to choose the behavior of malformed encodings, e.g. xn--.com, xn--abc.com, %GG, %00. (See e.g. whatwg/url#438 and whatwg/url#603)

Though that then raises the question of whether the token strings should be text or byte strings, since when abandoning ASCII, you lose the isomorphism. This is also fairly relevant because percent encoding rather unfortunately operates on bytes.
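
As a rough illustration of such a separate, per-token step, the sketch below converts one token value to ASCII by percent-encoding its non-ASCII UTF-8 bytes, and converts back with a caller-chosen policy for malformed input. The helper names `toAscii` and `fromAscii` and the `strict` option are hypothetical; only `TextEncoder` and `decodeURIComponent` are standard.

```javascript
// Hypothetical helper: percent-encode every non-ASCII byte of a token value.
function toAscii(value) {
  const bytes = new TextEncoder().encode(value); // percent encoding works on UTF-8 bytes
  let out = '';
  for (const b of bytes) {
    out += b < 0x80
      ? String.fromCharCode(b)
      : '%' + b.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

// Hypothetical converse: the caller decides what happens on malformed input.
function fromAscii(value, { strict = true } = {}) {
  try {
    return decodeURIComponent(value);
  } catch (error) {
    if (strict) throw error; // e.g. '%GG' raises a URIError
    return value;            // lenient: leave the token as it was
  }
}

console.log(toAscii('caf\u00e9'));                // 'caf%C3%A9'
console.log(fromAscii('caf%C3%A9'));              // 'café'
console.log(fromAscii('%GG', { strict: false })); // '%GG'
```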

If you feel like it would be worthwhile, I could open a new issue about this!

[end of tangent about percent encoding]


Honestly, the only issue I have with the idea is that it introduces hidden state to URLs that is not preserved when serializing. I think this can cause a lot of confusion with how seemingly identical URLs behave differently. I think serialization (the URL string) should be able to effectively fully describe the URL, such that then parsing it will produce an identical URL.

It would be neat if the parser mode could only have an effect in URLs that cannot be produced by the serializer. Such that when printing, all URLs would be normalized in a way that makes them behave the same in each distinct mode. I’m not sure if that’s currently the case, but I think that would be really elegant!

...
> r = new Url ('//host/', { parser:'http' }) .goto (rel)

I think it should throw, too.

Why, though? I suppose it feels intuitive that new Url("...", {parser: "http"}) would produce an HTTP URL (a “web URL”), and thus that it should behave like one?

I suppose that’s not unreasonable. It feels like it’d be useful to allow for people to describe URLs that are expected to be used in a certain way and get an error early on. That is to say, even if I have only //host, if I know this will eventually be used as a web URL (i.e. be resolved against a web URL), then it makes sense to want to catch a mistake such as a drive being present early on, before it is consolidated into an actual web URL.

I wonder if that could be separated into a different operation… I guess maybe, but that would be really inconvenient to use, I think.

Would it make sense to only keep the “parser mode used to create this URL” in the library itself, rather than on the spec? Then that mode could be used to parse relative URL strings, like url.rel("foo\bar"), but would be ignored when passing a URL record to be resolved against, like url1.rel(url2). Of course that then brings up the “confusion about seemingly identical URLs” aspect again, I suppose.

@alwinb
Owner Author

alwinb commented Nov 14, 2021

Honestly, the only issue I have with the idea is that it introduces hidden state to URLs that is not preserved when serializing.

Yes, that is a downside. And this is a clever idea:

It would be neat if the parser mode could only have an effect in URLs that cannot be produced by the serializer.

Let's try it:

The modes can be broken down into three booleans: 1. converting backslashes to slashes; 2. detecting drive letters; 3. parsed vs. opaque hosts.

  1. is covered by the serialiser. Almost. foo\bar/ may be produced from a generic URL with a single dir token. Parsed in web-mode it has two dir tokens. However, the \ indicates that it must've been a generic URL otherwise it would have been encoded by the serialiser. Also, foo\bar/ is not valid, or rather, sc:/foo\bar/ is not a valid WHATWG URL. But the serialiser can still produce that as per whatwg/url#379.
  2. isn't covered. /c:/foo as a "web-URL" consists of a root, a dir and a file token. However I could specify that such ambiguous drive-letter-like components must be encoded in scheme-less URLs, as the WHATWG doesn't cover those.
  3. the host of //foo as a generic URL is an opaque-host. As a web-URL it is a domain. But mapping a domain to an opaque-host and back is idempotent I think, so it's not too bad. And if the serialiser produces an URL-string with a host that cannot be parsed as a domain, then it must've been an URL with an opaque host originally.

So surprisingly, this may work, allowing some ambiguity in 3.
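
The decomposition into three booleans, and the ambiguity noted in point 1, might be sketched as follows. The flag names and the mode-to-flag table are assumptions for illustration, not the configuration reurl actually uses.

```javascript
// Assumed mapping from parser mode to the three booleans.
const MODE_FLAGS = {
  generic: { convertSlashes: false, detectDrive: false, parseHost: false },
  web:     { convertSlashes: true,  detectDrive: false, parseHost: true  },
  file:    { convertSlashes: true,  detectDrive: true,  parseHost: true  },
};

// Split a relative path string into its dir tokens under the given flags.
function dirTokens(path, { convertSlashes }) {
  const separators = convertSlashes ? /[\/\\]/ : /\//;
  const parts = path.split(separators);
  parts.pop(); // whatever follows the last separator is the file token, not a dir
  return parts;
}

console.log(dirTokens('foo\\bar/', MODE_FLAGS.generic)); // [ 'foo\\bar' ]  (one dir token)
console.log(dirTokens('foo\\bar/', MODE_FLAGS.web));     // [ 'foo', 'bar' ] (two dir tokens)
```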

Would it make sense to only keep the “parser mode used to create this URL” in the library itself, rather than on the spec?

That's what I've done for reurl so far, (but I've not pushed the changes yet)

@ghost

ghost commented Nov 14, 2021

  1. I don’t understand how foo\bar or xxx:/foo\bar/ can be produced by the serializer. Wouldn’t the backslash always be escaped?
  2. Hmm. This seems really unfortunate, because whichever way you decide to disambiguate, you then risk misinterpreting the other case. (I don’t understand what solution you are proposing.) The WHATWG view on drives (to model them as a path component) is really succinct in that regard, I think. Then “being a drive letter” becomes a property of a “path part” token, rather than being a token type by itself. But then that quickly starts raising the question of issue #1 (are “root” and “file” really necessary?) and running into all the unfortunacies of that idea.
  3. Is there any case in which it is unfavorable to parse a potential domain as an opaque host? Does it affect the host in some way? That is, is there some normalization that is performed on domains that is undesirable for opaque hosts?

Honestly, the message I sent was a bit scatterbrained, so I’m kinda surprised that there are so few issues with that idea.

@ghost

ghost commented Nov 17, 2021

I decided to read the spec more carefully again, and I formulated some thoughts. (Not all of which are on‐topic in this issue.)

It seems it does decouple percent encoding from serialization, which I think is really neat!

Other than that, I see now how aaa\bbb can be produced by the serializer, since the percent‐encoding set used for generic URLs differs from that used for special URLs.

Since the WHATWG spec doesn’t specify how schemeless URLs are serialized, would it make sense to use the minimal-special percent‐encoding set for those? I think that might eliminate the need for supplying a percent‐encoding set explicitly, no? (Though it’s probably still useful to allow it.)

I suppose the normalization I asked about with hosts has to do with IP addresses. I wonder if there are any cases in which it is preferred to skip it. Generally if the host looks like an IP address, it is meant to be one. (Though maybe not for legacy IP addresses such as e.g. 2130706433, I suppose.)
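
The difference between a parsed host and an opaque host can be seen with the WHATWG `URL` class in Node.js or a browser. In this sketch `sc` is an arbitrary non-special scheme chosen for illustration:

```javascript
// Special (web) scheme: the host is parsed as a domain or IP address and normalized.
console.log(new URL('http://2130706433/').hostname);  // '127.0.0.1'
console.log(new URL('http://EXAMPLE.com/').hostname); // 'example.com'

// Non-special scheme: the host is opaque and is kept as written.
console.log(new URL('sc://2130706433/').hostname);    // '2130706433'
console.log(new URL('sc://EXAMPLE.com/').hostname);   // 'EXAMPLE.com'
```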

It still feels really tempting for me to say that it would make sense to have “being a drive” be a property of a dir or file token (adapting upto appropriately), but I don’t think it works quite succinctly enough.

Also: Honestly, I’m not sure if this whole conversation is appropriate for this issue. I think it deviates from what the issue proposes, but it tries to tackle the same problems using a different approach.

@alwinb
Owner Author

alwinb commented Jan 28, 2022

I created a new issue for this, which I think states the goal more clearly. And I started to look at decomposing the now 'opaque' modes into configs that can mix and match the desired behaviours.

@alwinb alwinb closed this as completed Feb 9, 2022