Consider switching to an Order Sorted Algebra model? #13
I don’t think I fully understand what you are describing. How would e.g. a Is the idea that the operations available for those would differ? Such as e.g. it being impossible (to try) to add a drive letter to a non-file relative URL?

Also, that reminds me: I think I remember (and I might be wrong) that it is possible to get a backslash in the path of a web or file URL by resolving a relative URL that was parsed in generic mode (and that has a backslash in its path) against them, which causes some expected properties to not hold. E.g. resolving I wonder if this is related in some way.

I think it’d make sense if the parsing mode used was kept in the URL record, so that resolution could only be made between two URLs parsed with the same mode. That might make it a bit difficult to handle relative URLs transparently, though, since (the mode of) the base URL would have to be known when parsing, before the relative URL can be handled programmatically, which I think is not desirable.
This :) Thanks for cleaning up my mind :)
That's a good thing to be aware of. With reurl:
More edits: This is problematic though:
Actually, the latter example should be considered a bug in reurl and should throw an error. A more interesting one:
I think it should throw, too. However with
I want to solve this issue by storing the 'mode' on the reurl object / and in my spec, and have the mode of url1 goto url2 be, intuitively, the mode of the 'source' of the first token in the resulting URL. I quite like this idea so far.
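For illustration, here is a minimal sketch of that rule. All names (ModedURL, modeOfResult) are invented, and the test for “which input contributes the first token” is simplified to “does url2 carry its own scheme token”:

```ts
// Hypothetical sketch (not reurl's actual API): a URL record that remembers
// the mode it was parsed with, plus the rule for picking the mode of a
// resolution result.
type Mode = 'file' | 'web' | 'generic'
type Token = { type: string, value: string }
interface ModedURL { mode: Mode, tokens: Token[] }

// The mode of the result is the mode of whichever input contributes the
// result's first token: if url2 starts with its own scheme token it replaces
// url1 entirely, so its mode wins; otherwise url1's tokens come first.
function modeOfResult (url1: ModedURL, url2: ModedURL): Mode {
  const url2GoesFirst = url2.tokens.length > 0 && url2.tokens[0].type === 'scheme'
  return url2GoesFirst ? url2.mode : url1.mode
}
```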
[tangent about percent encoding]
It’s been a while since I have fully read your spec for the first time, so I’m probably completely misremembering, but at some point I thought percent encoding was a separate step. If it is not (and I don’t think it actually ever was), I really feel like that would make a lot of sense! So you’d have URLs that can contain non‐ASCII characters in a way that would preserve them when printing/serializing, and then you’d have a separate step that converts the characters in each token to be ASCII‐only (using percent encoding and punycode as appropriate). And maybe also the converse operation.

The reason I think this makes a lot of sense is that it would allow people to control the behavior of how converting to and from ASCII occurs. You could have a per‐token conversion (that chooses between punycode and percent encoding), and also allow people to choose the behavior of malformed encodings, e.g. Though that then raises the question of whether the token strings should be text or byte strings, since when abandoning ASCII, you lose the isomorphism. This is also fairly relevant because percent encoding rather unfortunately operates on bytes. If you feel like it would be worthwhile, I could open a new issue about this!

[end of tangent about percent encoding]

Honestly, the only issue I have with the idea is that it introduces hidden state to URLs that is not preserved when serializing. I think this can cause a lot of confusion with how seemingly identical URLs behave differently. I think serialization (the URL string) should be able to effectively fully describe the URL, such that parsing it will then produce an identical URL.

It would be neat if the parser mode could only have an effect on URLs that cannot be produced by the serializer, such that when printing, all URLs would be normalized in a way that makes them behave the same in each distinct mode. I’m not sure if that’s currently the case, but I think that would be really elegant!
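As a rough sketch of the tangent above (purely illustrative; none of these names or choices come from the spec or from reurl), a separate “to ASCII” pass over an already-parsed, possibly non-ASCII token sequence could look like this, using Node’s domainToASCII for hosts and a naive byte-wise percent-encoder elsewhere:

```ts
// Hypothetical two-step model: the parser keeps non-ASCII characters in the
// token values, and this separate pass rewrites every token to ASCII-only.
import { domainToASCII } from 'node:url'

type Token = { type: string, value: string }

// Illustrative percent-encoder: keeps printable ASCII except '%', encodes
// everything else byte-wise. A real spec would use per-token character sets.
const percentEncode = (s: string): string =>
  Array.from(new TextEncoder().encode(s))
    .map(b => b > 0x20 && b < 0x7f && b !== 0x25
      ? String.fromCharCode(b)
      : '%' + b.toString(16).toUpperCase().padStart(2, '0'))
    .join('')

// Punycode for host tokens, percent-encoding for everything else.
const toASCII = (tokens: Token[]): Token[] =>
  tokens.map(t => t.type === 'host'
    ? { ...t, value: domainToASCII(t.value) }
    : { ...t, value: percentEncode(t.value) })
```

A converse fromASCII pass would decode percent escapes and punycode back into text, subject to whatever policy is chosen for malformed encodings, as mentioned above.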
Why, though? I suppose it feels intuitive that I suppose that’s not unreasonable. It feels like it’d be useful to allow people to describe URLs that are expected to be used in a certain way and get an error early on. That is to say, even if I have only I wonder if that could be separated into a different operation… I guess maybe, but that would be really inconvenient to use, I think.

Would it make sense to only keep the “parser mode used to create this URL” in the library itself, rather than in the spec? Then that mode could be used to parse relative URL strings, like
Yes, that is a downside. And this is a clever idea:
Let's try it: the modes can be broken down into three booleans (see the sketch below):

1. converting backslashes to slashes;
2. detecting drive letters;
3. parsed vs. opaque hosts.

So surprisingly, this may work, allowing some ambiguity in 3.
That's what I've done for reurl so far (but I've not pushed the changes yet).
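For what it's worth, a sketch of that decomposition. The flag names and the exact per-mode presets below are guesses for illustration, not taken from the spec or from reurl:

```ts
// Hypothetical decomposition of the parser modes into the three flags above.
interface ParserOptions {
  convertBackslashes: boolean   // 1. treat '\' as '/' while parsing
  detectDriveLetters: boolean   // 2. recognise Windows drive letters
  parseHost: boolean            // 3. parsed (domain/IP) host vs. opaque host
}

// Each named mode then becomes just a preset over the three flags.
const modes: Record<'file' | 'web' | 'generic', ParserOptions> = {
  file:    { convertBackslashes: true,  detectDriveLetters: true,  parseHost: true  },
  web:     { convertBackslashes: true,  detectDriveLetters: false, parseHost: true  },
  generic: { convertBackslashes: false, detectDriveLetters: false, parseHost: false },
}
```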
Honestly, the message I sent was a bit scatterbrained, so I’m kinda surprised that there are so few issues with that idea.
I decided to read the spec more carefully again, and I formulated some thoughts. (Not all of which are on‐topic in this issue.)

It seems it does decouple percent encoding from serialization, which I think is really neat! Other than that, I see now how

Since the WHATWG spec doesn’t specify how schemeless URLs are serialized, would it make sense to use the

I suppose the normalization I asked about with hosts has to do with IP addresses. I wonder if there are any cases in which it is preferred to skip it. Generally if the host looks like an IP address, it is meant to be one. (Though maybe not for legacy IP addresses such as e.g.

It still feels really tempting for me to say that it would make sense to have “being a drive” be a property of a

Also:

Honestly, I’m not sure if this whole conversation is appropriate for this issue. I think it deviates from what the issue proposes, but it tries to tackle the same problems using a different approach.
I created a new issue for this, which I think states the goal more clearly. And I started to look at decomposing the now 'opaque' modes into configs that can mix and match the desired behaviours.
Currently URLs are modelled as sequences of tokens, where the token-values are percent-encoded.
In reurl, I have added a bit of information on the class that wraps around the URL: a flag that indicates if the token values are encoded or not.
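A minimal sketch of that model, with placeholder names and token types (not reurl's actual definitions):

```ts
// Hypothetical shape of the current model: a URL as a token sequence,
// wrapped in a class that records whether the token values are encoded.
type TokenType = 'scheme' | 'authority' | 'drive' | 'root' | 'dir' | 'file' | 'query' | 'fragment'
type Token = { type: TokenType, value: string }

class WrappedURL {
  constructor (
    readonly tokens: Token[],
    readonly encoded: boolean   // true if the token values are percent-encoded
  ) {}
}
```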
I have been thinking for a while now about whether it makes sense to add something similar to the specification. Not (just) for storing info about the percent-encoding, but as a way of distinguishing file, web, generic-hierarchical, and generic-non-hierarchical URLs.
The disadvantage is that it makes things more complex. But the advantage may be that it makes things more clear, instead :)
This is motivated by one of my ideas for an API that supports relative URLs: one based on these four 'sorts'.
In the vast majority of situations, it will be possible to coerce between these sorts implicitly, by only changing the wrapper.
Any ideas about this, @zamfofex, @TimothyGu perhaps?
A sketch for an API
It doesn't make much sense to allow relative non-hierarchical URLs, nor coercions between these and the other sorts.
Non-hierarchical URLs can be represented by a single class, say OpaqueURL.
For the other three there can be classes:
Then the class that is used will select the parser-mode. Unless you need parsing support for backslashes, or drive-letters, or wish to use domains instead of an opaque-host, you can just use a GenericReference.
All of them could be coerced (some possibly failing) to the existing URL class.
(The other option is to use one 'Href' class and store the sort-info within, and expose a number of coercion methods on the class).
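To make the sketch a bit more concrete: only OpaqueURL and GenericReference are named in the text above; the other class names, the method names, and the failure conditions below are placeholders. Coercions only rewrap the shared token sequence, and may throw where the target sort cannot represent it.

```ts
// Hypothetical layout for the four sorts.
type Token = { type: string, value: string }

class OpaqueURL {                   // generic, non-hierarchical
  constructor (readonly tokens: Token[]) {}
}

class GenericReference {            // hierarchical; opaque host, no backslash/drive handling
  constructor (readonly tokens: Token[]) {}
  toWeb (): WebReference { return new WebReference(this.tokens) }    // may throw, e.g. if the host cannot be parsed
  toFile (): FileReference { return new FileReference(this.tokens) } // may throw, e.g. if credentials are present
}

class WebReference {                // parsed host; backslashes treated as slashes
  constructor (readonly tokens: Token[]) {}
  toGeneric (): GenericReference { return new GenericReference(this.tokens) }
}

class FileReference {               // additionally detects drive letters
  constructor (readonly tokens: Token[]) {}
  toGeneric (): GenericReference { return new GenericReference(this.tokens) }
}
```

The single-class alternative mentioned in the parenthesis above would instead store the sort as a field on one Href class and expose the same coercion methods there.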
On a theoretical level, this corresponds to switching from a single-sorted algebra of URLs to an order-sorted model. OSA (Order Sorted Algebra) is perfect for this, and really beautiful too :)
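For readers unfamiliar with OSA, the relevant part is just this; the concrete ordering between the four URL sorts is deliberately left open here:

```latex
(S,\ \le,\ \Sigma) \quad\text{with}\quad \le\ \text{a partial order on the set of sorts } S,
\qquad
s \le s' \ \Longrightarrow\ \iota_{s \le s'} \colon s \hookrightarrow s'
\ \text{(an implicit coercion), so any operation } f \colon s' \to t \text{ in } \Sigma
\ \text{also accepts terms of sort } s.
```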
I don't think it would be too hard to figure out the coercions.