-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regex-syntax
error messages highlight incorrect characters/not handling graphemes correctly
#1132
Comments
Yeah indeed, this is an unfortunate artifact of the current implementation. I realized this when I wrote it and made exactly this trade off. Basically, I perceive three ways to fix this issue:
My philosophy on whether to bring in dependencies for this crate effectively rules out (2). I try to keep the dependency tree very small. We could go with (1), which means this sort of confusing case won't happen. But it makes error messages substantially worse (IMO) in the common case. Which I think leaves option (3). I'm open to going that route, but it will need to satisfy the following things:
I don't have any plans to work on this myself any time soon, so PRs are welcome. I would suggest posting a straw man implementation path before a PR so that we can have a meeting of the minds on implementation strategy. One can submit a PR first, but posting a more detailed proposal might help avoid wasting time. |
What about my last suggestion?
It would at least allow the user to diagnose the syntax error manually. This doesn't rule out working on a more robust solution like what you describe, but would be a big help with (I think) minimal effort. |
I would be open to a PR adding the "byte offset" phrasing. That's a decent idea because it's at least more precise. The "character" phrasing would, I believe, require grapheme segmentation to get correct. |
One thing I would like to add as somebody who spend way too much time on geapheme width: There is no such thing as a universally agreed upon width for graphene clusters. Unicode essentially shrugs and tells you to ask the font (really the notion of monospace width just doesn't make sense for some more exotic scripts where graphemes are important). There is a definition which is commonly used for terminals which the unicode-width crate implements. The problem is this is a per codepoint width definition not per-graphem. Terminal emulators disagree on the exact width (some do grapheme segmentation and break backwards compatability others don't, different emulators support different unicode versions, ...).
I am not sure what you policy is with dependencies exactly but it's worth noting that most downstream users that do care about rendering diagnostics probably pull in unicode-width for their own rendering anyway. To avoid having two different lut in the binary using the dependency could be preferable (behind an optional off by default feature flag since I imagine most users do not care). |
What version of regex are you using?
1.10.2 (latest)
Describe the bug at a high level.
Parsing a regex string with combined characters/grapheme clusters and an error causes the error message to highlight the wrong character. This can make the error impossible to accurately track down based on the error information.
What are the steps to reproduce the behavior?
Consider this program creating a
Regex
from a string with a grapheme cluster and then an error (in this case, invalid hex character in the unicode escape).The error is:
(playground link)
as you can see the
4
is highlighted, while the actual error occurs one grapheme earlier, at theG
. This is relatively benign, but note that grapheme clusters can cause the error arrow to move arbitrarily far in the stream, for example to an otherwise correct occurrence of a syntax element:(playground link)
What is the expected behavior?
An ideal solution would be to have the error arrow be aware of grapheme clusters and show up underneath the correct one for the user.
In absence of that, it would help to include the index of the character that caused the error in textual format. In the first example it'd say "character 7" or at least "byte index 10".
The text was updated successfully, but these errors were encountered: