Support Callouts #20

Open
RedCMD opened this issue Jan 5, 2025 · 6 comments

Comments

@RedCMD

RedCMD commented Jan 5, 2025

the easiest would just be (*FAIL)
which can just be (?!)

https://github.com/kkos/oniguruma/blob/master/doc/RE#L330-L357

https://github.com/kkos/oniguruma/blob/master/doc/CALLOUTS.BUILTIN

@slevithan
Owner

slevithan commented Jan 5, 2025

Yeah, (*FAIL) and (*SKIP)(*FAIL) are specifically called out in the readme under Unsupported features as supportable.

  • The (*FAIL) transformation is of course trivial (I'd probably use (?!) like you suggested, but other good options in JS include [] and \b\B). However, it requires slightly more work in the tokenizer and parser. No built-in callouts are supported yet, so it needs a new AST node type to be visible from toOnigurumaAst.
    • This is low priority since essentially nobody uses (*FAIL). It's not found even once in this collection of tens of thousands of Oniguruma regexes. But pull requests are welcome.
  • Supporting (*SKIP)(*FAIL) (and (*SKIP)(?!)) would require special handling in the EmulatedRegExp subclass, in addition to transformations in the pattern itself.
    • This is low priority since nobody uses it yet (it's brand new, but I also don't expect much adoption). However, it has very nice use cases so I'm tempted to do the work anyway.
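A quick way to sanity-check the (*FAIL) replacements mentioned above is to run each candidate against a few inputs. This is only a sketch: the names `neverMatching`, `samples`, and `alwaysFails` are made up for illustration and are not part of the library.

```javascript
// Three JS patterns that can never match at any position.
const neverMatching = {
  emptyNegativeLookahead: /(?!)/, // "not followed by the empty string" is always false
  emptyCharacterClass: /[]/, // a class with no members matches no character
  boundaryThenNonBoundary: /\b\B/, // a position can't be both a word boundary and not one
};

const samples = ['', 'a', 'a b', '(*FAIL)'];

// True only if none of the samples produce a match for the given regex.
const alwaysFails = (re) => samples.every((s) => !re.test(s));

for (const [name, re] of Object.entries(neverMatching)) {
  console.log(name, alwaysFails(re));
}

// Inside a larger pattern, the failing branch pushes the engine onto the other alternative:
// the `a(?!)` branch fails at index 0, so the match is 'b' at index 1.
const match = /a(?!)|b/.exec('ab');
console.log(match.index, match[0]);
```

This only shows that the replacement text behaves like an unconditional failure; it says nothing about the tokenizer and parser work needed to recognize (*FAIL) in a pattern.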

@slevithan
Owner

I don't think any other callouts except (*FAIL) and the combo (*SKIP)(*FAIL) are supportable using native JS regexes, but then, the Oniguruma docs for built-in callouts are very limited and hard to interpret, so I might be misunderstanding them (I'm mostly relying on my understanding of PCRE/Perl backtracking control verbs to understand them, but they don't fully overlap). If you have ideas for how to emulate others, I'd love to hear them.

Fortunately, not supporting callouts is not much of a loss, since although the (*SKIP)(*FAIL) combo is very useful, others are generally not IMO (one of the reasons no one uses them).

@RedCMD
Author

RedCMD commented Jan 5, 2025

I had been looking at #1
which is outdated now I guess 😅

yea (*SKIP) was only just released in v6.9.10 4 days ago
I haven't used it yet, even tho I sort of requested it :) kkos/oniguruma#299
and prob won't ever, since VSCode won't update :(

(*FAIL) basically doesn't even appear on GitHub
https://github.com/search?q=lang%3Ajson+%22%28*FAIL%29%22&type=code

(*COUNT) and (*CMP) can be used for keeping items in balance without needing recursion
like for:

\(\?
\{(?<g1>\{)?+(?<g2>\{)?+(?<g3>\{)?+(?<g4>\{)?+(?<g5>\{)?+
.+?
\}(?(<g1>)\})(?(<g2>)\})(?(<g3>)\})(?(<g4>)\})(?(<g5>)\})
\)

tho I have no idea how you could implement them in straight JS regex

@slevithan
Owner

slevithan commented Jan 5, 2025

> yea (*SKIP) was only just released in v6.9.10 4 days ago

Yup! I updated the readme shortly after Oniguruma 6.9.10 was released. 😊

> I haven't used it yet, even tho I sort of requested it :)

It's great for things like <[^>]+>(*SKIP)(?!)|\w+ (in this case matching words that aren't within angle brackets). You could pull this off with some convoluted lookaround logic, but this is much simpler and allows you to state what you mean. And the good news is that cases like this can be emulated in JS (with the help of a RegExp subclass).
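For this particular pattern, one way to get the same result in plain JS is to let the unwanted alternative consume its text and then discard those matches. The sketch below assumes that approach; `wordsOutsideTags` is a hypothetical helper, not the library's API, and not necessarily how an EmulatedRegExp subclass would do it.

```javascript
// Hypothetical helper approximating `<[^>]+>(*SKIP)(?!)|\w+`.
// The first alternative swallows a whole <...> run so the search resumes after it;
// only the second alternative captures, so a defined group 1 means "a word outside brackets".
function wordsOutsideTags(input) {
  const re = /<[^>]+>|(\w+)/g;
  const words = [];
  for (const m of input.matchAll(re)) {
    if (m[1] !== undefined) {
      words.push(m[1]);
    }
  }
  return words;
}

console.log(wordsOutsideTags('foo <bar baz> qux <a>b'));
```

The filtering step is the part that cannot live in the pattern itself, which is why some wrapper code around the native regex is needed.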

> (*COUNT) and (*CMP) can be used for keeping items in balance without needing recursion

Can you give an example or explain how they work? The docs for them are difficult to make sense of.

> tho I have no idea how you could implement them in straight JS regex

Unless there's some nuance to them I don't yet understand, it's not possible even with the help of a RegExp subclass (unless you created an entirely new regex engine which would be massively worse than using native RegExp).

@RedCMD
Author

RedCMD commented Jan 6, 2025

({(*COUNT[Tag1]{X}))++(}(*COUNT[Tag2]{X}))+(*CMP{Tag1,==,Tag2})
(*COUNT) counts the number of times it was successfully called and stores it in tag Tag1
{X} means to count successful calls only
(*CMP) compares the 2 tags and only succeeds when both match

(*MAX[Tag1]{5}) could be used instead if you wanted to limit it to 5 brackets max
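To make the count-and-compare idea concrete, here is the same check written as ordinary JS instead of callouts. It is only a sketch of the logic described above: `isBalancedBraceRun` is a hypothetical helper, it anchors to the whole string for simplicity, and it does not try to reproduce Oniguruma's backtracking behavior.

```javascript
// Hypothetical helper: succeeds when the input is a run of `{` followed by
// the same number of `}`. The optional `max` mimics a limit like (*MAX[Tag1]{5}).
function isBalancedBraceRun(input, max = Infinity) {
  const m = /^(\{+)(\}+)$/.exec(input);
  if (!m) return false;
  const opens = m[1].length; // plays the role of Tag1
  const closes = m[2].length; // plays the role of Tag2
  return opens === closes && opens <= max; // the (*CMP{Tag1,==,Tag2}) step
}

console.log(isBalancedBraceRun('{{}}'));
console.log(isBalancedBraceRun('{{}'));
console.log(isBalancedBraceRun('{{{{{{}}}}}}', 5));
```

The comparison happens outside the regex, after matching; a native JS pattern alone has no way to express it.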

seems I haven't implemented it fully in TreeSitter yet

@slevithan
Owner

Nice explanation. Using *COUNT and *CMP in that way could be reproduced in .NET using "balancing groups" (which I wrote about many years ago here). It could maybe be reproduced in some other regex flavors that use the traditional behavior for backreferences to non-participating groups. But not in JS, where backreferences to non-participating groups match the empty string rather than failing to match.
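The JS behavior mentioned above is easy to see with a small test. This is a minimal sketch; the pattern and variable names are chosen only for illustration.

```javascript
// Group 1 does not participate when the `b` alternative is taken.
// In JS, the backreference \1 then matches the empty string, so the overall match succeeds.
const re = /^(?:(a)|b)\1$/;

const result = re.exec('b');
console.log(result !== null); // the match succeeds in JS
console.log(result[1]); // group 1 is undefined because it never participated

// When group 1 does participate, \1 must repeat its text as usual.
console.log(re.test('aa'), re.test('a'));
```

In a flavor with the traditional behavior, the backreference would fail on 'b', and that failure is what tricks for keeping things in balance rely on.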
