Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doubling of STATE_COUNT following dollar_quote feature merging #209

Open
antoineB opened this issue Oct 26, 2023 · 2 comments
Open

Doubling of STATE_COUNT following dollar_quote feature merging #209

antoineB opened this issue Oct 26, 2023 · 2 comments

Comments

@antoineB
Copy link
Contributor

Now (5660d80) the state_count = 13456 roughly doubling, the size of the binary is 1.8Mo previously 1.2Mo.

Either

index 69d0a1d..2421aa2 100644
--- a/grammar.js
+++ b/grammar.js
@@ -1216,7 +1216,7 @@ module.exports = grammar({
         $.keyword_as,
         alias($._dollar_quoted_string_start_tag, $.dollar_quote),
         $._function_body_statement,
-        optional(';'),
+        ';',
         alias($._dollar_quoted_string_end_tag, $.dollar_quote),
       ),
     ),

Or

index 69d0a1d..e8faa61 100644
--- a/grammar.js
+++ b/grammar.js
@@ -1217,7 +1217,7 @@ module.exports = grammar({
         alias($._dollar_quoted_string_start_tag, $.dollar_quote),
         $._function_body_statement,
         optional(';'),
-        alias($._dollar_quoted_string_end_tag, $.dollar_quote),
+        $.dollar_quote,
       ),
     ),

Fix the explosion state_count I don't know why.

@antoineB
Copy link
Contributor Author

I try to have the closer to what we have but unfortunately it did work, I suppose it is because $.dollar_quote and $._dollar_quoted_string_end_tag lex the same chars.

index 69d0a1d..606f042 100644
--- a/grammar.js
+++ b/grammar.js
@@ -21,6 +21,7 @@ module.exports = grammar({
     [$.between_expression, $.binary_expression],
     [$.time],
     [$.timestamp],
+    [$._function_body_rule_a, $._function_body_rule_b],
   ],
 
   precedences: $ => [
@@ -1212,13 +1213,39 @@ module.exports = grammar({
           $.literal
         ),
       ),
-      seq(
-        $.keyword_as,
-        alias($._dollar_quoted_string_start_tag, $.dollar_quote),
-        $._function_body_statement,
-        optional(';'),
-        alias($._dollar_quoted_string_end_tag, $.dollar_quote),
-      ),
+      // This should be one rule but it is decomposed into 2 rules (A, B) to
+      // avoid state parser increase.
+      //
+      // It will parse valid SQL, but rule B will not fail if there is unbalanced
+      // dollar_quote or if there is another other dollar_quote inside
+      // $._function_body_statement with the same tag as the external dollar_quote.
+      //
+      // In the long run the whole body function should be parser by another
+      // parser based on the language specified before.
+      //
+      // Rule A
+      $._function_body_rule_a,
+      // Rule B
+      $._function_body_rule_b,
+    ),
+
+    _function_body_rule_a: $ => seq(
+      $.keyword_as,
+      alias($._dollar_quoted_string_start_tag, $.dollar_quote),
+      $._function_body_statement,
+      ';', // This should be optional(';') but this will create a lot (~6000)
+      // of state from any ending of $.statement to
+      // $._dollar_quoted_string_start_tag.
+      alias($._dollar_quoted_string_end_tag, $.dollar_quote),
+    ),
+
+    _function_body_rule_b: $ => seq(
+      $.keyword_as,
+      $.dollar_quote,
+      $._function_body_statement,
+      // There is an optimisation of the compiler so it dosn't generate any
+      // ending of $.statement to $.dollar_quote possible state.
+      $.dollar_quote,
     ),
 
     function_language: $ => seq(

@antoineB
Copy link
Contributor Author

I forgot to capture one the conflicting rule:

index 69d0a1d..e6e3e21 100644
--- a/grammar.js
+++ b/grammar.js
@@ -21,6 +21,7 @@ module.exports = grammar({
     [$.between_expression, $.binary_expression],
     [$.time],
     [$.timestamp],
+    [$._function_body_a, $._function_body_b],
   ],
 
   precedences: $ => [
@@ -1180,6 +1181,29 @@ module.exports = grammar({
         ),
         $.keyword_end,
       ),
+      seq(
+        $.keyword_as,
+        alias(
+          choice(
+            $._single_quote_string,
+            $._double_quote_string,
+          ),
+          $.literal
+        ),
+      ),
+      $._function_body_a,
+      $._function_body_b,
+    ),
+    // This should be one rule but it is decomposed into 2 rules (A, B) to
+    // avoid state parser increase.
+    //
+    // It will parse valid SQL, but rule B will not fail if there is unbalanced
+    // dollar_quote or if there is another other dollar_quote inside
+    // $._function_body_statement with the same tag as the external dollar_quote.
+    //
+    // In the long run the whole body function should be parser by another
+    // parser based on the language specified before.
+    _function_body_a: $ => choice(
       seq(
         $.keyword_as,
         alias($._dollar_quoted_string_start_tag, $.dollar_quote),
@@ -1202,24 +1226,24 @@ module.exports = grammar({
         optional(';'),
         alias($._dollar_quoted_string_end_tag, $.dollar_quote),
       ),
-      seq(
-        $.keyword_as,
-        alias(
-          choice(
-            $._single_quote_string,
-            $._double_quote_string,
-          ),
-          $.literal
-        ),
-      ),
       seq(
         $.keyword_as,
         alias($._dollar_quoted_string_start_tag, $.dollar_quote),
         $._function_body_statement,
-        optional(';'),
+        ';', // This should be optional(';') but this will create a lot (~6000)
+             // of state from any ending of $.statement to
+             // $._dollar_quoted_string_start_tag.
         alias($._dollar_quoted_string_end_tag, $.dollar_quote),
       ),
     ),
+    _function_body_b: $ => seq(
+      $.keyword_as,
+      $.dollar_quote,
+      $._function_body_statement,
+      // There is an optimisation of the compiler so it dosn't generate any
+      // ending of $.statement to $.dollar_quote possible state.
+      $.dollar_quote,
+    ),
 
     function_language: $ => seq(
       $.keyword_language,

But that still doesn't work.
I suppose treesitter compiler make the assumption to do some optimisations that 2 tokens ($._dollar_quoted_string_end_tag and $.dollar_quote in our case) can't match the same chars.

And the solution described in the docs (https://tree-sitter.github.io/tree-sitter/creating-parsers#other-external-scanner-details) like

  externals: $ => [
    $._dollar_quoted_string_start_tag,
    $._dollar_quote
    $._dollar_quoted_string,
  ],

The two solutions for me are:

  1. replace optional(';') with ';'
  2. don't use $._dollar_quoted_string_start_tag and $._dollar_quoted_string_end_tag and replace then with $.dollar_quote

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant