diff --git a/README.md b/README.md index effbd6b..0763c7a 100644 --- a/README.md +++ b/README.md @@ -36,6 +36,15 @@ Then open a [discussion](https://github.com/jmespath-community/jmespath.spec/dis Once the discussion has reached a mature level of feedback, it will be assigned a JEP number _N_. At that point it is safe to begin composing JEP-_N_ along with all the required changes and documentation. +In rare circumstances, JEPs will be superceded by newer versions of a JEP. Those +updated JEPs will have an amended letter sequence added as a suffix to the original JEP#. +The sequence follows the 26 US-ASCII alphabetical order 'a', 'b', to 'z'. It then follows with +'aa', 'ab', to 'az' and so on. + +For instance, [JEP-12](./jep-012-raw-string-literals.md) introduced `raw-string` +literals but the grammar defined there did not match that which was eventually specified. An update has been published as +[JEP-12a](./jep-012a-raw-string-literals.md). + ### Tenets of JMESPath When proposing new features, keep these tenets in mind. Adhering to these tenets gives your proposal a higher likelihood of being accepted: diff --git a/jep-012-raw-string-literals.md b/jep-012-raw-string-literals.md index 3540c1f..c3e5665 100644 --- a/jep-012-raw-string-literals.md +++ b/jep-012-raw-string-literals.md @@ -6,6 +6,7 @@ | **Author** | Michael Dowling | **Status** | accepted | **Created**| 09-Apr-2015 +| **Obsoleted By**| [JEP-12a](./jep-012a-raw-string-literals.md) ## Abstract @@ -52,8 +53,7 @@ which produces the string literal value of `foo`: `"foo"` ``` -The following expression is functionally equivalent. Notice that the quotes are -elided from the JSON literal: +The following expression is functionally equivalent. Notice that the quotes are elided from the JSON literal: ``` `foo` diff --git a/jep-012a-raw-string-literals.md b/jep-012a-raw-string-literals.md new file mode 100644 index 0000000..5c205a2 --- /dev/null +++ b/jep-012a-raw-string-literals.md @@ -0,0 +1,350 @@ +# Raw String Literals + +||| +|---|--- +| **JEP** | 12a +| **Author** | Michael Downling, Maxime Labelle, Richard Gibson +| **Status** | accepted +| **Created**| 19-Nov-2022 +| **Obsoletes**| [JEP-12](./jep-012-raw-string-literals.md) + +## Abstract + +This JEP proposes the following modifications to JMESPath in order to improve +the usability of the language and ease the implementation of parsers: + + +* Addition of a **raw string literal** to JMESPath that will allow direct +expression of string contents that would otherwise be modified by +interpretation as JSON (e.g., `'\n'`, `'\r'`, `'\u005C'`). + + +* Deprecation of the current literal parsing behavior that allows for unquoted +JSON strings to be parsed as JSON strings, removing an ambiguity in the +JMESPath grammar and helping to ensure consistency among implementations. + +This proposal seeks to add the following syntax to JMESPath: + +``` +'foobar' +'foo\'bar' +`bar` -> Parse error +``` + +## Motivation + +Raw string literals are provided in [various programming languages](https://en.wikipedia.org/wiki/String_literal#Raw_strings) in order to prevent +language specific interpretation (i.e., JSON parsing) and remove the need for +escaping, avoiding a common problem called [leaning toothpick syndrome (LTS)](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome). Leaning toothpick +syndrome is an issue in which strings become unreadable due to excessive use of +escape characters in order to avoid delimiter collision (e.g., `"\\\\\\"`). + +When evaluating a JMESPath expression, it is often necessary to utilize string +literals that are not extracted from the data being evaluated, but rather +statically part of the compiled JMESPath expression. String literals are useful +in many areas, but most notably when invoking functions or building up +multi-select lists and hashes. + +The following expression produces the three-character string “foo” using a `` `…` `` JSON text literal: + +``` +`"foo"` +``` + +### Obsolete alternative + +_This section recapitulates content from the original version of JEP-12 that is no longer accurate as of this replacement._ + +The following expression is functionally equivalent. Notice that the quotes are +elided from the JSON literal: +``` +`foo` +``` + +These string literals are parsed using a JSON parser according to +[RFC 4627](https://www.ietf.org/rfc/rfc4627.txt), which will expand unicode +escape sequences, newline characters, and several other escape sequences +documented in RFC 4627 section 2.5. + +For example, the use of an escaped unicode value `\u002B` is expanded into +`+` in the following JMESPath expression: + +``` +`"foo\u002B"` -> "foo+" +``` + +You can escape escape sequences in JSON literals to prevent an escape sequence +from being expanded: + +``` +`"foo\\u002B"` -> "foo\u002B" +`foo\\u002B` -> "foo\u002B" +``` + +While this allows you to provide literal strings, it presents the following +problems: + + +1. Incurs an additional JSON parsing penalty. + + +2. Requires the cognitive overhead of escaping escape characters if you +actually want the data to be represented as it was literally provided +(which can lead to LTS). If the data being escaped was meant to be used +along with another language that uses `\` as an escape character, then the +number of backslash characters doubles. + + +3. Introduces an ambiguous rule to the JMESPath grammar that requires a prose +based specification to resolve the ambiguity in parser implementations. + +The relevant literal grammar rules are currently defined as follows: + +```abnf +literal = "`" json-value "`" +literal =/ "`" 1*(unescaped-literal / escaped-literal) "`" +unescaped-literal = %x20-21 / ; space ! + %x23-5B / ; # - [ + %x5D-5F / ; ] ^ _ + %x61-7A ; a-z + %x7C-10FFFF ; |}~ ... +escaped-literal = escaped-char / (escape %x60) +json-value = false / null / true / json-object / json-array / + json-number / json-quoted-string +false = %x66.61.6c.73.65 ; false +null = %x6e.75.6c.6c ; null +true = %x74.72.75.65 ; true +json-quoted-string = %x22 1*(unescaped-literal / escaped-literal) %x22 +begin-array = ws %x5B ws ; [ left square bracket +begin-object = ws %x7B ws ; { left curly bracket +end-array = ws %x5D ws ; ] right square bracket +end-object = ws %x7D ws ; } right curly bracket +name-separator = ws %x3A ws ; : colon +value-separator = ws %x2C ws ; , comma +ws = *(%x20 / ; Space + %x09 / ; Horizontal tab + %x0A / ; Line feed or New line + %x0D ; Carriage return + ) +json-object = begin-object [ member *( value-separator member ) ] end-object +member = quoted-string name-separator json-value +json-array = begin-array [ json-value *( value-separator json-value ) ] end-array +json-number = [ minus ] int [ frac ] [ exp ] +decimal-point = %x2E ; . +digit1-9 = %x31-39 ; 1-9 +e = %x65 / %x45 ; e E +exp = e [ minus / plus ] 1*DIGIT +frac = decimal-point 1*DIGIT +int = zero / ( digit1-9 *DIGIT ) +minus = %x2D ; - +plus = %x2B ; + +zero = %x30 ; 0 +``` + +The `literal` rule is ambiguous because `unescaped-literal` includes +all of the same characters that `json-value` match, allowing any value +that is valid JSON to be matched on either `unescaped-literal` or +`json-value`. + +### Rationale + +When implementing parsers for JMESPath, one must provide special case parsing +when parsing JSON literals due to the allowance of elided quotes around JSON +string literals (e.g., `` `foo` ``). This specific aspect of JMESPath cannot be +described unambiguously in a context free grammar and could become a common +cause of errors when implementing JMESPath parsers. + +Parsing JSON literals has other complications as well. Here are the steps +needed to currently parse a JSON literal value in JMESPath: + + +1. When a `` ` `` token is encountered, begin parsing a JSON literal. + + +2. Collect each character between the opening `` ` `` and closing `` ` `` +tokens, including any escaped `` ` `` characters (i.e., ``\` ``) and store the +characters in a variable (let’s call it `$lexeme`). + + +3. Copy the contents of `$lexeme` to a temporary value in which all leading +and trailing whitespace is removed. Let’s call this `$temp` (this is +currently not documented but required in the +[JMESPath compliance tests](https://github.com/jmespath/jmespath.test/blob/c532a20e3bca635fb6ca248e5e955e1bd146a965/tests/syntax.json#L592-L606)). + + +4. If `$temp` can be parsed as valid JSON, then use the parsed result as the +value for the literal token. + + +5. If `$temp` cannot be parsed as valid JSON, then wrap the contents of +`$lexeme` in double quotes and parse the wrapped value as a JSON string, +making the following expressions equivalent: `` `foo` `` == `` `"foo"` ``, and +`` `[1, ]` `` == `` `"[1, ]"` ``. + +It is reasonable to assume that the most common use case for a JSON literal in +a JMESPath expression is to provide a string value to a function argument or +to provide a literal string value to a value in a multi-select list or +multi-select hash. In order to make providing string values easier, it was +decided that JMESPath should allow the quotes around the string to be elided. + +This proposal posits that allowing quotes to be elided when parsing JSON +literals should be prohibited in favor of the proper string literal syntax. + +## Specification + +A raw string literal is value that begins and ends with a single quote +and preserves embedded backslashes except those used to escape +backslash or single quote characters. + +### Examples + +Here are several examples of valid raw string literals and how they are +parsed: + + +* A basic raw string literal, representing the seven-character string “foo bar”: + +``` +'foo bar' +``` + + +* A raw string literal with an escaped single quote, representing the seven-character string “foo'bar”: + +``` +'foo\'bar' +``` + + +* A raw string literal with an escaped backslash character, representing the seven-character string “foo\bar”: + +``` +'foo\\bar' +``` + + +* A raw string literal that contains new lines: + +``` +'foo +bar +baz!' +``` + +The above expression represents the multi-line string: + +``` +foo +bar +baz! +``` + + +* A raw string literal that contains a preserved backslash character, representing the eight-character string “foo\nbar”: + +``` +'foo\nbar' +``` + +### ABNF + +The following ABNF grammar rules will be added: + +```abnf +expression =/ raw-string +raw-string = "'" *raw-string-char "'" +raw-string-char = (%x00-26 / ; ␀ through '&' (precedes U+0027 APOSTROPHE "'") + %x28-5B / ; '(' through '[' (precedes U+005C REVERSE SOLIDUS '\') + %x5D-10FFFF) / ; ']' and all following code points + preserved-escape / + raw-string-escape +preserved-escape = escape ( + %x00-26 / ; ␀ through '&' (precedes U+0027 APOSTROPHE "'") + %x28-5B / ; '(' through '[' (precedes U+005C REVERSE SOLIDUS '\') + %x5D-10FFFF) ; ']' and all following code points +raw-string-escape = escape ( + "'" / ; U+0027 APOSTROPHE "'" + escape) ; U+005C REVERSE SOLIDUS '\' + +``` + +These rules allow any character inside of a raw string, +including control characters and escaped single quotes or backslashes. + +In addition to adding a `raw-string` rule, the `literal` rule in the ABNF +will be simplified to become: + +``` +literal = "`" json-text "`" +``` + +## Impact + +The impact to existing users of JMESPath is that the use of a JSON literal +in which the quotes are elided MUST be quoted or converted to use the +raw-string rule of the grammar. + +To accommodate legacy JMESPath implementations, all of +the JSON literal compliance test cases that involve elided quotes MUST be +removed, and test cases regarding failing on invalid unquoted JSON values +MUST NOT be allowed in the compliance test unless placed in a JEP 12 specific +test suite, allowing such implementations to filter them out. + +## Alternative approaches + +There are several alternative approaches that could be taken. + +### Leave as-is + +This is a valid and reasonable suggestion. Leaving JMESPath as-is would avoid +a breaking change to the grammar and users could continue to use multiple +escape characters to avoid delimiter collision. + +The goal of this proposal is not to add functionality to JMESPath, but rather +to make the language easier to use, easier to reason about, and easier to +implement. As it currently stands, the behavior of JSON parsing is ambiguous +and requires special casing when implementing a JMESPath parser. It also allows +for minor differences in implementations due to this ambiguity. + +Take the following example: + +``` +`[1` +``` + +One implementation may interpret this expression as a JSON string with the +string value of `"[1"`, while other implementations may raise a parse error +because the first character of the expression appears to be valid JSON. + +By updating the grammar to require valid JSON in the JSON literal token, we can +remove this ambiguity completely, removing a potential source of inconsistency +from the various JMESPath implementations. + +### Disallow single quotes in a raw string + +This proposal states that single quotes in a raw string literal must be escaped +with a backslash. An alternative approach could be to not allow single quotes +in a raw string literal. While this would simplify the `raw-string` grammar +rule, it would severely limit the usability of the `raw-string` rule, forcing +users to use the `literal` rule. + +### Use a customizable delimiter + +Several languages allow for a custom delimiter to be placed around a raw +string. For example, Lua allows for a [long bracket](https://www.lua.org/manual/5.2/manual.html#3.1) notation in which raw +strings are surrounded by `[[]]` with any number of balanced = characters +between the brackets: + +``` +[==[foo=bar]==] -- parsed as "foo=bar" +``` + +This approach is very flexible and removes the need to escape any characters; +however, this can not be expressed in a regular grammar. A parser would need to +keep track of the number of opened delimiters and ensure that it is closed with +the appropriate number of matching characters. + +The addition of a string literal as described in this JEP does not preclude a +later addition of a heredoc or delimited style string literal as provided by +languages like Lua, [D](https://dlang.org/lex.html#DelimitedString), +[C++](https://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals), etc…