Skip to content

Commit

Permalink
Replaced obsolete JEP-12 with JEP-12-01. (#138)
Browse files Browse the repository at this point in the history
* Replace obsolete JEP-12 with JEP-12-01.

* Aligned with fixes made on JEP-12.

* Documented pattern for JEP# addendums.

* Updated meaning of `then currently defined`.

* Formatted JSON strings using double-quotes consistently.

* Improved wording around embedded backslash preservations.

Co-authored-by: Richard Gibson <[email protected]>

* Fixed incorrect reference to grammar rule.

Co-authored-by: Richard Gibson <[email protected]>

* Legacy syntax is officially deprecated and MUST be updated.

Co-authored-by: Richard Gibson <[email protected]>

* Legacy syntax is definitely deprecated.

Co-authored-by: Richard Gibson <[email protected]>

* Clarify literal rule more explicitly.

Co-authored-by: Richard Gibson <[email protected]>

* Deprecated means forbidden.

Co-authored-by: Richard Gibson <[email protected]>

* Consistently formatted JSON strings using double-quotes.

* Finally reverted to representing abstract strings with “ and ” curly quotes.

* Improved wording.

* Updated requirements for JEP-12 specific tests.

Co-authored-by: Richard Gibson <[email protected]>

* Renamed grammar rules.

Co-authored-by: Richard Gibson <[email protected]>

* Renamed addendum for better lexicographical sorting.

* Explained JEP# amendment letter sequence.

* Renamed and re-authored JEP-12a.

* Addressed further feedback from review.

* Improved wording introducing single-quoted strings.

Co-authored-by: Richard Gibson <[email protected]>

* Legacy syntax being deprecated, warning is no longer an option.

Co-authored-by: Richard Gibson <[email protected]>

* Update jep-012a-raw-string-literals.md

Co-authored-by: Richard Gibson <[email protected]>

* Update jep-012a-raw-string-literals.md

Co-authored-by: Richard Gibson <[email protected]>

* Update jep-012a-raw-string-literals.md

Co-authored-by: Richard Gibson <[email protected]>

* baz bar -> bar baz

Co-authored-by: Richard Gibson <[email protected]>

* Anchored `raw-string` to `expression`grammar rule.

Co-authored-by: Richard Gibson <[email protected]>

* Plural.

Co-authored-by: Richard Gibson <[email protected]>

* Clarifying preserved backslash character.

Co-authored-by: Richard Gibson <[email protected]>

* Using proper Unicode casing.

Co-authored-by: Richard Gibson <[email protected]>

* Proper Unicode casing (again)

Co-authored-by: Richard Gibson <[email protected]>

* Clarifying requirements around JEP-12 specific tests.

Co-authored-by: Richard Gibson <[email protected]>

Co-authored-by: Richard Gibson <[email protected]>
  • Loading branch information
springcomp and gibson042 authored Jan 17, 2023
1 parent 5174701 commit cd5bbb2
Show file tree
Hide file tree
Showing 3 changed files with 361 additions and 2 deletions.
9 changes: 9 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,15 @@ Then open a [discussion](https://github.com/jmespath-community/jmespath.spec/dis
Once the discussion has reached a mature level of feedback, it will be assigned a JEP number _N_.
At that point it is safe to begin composing JEP-_N_ along with all the required changes and documentation.

In rare circumstances, JEPs will be superceded by newer versions of a JEP. Those
updated JEPs will have an amended letter sequence added as a suffix to the original JEP#.
The sequence follows the 26 US-ASCII alphabetical order 'a', 'b', to 'z'. It then follows with
'aa', 'ab', to 'az' and so on.

For instance, [JEP-12](./jep-012-raw-string-literals.md) introduced `raw-string`
literals but the grammar defined there did not match that which was eventually specified. An update has been published as
[JEP-12a](./jep-012a-raw-string-literals.md).

### Tenets of JMESPath

When proposing new features, keep these tenets in mind. Adhering to these tenets gives your proposal a higher likelihood of being accepted:
Expand Down
4 changes: 2 additions & 2 deletions jep-012-raw-string-literals.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
| **Author** | Michael Dowling
| **Status** | accepted
| **Created**| 09-Apr-2015
| **Obsoleted By**| [JEP-12a](./jep-012a-raw-string-literals.md)

## Abstract

Expand Down Expand Up @@ -52,8 +53,7 @@ which produces the string literal value of `foo`:
`"foo"`
```

The following expression is functionally equivalent. Notice that the quotes are
elided from the JSON literal:
The following expression is functionally equivalent. Notice that the quotes are elided from the JSON literal:

```
`foo`
Expand Down
350 changes: 350 additions & 0 deletions jep-012a-raw-string-literals.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,350 @@
# Raw String Literals

|||
|---|---
| **JEP** | 12a
| **Author** | Michael Downling, Maxime Labelle, Richard Gibson
| **Status** | accepted
| **Created**| 19-Nov-2022
| **Obsoletes**| [JEP-12](./jep-012-raw-string-literals.md)

## Abstract

This JEP proposes the following modifications to JMESPath in order to improve
the usability of the language and ease the implementation of parsers:


* Addition of a **raw string literal** to JMESPath that will allow direct
expression of string contents that would otherwise be modified by
interpretation as JSON (e.g., `'\n'`, `'\r'`, `'\u005C'`).


* Deprecation of the current literal parsing behavior that allows for unquoted
JSON strings to be parsed as JSON strings, removing an ambiguity in the
JMESPath grammar and helping to ensure consistency among implementations.

This proposal seeks to add the following syntax to JMESPath:

```
'foobar'
'foo\'bar'
`bar` -> Parse error
```

## Motivation

Raw string literals are provided in [various programming languages](https://en.wikipedia.org/wiki/String_literal#Raw_strings) in order to prevent
language specific interpretation (i.e., JSON parsing) and remove the need for
escaping, avoiding a common problem called [leaning toothpick syndrome (LTS)](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome). Leaning toothpick
syndrome is an issue in which strings become unreadable due to excessive use of
escape characters in order to avoid delimiter collision (e.g., `"\\\\\\"`).

When evaluating a JMESPath expression, it is often necessary to utilize string
literals that are not extracted from the data being evaluated, but rather
statically part of the compiled JMESPath expression. String literals are useful
in many areas, but most notably when invoking functions or building up
multi-select lists and hashes.

The following expression produces the three-character string “foo” using a `` `…` `` JSON text literal:

```
`"foo"`
```

### Obsolete alternative

_This section recapitulates content from the original version of JEP-12 that is no longer accurate as of this replacement._

The following expression is functionally equivalent. Notice that the quotes are
elided from the JSON literal:
```
`foo`
```

These string literals are parsed using a JSON parser according to
[RFC 4627](https://www.ietf.org/rfc/rfc4627.txt), which will expand unicode
escape sequences, newline characters, and several other escape sequences
documented in RFC 4627 section 2.5.

For example, the use of an escaped unicode value `\u002B` is expanded into
`+` in the following JMESPath expression:

```
`"foo\u002B"` -> "foo+"
```

You can escape escape sequences in JSON literals to prevent an escape sequence
from being expanded:

```
`"foo\\u002B"` -> "foo\u002B"
`foo\\u002B` -> "foo\u002B"
```

While this allows you to provide literal strings, it presents the following
problems:


1. Incurs an additional JSON parsing penalty.


2. Requires the cognitive overhead of escaping escape characters if you
actually want the data to be represented as it was literally provided
(which can lead to LTS). If the data being escaped was meant to be used
along with another language that uses `\` as an escape character, then the
number of backslash characters doubles.


3. Introduces an ambiguous rule to the JMESPath grammar that requires a prose
based specification to resolve the ambiguity in parser implementations.

The relevant literal grammar rules are currently defined as follows:

```abnf
literal = "`" json-value "`"
literal =/ "`" 1*(unescaped-literal / escaped-literal) "`"
unescaped-literal = %x20-21 / ; space !
%x23-5B / ; # - [
%x5D-5F / ; ] ^ _
%x61-7A ; a-z
%x7C-10FFFF ; |}~ ...
escaped-literal = escaped-char / (escape %x60)
json-value = false / null / true / json-object / json-array /
json-number / json-quoted-string
false = %x66.61.6c.73.65 ; false
null = %x6e.75.6c.6c ; null
true = %x74.72.75.65 ; true
json-quoted-string = %x22 1*(unescaped-literal / escaped-literal) %x22
begin-array = ws %x5B ws ; [ left square bracket
begin-object = ws %x7B ws ; { left curly bracket
end-array = ws %x5D ws ; ] right square bracket
end-object = ws %x7D ws ; } right curly bracket
name-separator = ws %x3A ws ; : colon
value-separator = ws %x2C ws ; , comma
ws = *(%x20 / ; Space
%x09 / ; Horizontal tab
%x0A / ; Line feed or New line
%x0D ; Carriage return
)
json-object = begin-object [ member *( value-separator member ) ] end-object
member = quoted-string name-separator json-value
json-array = begin-array [ json-value *( value-separator json-value ) ] end-array
json-number = [ minus ] int [ frac ] [ exp ]
decimal-point = %x2E ; .
digit1-9 = %x31-39 ; 1-9
e = %x65 / %x45 ; e E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = %x2D ; -
plus = %x2B ; +
zero = %x30 ; 0
```

The `literal` rule is ambiguous because `unescaped-literal` includes
all of the same characters that `json-value` match, allowing any value
that is valid JSON to be matched on either `unescaped-literal` or
`json-value`.

### Rationale

When implementing parsers for JMESPath, one must provide special case parsing
when parsing JSON literals due to the allowance of elided quotes around JSON
string literals (e.g., `` `foo` ``). This specific aspect of JMESPath cannot be
described unambiguously in a context free grammar and could become a common
cause of errors when implementing JMESPath parsers.

Parsing JSON literals has other complications as well. Here are the steps
needed to currently parse a JSON literal value in JMESPath:


1. When a `` ` `` token is encountered, begin parsing a JSON literal.


2. Collect each character between the opening `` ` `` and closing `` ` ``
tokens, including any escaped `` ` `` characters (i.e., ``\` ``) and store the
characters in a variable (let’s call it `$lexeme`).


3. Copy the contents of `$lexeme` to a temporary value in which all leading
and trailing whitespace is removed. Let’s call this `$temp` (this is
currently not documented but required in the
[JMESPath compliance tests](https://github.com/jmespath/jmespath.test/blob/c532a20e3bca635fb6ca248e5e955e1bd146a965/tests/syntax.json#L592-L606)).


4. If `$temp` can be parsed as valid JSON, then use the parsed result as the
value for the literal token.


5. If `$temp` cannot be parsed as valid JSON, then wrap the contents of
`$lexeme` in double quotes and parse the wrapped value as a JSON string,
making the following expressions equivalent: `` `foo` `` == `` `"foo"` ``, and
`` `[1, ]` `` == `` `"[1, ]"` ``.

It is reasonable to assume that the most common use case for a JSON literal in
a JMESPath expression is to provide a string value to a function argument or
to provide a literal string value to a value in a multi-select list or
multi-select hash. In order to make providing string values easier, it was
decided that JMESPath should allow the quotes around the string to be elided.

This proposal posits that allowing quotes to be elided when parsing JSON
literals should be prohibited in favor of the proper string literal syntax.

## Specification

A raw string literal is value that begins and ends with a single quote
and preserves embedded backslashes except those used to escape
backslash or single quote characters.

### Examples

Here are several examples of valid raw string literals and how they are
parsed:


* A basic raw string literal, representing the seven-character string “foo bar”:

```
'foo bar'
```


* A raw string literal with an escaped single quote, representing the seven-character string “foo'bar”:

```
'foo\'bar'
```


* A raw string literal with an escaped backslash character, representing the seven-character string “foo\bar”:

```
'foo\\bar'
```


* A raw string literal that contains new lines:

```
'foo
bar
baz!'
```

The above expression represents the multi-line string:

```
foo
bar
baz!
```


* A raw string literal that contains a preserved backslash character, representing the eight-character string “foo\nbar”:

```
'foo\nbar'
```

### ABNF

The following ABNF grammar rules will be added:

```abnf
expression =/ raw-string
raw-string = "'" *raw-string-char "'"
raw-string-char = (%x00-26 / ; ␀ through '&' (precedes U+0027 APOSTROPHE "'")
%x28-5B / ; '(' through '[' (precedes U+005C REVERSE SOLIDUS '\')
%x5D-10FFFF) / ; ']' and all following code points
preserved-escape /
raw-string-escape
preserved-escape = escape (
%x00-26 / ; ␀ through '&' (precedes U+0027 APOSTROPHE "'")
%x28-5B / ; '(' through '[' (precedes U+005C REVERSE SOLIDUS '\')
%x5D-10FFFF) ; ']' and all following code points
raw-string-escape = escape (
"'" / ; U+0027 APOSTROPHE "'"
escape) ; U+005C REVERSE SOLIDUS '\'
```

These rules allow any character inside of a raw string,
including control characters and escaped single quotes or backslashes.

In addition to adding a `raw-string` rule, the `literal` rule in the ABNF
will be simplified to become:

```
literal = "`" json-text "`"
```

## Impact

The impact to existing users of JMESPath is that the use of a JSON literal
in which the quotes are elided MUST be quoted or converted to use the
raw-string rule of the grammar.

To accommodate legacy JMESPath implementations, all of
the JSON literal compliance test cases that involve elided quotes MUST be
removed, and test cases regarding failing on invalid unquoted JSON values
MUST NOT be allowed in the compliance test unless placed in a JEP 12 specific
test suite, allowing such implementations to filter them out.

## Alternative approaches

There are several alternative approaches that could be taken.

### Leave as-is

This is a valid and reasonable suggestion. Leaving JMESPath as-is would avoid
a breaking change to the grammar and users could continue to use multiple
escape characters to avoid delimiter collision.

The goal of this proposal is not to add functionality to JMESPath, but rather
to make the language easier to use, easier to reason about, and easier to
implement. As it currently stands, the behavior of JSON parsing is ambiguous
and requires special casing when implementing a JMESPath parser. It also allows
for minor differences in implementations due to this ambiguity.

Take the following example:

```
`[1`
```

One implementation may interpret this expression as a JSON string with the
string value of `"[1"`, while other implementations may raise a parse error
because the first character of the expression appears to be valid JSON.

By updating the grammar to require valid JSON in the JSON literal token, we can
remove this ambiguity completely, removing a potential source of inconsistency
from the various JMESPath implementations.

### Disallow single quotes in a raw string

This proposal states that single quotes in a raw string literal must be escaped
with a backslash. An alternative approach could be to not allow single quotes
in a raw string literal. While this would simplify the `raw-string` grammar
rule, it would severely limit the usability of the `raw-string` rule, forcing
users to use the `literal` rule.

### Use a customizable delimiter

Several languages allow for a custom delimiter to be placed around a raw
string. For example, Lua allows for a [long bracket](https://www.lua.org/manual/5.2/manual.html#3.1) notation in which raw
strings are surrounded by `[[]]` with any number of balanced = characters
between the brackets:

```
[==[foo=bar]==] -- parsed as "foo=bar"
```

This approach is very flexible and removes the need to escape any characters;
however, this can not be expressed in a regular grammar. A parser would need to
keep track of the number of opened delimiters and ensure that it is closed with
the appropriate number of matching characters.

The addition of a string literal as described in this JEP does not preclude a
later addition of a heredoc or delimited style string literal as provided by
languages like Lua, [D](https://dlang.org/lex.html#DelimitedString),
[C++](https://en.wikipedia.org/wiki/C%2B%2B11#New_string_literals), etc…

0 comments on commit cd5bbb2

Please sign in to comment.