Add like_regex predicate to JSON path #24616

Open
wants to merge 3 commits into
base: master

piotrrzysko
Member

Description

This PR introduces the like_regex predicate to JSON path. It aims to comply with the ISO/IEC 9075-2:2016(E) standard. The only feature defined in the standard that is not implemented here is the case-insensitive mode (the i flag). The reason for this is described in this issue: #24615.

Relevant links to the specifications referenced by the standard:

I want to highlight a specific part of the specification for reviewers to verify if my interpretation is correct.

The ISO/IEC 9075-2:2016(E) standard, in Section 9.22, specifies the following:

1) Let S be the STRING, let PAT be the PATTERN, let SP be the POSITION, let U be the UNITS, and let FL 
   be the FLAG in an application of the General Rules of this Subclause. The result of the application of 
   this Subclause is LOMV, which is returned as LIST.
2) Case:
    a) If the character repertoire of PAT is UCS, then let P be PAT.
    b) Otherwise, let P be the result of an implementation-defined transliteration of PAT to UCS.
3) Case:
    a) If the character repertoire of FL is UCS, then let F be FL.
    b) Otherwise, let F be the result of an implementation-defined transliteration of FL to UCS.
...

9) An XQuery regular expression match is defined as follows:
   Case:
   a) If the character repertoire of S is UCS, then an XQuery regular expression match is a match as defined
      by [XQueryFO] for the function fn:matches(), with F as the value of the parameter $flags, with
      the following modifications:
      ...
   b) Otherwise, the definition of XQuery regular expression match is implementation-defined.

Points 2 and 3 specify that if the provided arguments contain characters outside the UCS repertoire, they should be transliterated to UCS. My understanding is that if the arguments are valid UTF-8 sequences, they contain only characters within the UCS repertoire. Even if the sequence includes code points that do not have characters currently assigned to them, they are still part of the repertoire, as specified in Unicode Technical Report #17:

For the Unicode Standard, on the other hand, the repertoire is inherently open. Because Unicode is intended to be the universal encoding, any abstract character that ever could be encoded is potentially a member of the set to be encoded, whether that character is currently known or not.

Point 9 states that the definition of a regular expression match is implementation-defined when the character repertoire of the provided subject string is not UCS. The approach in this PR is similar to what we do for string functions: if the string is not valid UTF-8, the result is undefined.
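To make the UTF-8 argument concrete, here is a minimal Python sketch (not code from this PR). The helper `is_valid_utf8` is a hypothetical name introduced for illustration. It shows that a code point with no assigned character still forms a well-formed UTF-8 sequence, while malformed byte sequences are the inputs for which the match result would be left undefined.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# U+0378 had no character assigned when this sketch was written, yet it
# still encodes to well-formed UTF-8, so it stays inside the open repertoire.
unassigned = "\u0378".encode("utf-8")

print(is_valid_utf8("zażółć".encode("utf-8")))  # ordinary text
print(is_valid_utf8(unassigned))                # unassigned code point
print(is_valid_utf8(b"\xff"))                   # 0xFF never occurs in UTF-8
print(is_valid_utf8(b"\xc3\x28"))               # lead byte without a continuation byte
```

Only the last two inputs fall into the "not UCS" case discussed above.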

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Add like_regex predicate to the JSON path language. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jan 2, 2025
@piotrrzysko piotrrzysko requested review from kasiafi and martint January 2, 2025 17:41
This commit adds support for XQuery regular expressions specified in:
- https://www.w3.org/TR/xmlschema-2/
- https://www.w3.org/TR/xquery-operators/
It includes extensions defined in the ISO/IEC 9075-2:2016(E) standard.

Due to the lack of an off-the-shelf regular expression engine supporting
the syntax specified in ISO/IEC 9075-2:2016(E), this commit introduces
a translator to Joni regular expressions.

This benchmark measures the performance gap between Joni and
XQuerySqlRegex. The comparison is, in some scenarios,
unfair because the semantics of the exact same pattern may differ
between these two implementations. The main difference arises from how
each defines a line boundary. Specifically, XQuerySqlRegex includes the
two-character sequence CRLF as a line boundary and treats it as a single
character. This may, in some cases, result in significant performance
overhead.
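
The semantic difference can be sketched with Python's `re` module standing in for a conventional engine. This is an illustration only, not the translator from this commit: the `LINE_END` and `ANY_CHAR` fragments are hypothetical hand-written rewrites, and they cover only CR, LF, and CRLF, whereas the real implementation may recognize more line terminators.

```python
import re

subject = "a\r\nb"

# Conventional semantics: only "\n" ends a line, so "$" cannot match right
# after "a", because the next character is "\r".
plain_multiline = re.compile(r"a$", re.MULTILINE)

# Emulated semantics: a line ends before CRLF, a lone CR or LF, or at the
# end of the input. CRLF is listed first so it is consumed as one unit.
LINE_END = r"(?=\r\n|[\r\n]|\Z)"
emulated_multiline = re.compile("a" + LINE_END)

# Conventional dot-all: "." consumes exactly one character, so "a.b" fails
# on "a\r\nb" (two characters sit between "a" and "b").
plain_dotall = re.compile(r"a.b", re.DOTALL)

# Emulated dot-all: CRLF counts as a single "character".
ANY_CHAR = r"(?:\r\n|.)"
emulated_dotall = re.compile("a" + ANY_CHAR + "b", re.DOTALL)

print(plain_multiline.search(subject) is not None)
print(emulated_multiline.search(subject) is not None)
print(plain_dotall.search(subject) is not None)
print(emulated_dotall.search(subject) is not None)
```

The emulated patterns replace a single primitive with a lookahead or an alternation, which hints at where the extra cost in the MULTI_LINE and DOT_ALL cases could come from.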

Compilation:
Benchmark    (benchmarkCase)  Mode  Cnt      Score     Error  Units
Joni             SINGLE_LINE  avgt   90    588.958 ±   4.707  ns/op
Joni              MULTI_LINE  avgt   90    642.081 ±   4.851  ns/op
Joni                ANY_CHAR  avgt   90    312.904 ±   2.434  ns/op
Joni                 DOT_ALL  avgt   90    362.500 ±   2.941  ns/op
Joni       WHITESPACE_ESCAPE  avgt   90    420.033 ±   3.296  ns/op
XQuerySql        SINGLE_LINE  avgt   90   2262.583 ±  14.107  ns/op
XQuerySql         MULTI_LINE  avgt   90   3030.253 ±  22.030  ns/op
XQuerySql           ANY_CHAR  avgt   90    626.098 ±   5.144  ns/op
XQuerySql            DOT_ALL  avgt   90   1109.101 ±   8.525  ns/op
XQuerySql  WHITESPACE_ESCAPE  avgt   90   1290.943 ±  12.132  ns/op

Pattern matching:
Benchmark      (benchmarkCase)  Mode  Cnt      Score     Error  Units
Joni               SINGLE_LINE  avgt   90    669.933 ±   5.456  ns/op
Joni                MULTI_LINE  avgt   90    836.196 ±   6.131  ns/op
Joni                  ANY_CHAR  avgt   90    183.930 ±   1.261  ns/op
Joni                   DOT_ALL  avgt   90    154.698 ±   1.178  ns/op
Joni         WHITESPACE_ESCAPE  avgt   90    131.898 ±   0.974  ns/op
XQuerySql          SINGLE_LINE  avgt   90    967.832 ±  44.436  ns/op
XQuerySql           MULTI_LINE  avgt   90   8156.072 ±  63.472  ns/op
XQuerySql             ANY_CHAR  avgt   90    637.084 ±   5.667  ns/op
XQuerySql              DOT_ALL  avgt   90   1601.855 ±  30.025  ns/op
XQuerySql    WHITESPACE_ESCAPE  avgt   90    157.574 ±   1.313  ns/op
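
For easier reading, the scores above can be turned into slowdown factors. The short Python sketch below only divides the XQuerySql score by the Joni score for each case; the numbers are copied from the two tables, and the `slowdown` helper is a name introduced here for illustration.

```python
# Average time in ns/op as (Joni, XQuerySql), copied from the tables above.
compilation = {
    "SINGLE_LINE": (588.958, 2262.583),
    "MULTI_LINE": (642.081, 3030.253),
    "ANY_CHAR": (312.904, 626.098),
    "DOT_ALL": (362.500, 1109.101),
    "WHITESPACE_ESCAPE": (420.033, 1290.943),
}
matching = {
    "SINGLE_LINE": (669.933, 967.832),
    "MULTI_LINE": (836.196, 8156.072),
    "ANY_CHAR": (183.930, 637.084),
    "DOT_ALL": (154.698, 1601.855),
    "WHITESPACE_ESCAPE": (131.898, 157.574),
}


def slowdown(results):
    """Map each benchmark case to XQuerySql time divided by Joni time."""
    return {case: xquery / joni for case, (joni, xquery) in results.items()}


for title, results in (("Compilation", compilation), ("Pattern matching", matching)):
    print(title)
    for case, ratio in slowdown(results).items():
        print(f"  {case:<18}{ratio:6.2f}x")
```

With these figures, compilation is roughly 2x to 4.7x slower, while matching ranges from about 1.2x (WHITESPACE_ESCAPE) to about 10.4x (DOT_ALL), with MULTI_LINE close behind at about 9.8x.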