Add Regex DSL #1302

Open
rlouf opened this issue Nov 30, 2024 · 2 comments

@rlouf
Member

rlouf commented Nov 30, 2024

Why?

Regular expressions are a very compact DSL for generating DFAs, and they can be intimidating at first. We cannot expect users to be fluent in this DSL.

How?

We can make this easier by designing a simple Python DSL that compiles into regular expressions. It can be as simple as defining functions that return a regex string:

def one_or_more(pattern: str):
    return f"({pattern})+"
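A minimal sketch of this function-only approach, checked against the standard `re` module. The helpers `optional`, `either` and `is_valid` are hypothetical names added for illustration and are not part of the proposal:

```python
import re


def one_or_more(pattern: str) -> str:
    return f"({pattern})+"


def optional(pattern: str) -> str:
    # Hypothetical helper, not part of the proposal.
    return f"({pattern})?"


def either(*patterns: str) -> str:
    # Hypothetical helper, not part of the proposal.
    return "(" + "|".join(patterns) + ")"


def is_valid(pattern: str) -> bool:
    # True if the string compiles as a regular expression.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# An optional minus sign followed by one or more binary digits.
binary_number = optional("-") + one_or_more(either("0", "1"))

assert re.fullmatch(binary_number, "-1011")
assert re.fullmatch(binary_number, "12") is None

# A literal parenthesis passed as-is yields a pattern that does not compile.
assert not is_valid(one_or_more("("))
```

The functions compose through plain string concatenation, which works as long as no argument contains a regex metacharacter.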

In practice this is slightly more complex, because some characters need to be escaped in literal expressions (such as ( and )), and we cannot expect users who are not familiar with regular expressions to know that. A solution would be to create a RegexStr object that abstracts this away, for instance:

import re


def escape(string):
    # Escape every regex metacharacter in a literal, not only a lone "(" or ")".
    return re.escape(string)


class RegexStr(str):
    def __add__(self, other):
        if isinstance(other, RegexStr):
            return RegexStr(f"{self}{other}")
        else:
            return RegexStr(f"{self}{escape(other)}")

    def __radd__(self, other):
        if isinstance(other, RegexStr):
            return RegexStr(f"{other}{self}")
        else:
            return RegexStr(f"{escape(other)}{self}")

    def __repr__(self):
        # Return the regex in plain string format.
        return str(self)


def exactly(quantity, *regex_strs):
    regex_str = ''.join(regex_strs)
    if regex_str == '':
        raise ValueError('regex_strs argument must have at least one nonblank value')
    # Group the pattern so the quantifier applies to all of it, not only to its last character.
    return RegexStr('(' + regex_str + '){' + str(quantity) + '}')

print("(" + exactly(3, "abc") + ")")

This is, however, a very rough design, and I am open to alternatives.

@rlouf rlouf added this to the 0.1 milestone Nov 30, 2024
@RobinPicard
Contributor

RobinPicard commented Jan 11, 2025

To me, this proposal has the advantage of being the most straightforward to use, but I wonder whether it is too implicit and magical: the user would not necessarily understand from the example that their plain strings are being modified. I see two cases in which that would lead to unexpected behavior:

  • The user combines an already formatted regex with the output of one of our functions, for instance "[a-zA-Z]+" + exactly(3, "abc"). In this case we would escape the special characters in their regex, which would change its behavior.
  • The user uses an f-string instead of addition, or stores the output of our functions in a variable and then uses implicit concatenation to add it to other strings, for instance f"({exactly(3, 'abc')})". I think that could happen too: if the user does not understand that we are modifying their plain strings, they have no reason to think that adding strings is not equivalent to using f-strings or implicit concatenation.
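Both cases can be reproduced with a small sketch. It assumes a simplified `RegexStr` that escapes plain strings with `re.escape` and an `exactly` that groups its argument; both are illustrative stand-ins, not the final design:

```python
import re


class RegexStr(str):
    """A string that is already a valid regex fragment (illustrative)."""

    def __add__(self, other):
        if not isinstance(other, RegexStr):
            other = re.escape(other)  # plain strings are treated as literals
        return RegexStr(str(self) + str(other))

    def __radd__(self, other):
        if not isinstance(other, RegexStr):
            other = re.escape(other)
        return RegexStr(str(other) + str(self))


def exactly(quantity, *parts):
    return RegexStr("(" + "".join(parts) + "){" + str(quantity) + "}")


# Case 1: the user's own regex is escaped and now only matches itself literally.
escaped = "[a-zA-Z]+" + exactly(3, "abc")
assert re.fullmatch(escaped, "helloabcabcabc") is None
assert re.fullmatch(escaped, "[a-zA-Z]+abcabcabc")

# Case 2: addition escapes the parentheses, an f-string does not.
with_add = "(" + exactly(3, "abc") + ")"
with_fstring = f"({exactly(3, 'abc')})"
assert re.fullmatch(with_add, "(abcabcabc)")
assert re.fullmatch(with_fstring, "(abcabcabc)") is None
assert re.fullmatch(with_fstring, "abcabcabc")  # the parentheses became a group
```

The f-string result is an ordinary `str`, so nothing downstream can tell that the escaping step was skipped.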

The alternatives I can think of have the disadvantage of being wordier and of requiring the user to know more about the functions we provide.


1. We make them use a function for their literal expressions.

So that would look like:

lit("(") + exactly(3, "abc") + lit(")")

Advantages:

  • Less error-prone
  • The examples make it clear to users that literal expressions cannot be used directly in a regex

Disadvantages:

  • Harder to read
  • The user needs to do and understand more, so the DSL brings less value in a sense
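A rough sketch of alternative 1. It assumes that adding a bare string to a `RegexStr` raises a `TypeError`, which is one possible way to force the use of `lit` and is not something settled in this thread; `_fragment` is a hypothetical helper:

```python
import re


class RegexStr(str):
    """Marker type for strings that are already valid regex fragments."""

    def __add__(self, other):
        if not isinstance(other, RegexStr):
            # Assumption: bare strings are rejected rather than silently escaped.
            raise TypeError("wrap plain strings in lit() before adding them")
        return RegexStr(str.__add__(self, other))

    def __radd__(self, other):
        # Only reached when the left operand is not a RegexStr.
        raise TypeError("wrap plain strings in lit() before adding them")


def lit(text: str) -> RegexStr:
    # Turn a literal expression into an escaped regex fragment.
    return RegexStr(re.escape(text))


def _fragment(part) -> str:
    # Hypothetical helper: escape plain strings, keep RegexStr values as-is.
    return part if isinstance(part, RegexStr) else re.escape(part)


def exactly(quantity, *parts) -> RegexStr:
    inner = "".join(_fragment(p) for p in parts)
    return RegexStr("(" + inner + "){" + str(quantity) + "}")


pattern = lit("(") + exactly(3, "abc") + lit(")")
assert re.fullmatch(pattern, "(abcabcabc)")

try:
    "(" + exactly(3, "abc")
except TypeError as exc:
    print("rejected:", exc)
```

Rejecting bare strings makes the mistake visible at the point of concatenation, at the cost of the extra `lit` calls listed among the disadvantages.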

2. We make them put the functions and the literal expressions they use in a root function.

So, for instance:

create_regex(
  "(",
  exactly(3, "abc"),
  ")"
)

Advantages:

  • Less error-prone, as we could specify in the interface documentation of create_regex that every argument must be either the output of one of our functions or a literal expression (which we will escape for them)
  • No need to put literal expressions in a function as in alternative 1

Disadvantages:

  • Harder to read
  • We still need to handle the case in which the user wants to use an already formatted regex (maybe provide a function reg that just makes sure the string is not escaped as a literal expression)
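A possible shape for alternative 2, including the `reg` escape hatch mentioned in the last bullet. The `Regex` wrapper class and the `_to_pattern` helper are assumptions made for this sketch:

```python
import re


class Regex:
    """Hypothetical wrapper marking a string as an already formatted regex."""

    def __init__(self, pattern: str):
        self.pattern = pattern


def reg(pattern: str) -> Regex:
    # Pass a hand-written regex through without escaping it.
    return Regex(pattern)


def _to_pattern(part) -> str:
    # Wrapped values are used as-is; plain strings are escaped as literals.
    return part.pattern if isinstance(part, Regex) else re.escape(part)


def exactly(quantity, *parts) -> Regex:
    inner = "".join(_to_pattern(p) for p in parts)
    if inner == "":
        raise ValueError("exactly() needs at least one nonblank value")
    return Regex(f"({inner}){{{quantity}}}")


def create_regex(*parts) -> str:
    # The single place where literals are escaped and fragments are joined.
    return "".join(_to_pattern(p) for p in parts)


literal_parens = create_regex("(", exactly(3, "abc"), ")")
assert re.fullmatch(literal_parens, "(abcabcabc)")

with_raw_regex = create_regex(reg("[a-zA-Z]+"), exactly(3, "abc"))
assert re.fullmatch(with_raw_regex, "helloabcabcabc")
```

Because `Regex` is not a `str` subclass in this sketch, the result of `exactly` cannot be mistaken for a plain string before it goes through `create_regex`.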

What do you think?

@rlouf
Member Author

rlouf commented Jan 11, 2025

For your first example, we could turn RegexStr into an object (which we would probably rename) that contains the regex string, so:

RegexStr("[a-zA-Z]+") + exactly(3, "abc")

This is fine design-wise: the common path is easy, but it still allows more sophisticated use cases. It does not, however, solve the issue with f-strings.
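A minimal sketch of that idea, assuming `RegexStr` becomes a plain wrapper class with a `pattern` attribute (the attribute name and the `_as_pattern` helper are hypothetical). It shows the first case working and the f-string case still slipping through:

```python
import re


class RegexStr:
    """Wrapper holding a regex string; plain strings added to it are escaped."""

    def __init__(self, pattern: str):
        self.pattern = pattern

    def __add__(self, other):
        return RegexStr(self.pattern + _as_pattern(other))

    def __radd__(self, other):
        return RegexStr(_as_pattern(other) + self.pattern)

    def __str__(self):
        return self.pattern


def _as_pattern(value) -> str:
    # Hypothetical helper: unwrap RegexStr values, escape plain strings.
    return value.pattern if isinstance(value, RegexStr) else re.escape(value)


def exactly(quantity, *parts) -> RegexStr:
    inner = "".join(_as_pattern(p) for p in parts)
    return RegexStr("(" + inner + "){" + str(quantity) + "}")


# An explicitly wrapped regex is no longer escaped.
word_then_abc = RegexStr("[a-zA-Z]+") + exactly(3, "abc")
assert re.fullmatch(word_then_abc.pattern, "helloabcabcabc")

# The f-string still bypasses __add__/__radd__, so the parentheses stay unescaped.
via_fstring = f"({exactly(3, 'abc')})"
assert re.fullmatch(via_fstring, "(abcabcabc)") is None
```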
