Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freeze/bad performance on escaped quotation in f-string #4520

Open
scherrsasrf opened this issue Nov 26, 2024 · 16 comments · May be fixed by #4536
Open

Freeze/bad performance on escaped quotation in f-string #4520

scherrsasrf opened this issue Nov 26, 2024 · 16 comments · May be fixed by #4536
Labels
F: strings Related to our handling of strings T: bug Something isn't working

Comments

@scherrsasrf
Copy link

Describe the bug
black hangs (or takes an extremely long time?) to process a file with a multiline f-string containing a bunch of escaped quotation marks.

To Reproduce
Run black against this file:

def testing():
    somevar = "some value"
    return f"""
        {somevar} the following line causes the issue:
         \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" 
    """

Will just hang without any output.

Expected behavior
Should run to completion within at most 1 second.

Environment

  • Black's version: 24.10.0 (was working with 24.4.2, probably introduced in 24.8.0)
  • OS and Python version: macOS/Python 3.13.0

Additional context
Extracting the quotes into a variable seems to circumvent the issue:

def testing():
    somevar = "some value"
    quotes = """\" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \""""
    return f"""
        {somevar} the following line causes the issue:
         {quotes} 
    """
@scherrsasrf scherrsasrf added the T: bug Something isn't working label Nov 26, 2024
@JelleZijlstra JelleZijlstra added the F: strings Related to our handling of strings label Nov 26, 2024
@MeGaGiGaGon
Copy link
Collaborator

MeGaGiGaGon commented Nov 26, 2024

The hang happens inside tokenization at this line endmatch = endprog.match(line)

At that point endprog is this regex ((?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)|(?:\\.|"(?!"")|[^"\\])*""") which has catastrophic backtracking.

Why I think it catastrophically backtracks

Since the regex is pcre2 compatible, I can use regex101's debugger. It looks like the issue is since the first repeated group (?:\\N{|{{|\\"|"(?!"")|[^"{]) contains both "(?!"") and [^"{] as alternatives, it is guaranteed to match any sequence of characters as long as they don't contain a {. Thus this first repeat will always match up to the end of the string. After this, since the string doesn't contain {, the match for it will cause the entire regex to backtrack. This is already bad, with simple long strings taking hundreds of steps, but gets much worse when \"s are added. Every \ and " is a valid starting point for the first repeat, so each one makes the entire regex run again, with one less length, for (I think) O(nlog(n)) performance.

That regex got into endprog_stack at this line endprog_stack.append(endprog) from endprog = endprogs[token]

At that point token is f""", as expected.

endprogs[f"""] is generated at this line **{f'{prefix}"""': double3prog_plus_lbrace for prefix in _fstring_prefixes},

double3prog_plus_lbrace is generated by double3prog_plus_lbrace = re.compile(group(Double3Lbrace, Double3))

Double3Lbrace is the offending portion, being set here Double3Lbrace = r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)'

Notably this is commented with # beginning of a triple quoted f-string. must not end with {{ or \N{, so perhaps this is being misused to generate part of endprogs? Not sure of the original intent here.

It looks like single quoted f-strings don't add their same version of the regex to endprog_stack, so they don't hang. I'm not sure why there is that difference between single and triple quoted strings.

@JelleZijlstra
Copy link
Collaborator

Thanks for the analysis @MeGaGiGaGon! Also cc @tusharsadhwani.

@tusharsadhwani
Copy link
Collaborator

Thanks for the analysis! I'm guessing it is a backtracking bug as well. I'm hoping to reproduce the slowness in regex101, and look for a workaround that doesn't break anything.

@tusharsadhwani
Copy link
Collaborator

Reproduced: https://regex101.com/r/tmjlX1/1

Adding any more \" in there causes regex101 to bail out. Every new \" multiplies the backtrack count by 2.

@tusharsadhwani
Copy link
Collaborator

https://regex101.com/r/H8yqIO/1/debugger

And this debugger view shows what's going on: for every combination of number of \"'s present it tries to match the end sequence and fails, except at the very last one.

@MeGaGiGaGon
Copy link
Collaborator

I don't know why regex101 is weird about it, but if you put Double3Lbrace in a non-capturing group and add an |a after, it will show you the amount of steps needed to fail +1. Works both in PCRE2 mode and python. https://regex101.com/r/Nt4Wuh/1

@tusharsadhwani
Copy link
Collaborator

tusharsadhwani commented Dec 5, 2024

@JelleZijlstra while I'm debugging this, I do believe the tokenizer might be in need for a rewrite without using any regex, either in Python itself (if you think that won't be too slow), or in another langauge (which can also bring a nice perf. boost)

I have already written a tokenizer that is written in Zig and is much faster, passes all tests in black, and it can be turned into a Python package. Would you consider adopting it into black?

Or if not, would porting the exact same logic (it iterates over each character with almost no backtracking) into Python be too slow?

@JelleZijlstra
Copy link
Collaborator

Thanks for looking into this! If it works well, I'd consider switching to your tokenizer, but dependencies that need compilation can make it harder for users to use Black. In the past, that was the motivation for us to switch from regex to re. I'm not familiar with Zig at all; I assume you'd provide wheels for common platforms, but for people not covered by those wheels, it might be even harder to use the package as they'd need a Zig compiler.

I'm also not convinced this issue is a reason to throw out the entire current tokenizer. It seems it should be possible to write a local fix for this issue.

@tusharsadhwani
Copy link
Collaborator

It seems it should be possible to write a local fix for this issue.

Correct. I'm doing that today hopefully.

it might be even harder to use the package as they'd need a Zig compiler.

That's correct. Alternative would be to do a naïve Python port of the logic, making it much simpler, but without using any regex at all. Have you looked into that at all? Do you think it'll be a major performance hit?

@tusharsadhwani
Copy link
Collaborator

Understood the backtracking problem. Compare these two:

https://regex101.com/r/SPaNxh/1

vs.

https://regex101.com/r/o24Ekt/1

First one is linear time, second one is exponential time for the number of \" and \N{ present. Second one just excludes matches that end in {{. I'll try to move the {{ exclusion logic to Python instead, should fix the bug.

@JelleZijlstra
Copy link
Collaborator

Alternative would be to do a naïve Python port of the logic, making it much simpler, but without using any regex at all. Have you looked into that at all? Do you think it'll be a major performance hit?

I haven't tried that. I don't have a good intuition as to what the performance impact would be.

@tusharsadhwani
Copy link
Collaborator

Alright. I'll do a simple port and benchmark it over the weekend.

@tusharsadhwani
Copy link
Collaborator

The plan for the fix right now is:

  • Change the Regex to get rid of the { at the end:

     # beginning of a single quoted f-string. must not end with `{{` or `\N{`
    -SingleLbrace = r"(?:\\N{|{{|\\'|[^\n'{])*(?<!\\N)({)(?!{)"
    -DoubleLbrace = r'(?:\\N{|{{|\\"|[^\n"{])*(?<!\\N)({)(?!{)'
    +SingleLbrace = r"(?:\\N{|{{|\\'|[^\n'{])*(?<!\\N)"
    +DoubleLbrace = r'(?:\\N{|{{|\\"|[^\n"{])*(?<!\\N)'
     
     # beginning of a triple quoted f-string. must not end with `{{` or `\N{`
    -Single3Lbrace = r"(?:\\N{|{{|\\'|'(?!'')|[^'{])*(?<!\\N){(?!{)"
    -Double3Lbrace = r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)'
    +Single3Lbrace = r"(?:\\N{|{{|\\'|'(?!'')|[^'{])*(?<!\\N)"
    +Double3Lbrace = r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N)'
  • Add a check wherever we use the token to find an FSTRING_MIDDLE, if the token is followed by a {{, to save some state and try to find the next possible match just after the {{ instead.

@tusharsadhwani
Copy link
Collaborator

tusharsadhwani commented Dec 18, 2024

Okay I tried attempting this and the states are too complicated and hard to work with right now.

In other news, my Python 3.12 tokenizer rewrite is currently passing >99.5% of black's test suite, so that's probably good?

I tried running my tokenizer on every single Python file in the dependency tree of one of my projects, and it accurately tokenized 98% of the files (3849 / 3902):

Screenshot 2024-12-19 at 2 42 29 AM

And this will probably be zero dependencies and only some 400 lines of pure Python code.

@tusharsadhwani
Copy link
Collaborator

Screenshot 2024-12-20 at 1 49 47 AM

Tokenizer parity is at 100%.

@tusharsadhwani
Copy link
Collaborator

I have a put up a rough version here:

https://github.com/tusharsadhwani/pytokens

This is a quick port from Zig to Python, and I may have made small errors. I'll check tomorrow and try to throw it into Black.

@tusharsadhwani tusharsadhwani linked a pull request Dec 22, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
F: strings Related to our handling of strings T: bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants