Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix[lang]: fix encoding of string literals #3091

Merged

Conversation

charles-cooper
Copy link
Member

@charles-cooper charles-cooper commented Sep 12, 2022

What I did

fix #3088, fix #2799

How I did it

How to verify it

Commit message

this commit fixes bad runtime encoding of unicode strings. `parse_Str`
used the utility function `string_to_bytes`, which rejects characters
with value larger than 255 and otherwise produces the ascii encoding
of the string. the issue is that bytes in the range 128-255 specify
different characters in utf-8 than in ascii encodings, resulting in
different values at runtime than at compile-time.

this can be seen from differing compile-vs-runtime behavior of
`keccak256` (this example was provided in GH issue 3088):

```vyper
@external @view def compile_hash() -> bytes32:
    return keccak256("è")

@external @view def runtime_hash() -> bytes32:
    s: String[1] = "è" return keccak256(s)
```

this commit fixes and simplifies `parse_Str` by using python's
`str.encode()` builtin, which encodes using utf-8 by default. it also
increases strictness of string validation to reject bytes in the range
128-255, since in utf-8 these can encode multibyte characters, which we
reject in vyper (see more discussion in GH issue 2338).

Description for the changelog

Cute Animal Picture

Put a link to a cute animal picture inside the parenthesis-->

since 4a5866c, string literals have been encoded incorrectly (now
enshrined in the string_to_bytes function). this commit changes the
interpretation of string literals to just use the utf-8 encoding, which
matches other interpretations of string literals in constant folding.
reasoning:
- hex literals can no longer be interpreted as Bytes
- strings can have multi-character utf-8 encoding, which would result in
  a discrepancy between runtime and compile-time len() (since we do not
  do any utf-8 decoding at runtime).
@charles-cooper charles-cooper changed the title Fix/bytestring encoding fix: string encodings Sep 12, 2022
@codecov-commenter
Copy link

codecov-commenter commented Sep 12, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.00%. Comparing base (9b5523e) to head (3b81ba0).
Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3091   +/-   ##
=======================================
  Coverage   91.99%   92.00%           
=======================================
  Files         119      119           
  Lines       16673    16662   -11     
  Branches     2803     2801    -2     
=======================================
- Hits        15339    15330    -9     
+ Misses        916      915    -1     
+ Partials      418      417    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@charles-cooper charles-cooper added this to the v0.4.0 milestone Nov 8, 2023
@charles-cooper charles-cooper removed this from the v0.4.0 milestone Feb 13, 2024
@charles-cooper charles-cooper changed the title fix: string encodings fix[lang]: fix string encodings Dec 29, 2024
@charles-cooper charles-cooper changed the title fix[lang]: fix string encodings fix[lang]: fix encoding of string literals Dec 29, 2024
@charles-cooper charles-cooper marked this pull request as ready for review December 29, 2024 16:00
@@ -11,7 +11,7 @@
import warnings
from typing import Generic, List, TypeVar, Union

from vyper.exceptions import CompilerPanic, DecimalOverrideException, InvalidLiteral, VyperException
from vyper.exceptions import CompilerPanic, DecimalOverrideException, VyperException

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
vyper.exceptions
begins an import cycle.
@charles-cooper charles-cooper added this to the v0.4.1 milestone Jan 2, 2025
@cyberthirst
Copy link
Collaborator

from what i understand the issue originated in how we encoded strings - different code sections used different encodings.

can we add an utility for encoding that would called whenever we need to encode something?

eg in keccak we now have

elif isinstance(value, vy_ast.Str):
            value = value.value.encode()

which relies on the default encoding - would be better to be explicit and specify the encoding and use the same function everywhere

@charles-cooper charles-cooper enabled auto-merge (squash) January 12, 2025 16:13
@charles-cooper charles-cooper merged commit 43259f8 into vyperlang:master Jan 12, 2025
158 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants