-
-
Notifications
You must be signed in to change notification settings - Fork 31k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089
Comments
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid undefined variable value. * Update related C API tests (test_capi.test_unicode).
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * PyUnicode_AsUTF8AndSize() now sets `*size` to 0 on error to avoid undefined variable value. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to 0, to avoid undefined value.
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes.
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.
* PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <[email protected]>
Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.
PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.
On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.
PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
I added PyUnicode_AsUTF8() to the limited C API and it now raises an exception if the string contains embedded null characters. |
This makes |
If it's a performance issue, we can cache the "contains embedded null characters/bytes" information in PyASCIIObject structure.
Would you mind to elaborate?
It's a different API. In C, In C, it's rare that C code uses a length with a |
OK! Will you do that?
The API that existed with this name since before Python 3.0 is -- as you claim -- unsafe. The new semantics are not proven; usually we wait before the API stabilizes before declaring it stable for the long term. |
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters. The change avoids to have to include explicitly <string.h> to get the strlen() function when using a recent version of the limited C API.
Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initializes *embed_null* since the string cannot contain a null character. Global static strings now also initialize the *embed_null* member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.
Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initializes *embed_null* since the string cannot contain a null character. Global static strings now also initialize the *embed_null* member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.
Add PyASCIIObject.state.embed_null member to Python str objects. It is used as a cache by PyUnicode_AsUTF8() to only check once if a string contains a null character. Strings created by PyUnicode_FromString() initializes *embed_null* since the string cannot contain a null character. Global static strings now also initialize the *embed_null* member. The chr(0) singleton ("\0" string) is the only static string which contains a null character.
Fix unicode_subtype_new
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8Safe() function should be used instead.
…ython#111585)" This reverts commit d9b606b.
python#111091)" This reverts commit d731579.
…ython#111121)" This reverts commit d8f32be.
* Revert "gh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (#111585)" This reverts commit d9b606b. * Revert "gh-111089: Use PyUnicode_AsUTF8() in getargs.c (#111620)" This reverts commit cde1071. * Revert "gh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (#111091)" This reverts commit d731579. * Revert "gh-111089: Add PyUnicode_AsUTF8() to the limited C API (#111121)" This reverts commit d8f32be. * Revert "gh-111089: Use PyUnicode_AsUTF8() in sqlite3 (#111122)" This reverts commit 37e4e20.
There seems to be multiple disagreements on multiple things around this issue:
When I made these changes, they were approved by my peers and I thought that we were all on the same page on embedded null characters, since Python is already rejecting them in many places such as PyUnicode_AsWideCharString() for many years. But well, then read the discussions for details about disagreements. I reverted all changes (except of PR #111106 which is unrelated) in PR #111833. I close the issue for now. |
For reference I did attempt a test run of our huge "everything" style codebase at work over the weekend with That various things depend on the current (and now restored, thank you!) behavior, regardless of what any of us think of that, was what I wanted it to show. It did. :) |
Any idea what is their use case for embedded null characters? How are such strings build? By decoding bytes from UTF-8? |
Thanks for that feedback/test, it's very useful 👍 |
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
Thank you for the revert. Sorry that the issues were only discovered after merge, and after such heated discussion. Maybe there's some process improvement to be made. I filed this issue for the C-API-WG-to-be: capi-workgroup/api-evolution#39 |
…n#111091) * PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <[email protected]>
…111121) Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.
…#111106) On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.
PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
…1585) Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters. The change avoids to have to include explicitly <string.h> to get the strlen() function when using a recent version of the limited C API.
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters.
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
…n#111091) * PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters. * Update related C API tests (test_capi.test_unicode). * type_new_set_doc() uses PyUnicode_AsUTF8AndSize() to silently truncate doc containing null bytes. Co-authored-by: Serhiy Storchaka <[email protected]>
…111121) Add PyUnicode_AsUTF8() function to the limited C API. multiprocessing posixshmem now uses PyUnicode_AsUTF8() instead of PyUnicode_AsUTF8AndSize(): the extension is built with the limited C API. The function now raises an exception if the filename contains an embedded null character instead of truncating silently the filename.
…#111106) On error, PyUnicode_AsUTF8AndSize() now sets the size argument to -1, to avoid undefined value.
PyUnicode_AsUTF8() now raises an exception if the string contains embedded null characters.
…1585) Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters. The change avoids to have to include explicitly <string.h> to get the strlen() function when using a recent version of the limited C API.
Replace PyUnicode_AsUTF8AndSize() with PyUnicode_AsUTF8() to remove the explicit check for embedded null characters.
* Revert "pythongh-111089: Use PyUnicode_AsUTF8() in Argument Clinic (python#111585)" This reverts commit d9b606b. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in getargs.c (python#111620)" This reverts commit cde1071. * Revert "pythongh-111089: PyUnicode_AsUTF8() now raises on embedded NUL (python#111091)" This reverts commit d731579. * Revert "pythongh-111089: Add PyUnicode_AsUTF8() to the limited C API (python#111121)" This reverts commit d8f32be. * Revert "pythongh-111089: Use PyUnicode_AsUTF8() in sqlite3 (python#111122)" This reverts commit 37e4e20.
I propose to change the
PyUnicode_AsUTF8()
API to raise an exception and return NULL if the string contains embedded null characters.If the string contains an embedded null character, the UTF-8 encoded string can be truncated if used with C functions using
char*
since a null byte is treated as the terminator: marker of the string end. Truncating a string silently is a bad practice and can lead to different bugs including security vulnerabilities.In practice, the minority of impacted C extensions and impacted users should benefit of such backward incompatible change, since truncating a string silently is a bad practice. Impacted users can use
PyUnicode_AsUTF8AndSize(obj, NULL)
and just ignore the size if they want to truncate on purpose.It would address the following "hidden" comment on PyUnicode_AsUTF8():
PyUnicode_AsUTF8String() is part of the limited C API, whereas PyUnicode_AsUTF8() is not.
In the recently added PyUnicode_EqualToUTF8(obj, str), str is treated as not equal if obj contains embedded null characters.
The folllowing functions already raise an exception if the string contains embedded null characters or bytes:
PyUnicode_AsUTF8String() returns a bytes object and so the length, so it doesn't raise the exception.
PyUnicode_AsUTF8AndSize() also returns the size and so don't raise on embedded null characters.
Linked PRs
The text was updated successfully, but these errors were encountered: