Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: Data race when using static variables with free-threaded Python #5489

Open
2 of 3 tasks
rostan-t opened this issue Jan 7, 2025 · 7 comments · May be fixed by #5494
Open
2 of 3 tasks

[BUG]: Data race when using static variables with free-threaded Python #5489

rostan-t opened this issue Jan 7, 2025 · 7 comments · May be fixed by #5494
Labels
triage New bug, unverified

Comments

@rostan-t
Copy link

rostan-t commented Jan 7, 2025

Required prerequisites

What version (or hash if on master) of pybind11 are you using?

v2.13.6

Problem description

With free-threaded Python, I get data races when using static class variables defined like this:

#include <pybind11/pybind11.h>

namespace py = pybind11;

class Foo {
};

Foo Bar;

PYBIND11_MODULE(foo, m, py::mod_gil_not_used()) {
    py::class_<Foo>(m, "Foo")
        .def_readonly_static("Bar", &Bar);
}

I get multiple kinds of errors when accessing Foo.Bar from multiple threads, from segmentation faults to "pybind11_object_dealloc(): Tried to deallocate unregistered instance!" exceptions.

Further investigation suggested that it might come from Foo.Bar being freed too many times. I'm unsure if it's a pybind11 or CPython issue so I'm posting it here.

Reproducible example code

import threading

from foo import Foo


def create_foo():
    for _ in range(1000):
        Foo.Bar


# keeping a reference to Foo.Bar prevents the issue from happening
# bar = Foo.Bar

nb_threads = 10
threads = [threading.Thread(target=create_foo) for _ in range(nb_threads)]

for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

Is this a regression? Put the last known working version here if it is.

Not a regression

@rostan-t rostan-t added the triage New bug, unverified label Jan 7, 2025
@khotalota
Copy link

This appears to stem from unsafe memory management and reference counting.
When multiple thread access Foo.Bar, pybind creats temproray python wrappers objects and if reference counts are decremented concurrently or improperly then it lead to Premature deallocation, Segmentation fault, expections that you have

So, storing a persistence reference to Foo.bar that you did prevents this reference count from dropping to zero.
However this is just a workaround

To make it thread safe try

  1. std::shared_ptr : it can manage the lifecycle of the static variable safely. Something like below, if this works 🥲
#include <pybind11/pybind11.h>
#include <memory>
#include <mutex>

namespace py = pybind11;

class Foo {};

// Use a shared pointer for Bar
std::shared_ptr<Foo> Bar = std::make_shared<Foo>();
std::mutex bar_mutex;

PYBIND11_MODULE(foo, m) {
  py::class_<Foo>(m, "Foo")
      .def_property_static(
          "Bar",
          [](py::object) {
              std::lock_guard<std::mutex> lock(bar_mutex); // Thread-safe access
              return Bar;
          },
          [](py::object, std::shared_ptr<Foo> new_bar) {
              std::lock_guard<std::mutex> lock(bar_mutex); // Thread-safe modification
              Bar = new_bar;
          },
          py::return_value_policy::reference
      );
}
  1. And also add Mutex for synchronization

This approach should eliminate the race conditions, segmentation faults, and exceptions while ensuring compatibility with free-threaded Python.

@rwgk
Copy link
Collaborator

rwgk commented Jan 13, 2025

Did you already try the reproducer with the latest state of the master branch? In particular, #5419 was merged after the 2.13.6 release (I'm totally not sure if that's related to your issue, but it's a quick experiment).

@colesbury for awareness, JIC this rings any bells.

@rostan-t
Copy link
Author

Thanks for the reply! I tried again on master and it doesn't solve the issue.

@colesbury
Copy link
Contributor

I'm able to reproduce the failure. I'll look into it.

@colesbury
Copy link
Contributor

Ok, I think I understand the issue. There's a race between object deallocation (pybind11_object_dealloc) and retrieving the instance from the instance map (find_registered_python_instance).

There's a brief period of time when the instance's reference count is zero, but it's not yet removed from the instance map. During this time, another thread may retrieve the instance from the map and increment its reference count) Deallocation still proceeds as normal, but now there's an instance that thinks it's registered but is not in the instance map. This crashes on subsequent deallocation calls.

In CPython, we use _Py_TryIncref for similar cases. That's not a public API, but we can work on making it available (probably as an "unstable" API). Basically, find_registered_python_instance should only return an instance if that instance's reference count is not zero (but do so in a thread safe way).

@szalpal
Copy link

szalpal commented Jan 13, 2025

Thank you @colesbury for looking into this. Would this request be a similar thing we are trying to achieve? python/cpython#113920

@colesbury
Copy link
Contributor

Yes

colesbury added a commit to colesbury/pybind11 that referenced this issue Jan 15, 2025
In the free threading build, there's a race between wrapper re-use and
wrapper deallocation. This can happen with a static variable accessed by
multiple threads.

Fixing this requires using some private CPython APIs: _Py_TryIncref and
_PyObject_SetMaybeWeakref. The implementations of these functions are
included until they're made available as public ("unstable") APIs.

Fixes pybind#5489
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage New bug, unverified
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants