Client & server threads block each other due to incorrect eventlet/greenlet imports #130
lathiat added a commit to lathiat/prometheus-openstack-exporter that referenced this issue on Jun 10, 2024:

Currently, slow running OpenStack API Requests (either stuck connecting or still waiting for the actual response) from the periodic DataGatherer task will block HTTPServer connections from being processed. Blocked HTTPServer connections will also block both other connections and the DataGatherer task.

Observed Symptoms:
- Slow or failed prometheus requests
- Statistics not being updated as often as you would expect
- HTTP 500 responses and BrokenPipeError tracebacks being logged due to later trying to respond to prometheus clients which timed out and disconnected the socket
- Hitting the forked process limit

This happens because in the current code, we are intending to use the eventlet library for asynchronous non-blocking I/O, but we are not using it correctly. All code within the main application and all imported dependencies must import the special eventlet "green" versions of many python libraries (e.g. socket, time, threading, SimpleHTTPServer, etc) which yield to other green threads when they would have blocked waiting for I/O or to sleep. Currently this does not always happen.

Fix this by importing eventlet and using eventlet.patcher.monkey_patch() before importing any other modules. This will automatically intercept all future imports (including those inside dependencies) and automatically load the green versions of relevant libraries.

Documentation on correctly importing eventlet can be found here: https://eventlet.readthedocs.io/en/latest/patching.html

A detailed and comprehensive analysis of the issue and multiple previous attempts to fix it can be found in Issue canonical#130. If you intend to make further related changes to the use of eventlet, threads or forked processes please read the detailed history lesson available there.

Fixes: canonical#130, canonical#126, canonical#124, canonical#116, canonical#115, canonical#112
lathiat added a commit to lathiat/prometheus-openstack-exporter that referenced this issue on Jun 12, 2024, with the same commit message.
Currently, slow-running OpenStack API requests (either stuck connecting or still waiting for the actual response) from the periodic DataGatherer task block HTTPServer connections from being processed.
The reverse is also true: a stalled HTTPServer client (e.g. a telnet session that connects but never sends a request) blocks both the DataGatherer task and the processing of other HTTPServer connections.
Observed Symptoms
- Slow or failed prometheus requests
- Statistics not being updated as often as you would expect
- HTTP 500 responses and BrokenPipeError tracebacks being logged due to later trying to respond to prometheus clients which timed out and disconnected the socket
- Hitting the forked process limit
Cause
This happens because the current code intends to use the eventlet library for asynchronous non-blocking I/O but does not use it correctly.
All code within the main application and all imported dependencies must import the special eventlet "green" versions of many Python libraries (e.g. socket, time, threading, SimpleHTTPServer) which yield to other green threads when they would otherwise block waiting for I/O or sleep. Currently this is not always done, and as a result we often block other tasks from running.
In the past we also tried to avoid eventlet by using a threaded/forked model. However, the python cinderclient library imports the green eventlet.sleep (which we were not aware of, and which I believe is a bug), so we would sometimes get the error "greenlet.error: cannot switch to a different thread".
Fix
Fix this by ensuring the entire application correctly uses eventlet and green patched functions: import eventlet and call eventlet.patcher.monkey_patch() before importing any other modules. This replaces the relevant standard library modules in place, so every later import (including those inside dependencies) gets the green version of a library.
Testing
To verify that the fix works, you can:
- Block access to the Nova API (this causes connect to hang for 120 seconds) using this firewall command:
  iptables -I OUTPUT -p tcp -m state --state NEW --dport 8774 -j DROP
- Make many concurrent and repeated requests using siege:
  while true; do siege http://172.16.0.30:9183/metrics -t 5s -c 5 -d 0.1; done
When testing with these changes, I never see us block a server or client connection and all requests take a few milliseconds at most, whether or not the client requests are slow or we open a connection to the server that doesn't send a request.
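The stalled-client case (a connection that never sends a request) can also be checked with a small script. This is a sketch using only the standard library; the function name is hypothetical, and the URL you pass would be the exporter address from the siege command above.

```python
import http.client
import socket
import time
from urllib.parse import urlsplit


def time_request_with_stalled_client(url, timeout=10.0):
    """Hold an idle connection open, then time one GET of url (in seconds)."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 80
    # Stalled client: connect, then never send a request line.
    with socket.create_connection((host, port), timeout=timeout):
        start = time.monotonic()
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("GET", parts.path or "/")
            conn.getresponse().read()
        finally:
            conn.close()
        return time.monotonic() - start
```

For example, time_request_with_stalled_client("http://172.16.0.30:9183/metrics") should return quickly with the fix applied. Against a server that blocks on the idle connection, I would expect the call to raise a timeout error instead.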
History Lesson
There have been multiple incorrect attempts to solve this and some related problems. To try and avoid any further such problems, I have comprehensively documented the historical issues and why those fixes have not worked below, both for my understanding and yours :)
eventlet implements asynchronous "non-blocking" socket I/O without any code changes to the application and without using real pthreads by using co-operative "green threads" from the greenlet library.
For this to work correctly, eventlet needs to replace many Python standard libraries (e.g. socket, time, threading) with an alternative "green" implementation which intentionally yields execution to other green threads any time it would otherwise block, such as when reading data from a file/socket or sleeping.
All code, both within the application and in all imported dependencies, must import these special versions. Any code that doesn't will not yield cooperatively and will block other green threads whenever such a blocking function is called.
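This does not happen automatically. The full details are at https://eventlet.readthedocs.io/en/latest/patching.html, but in brief it can be done with 3 different methods:
- Explicitly import the green version of a module from eventlet.green (e.g. from eventlet.green import socket).
- Import a single module so that its own dependencies are green, using eventlet.patcher.import_patched().
- Globally patch the standard library with eventlet.patcher.monkey_patch() before any other imports.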
The original issue, "Server in deadlock" (#112), found that the process deadlocked with the following error: greenlet.error: cannot switch to a different thread
At the time, we used a native Python Thread for the DataGatherer class and separately used the ForkingHTTPServer to allow both functions to operate simultaneously with real threads/processes.
We did not intend to use eventlet/green threads at all, however, the python-cinderclient library incorrectly imports eventlet.sleep which results in sometimes using green threads accidentally, hence the error.
We attempted to fix that in "Use eventlet green thread instead of regular thread" (#115) by importing the green version of threading.Thread explicitly. This avoided the "cannot switch to a different thread" issue by only using green threads and not mixing Python threads and green threads in the same process.
After merging #115 it was found that the HTTPServer loop never co-operatively yielded to the DataGatherer's thread and the stats were never updated.
To fix this, "Fix eventlet patch that was not working right" (#116) imported the green version of socket, asyncore and time and also littered a few sleep(0) calls around to force co-operative yielding at various points.
This solution was not complete, because it only imported the green version of some libraries, in some call paths, and hacked in some extra yields here and there.
In "Fix broken symlink python3 and unmet dependencies warnings." (#124) we switched from ForkingHTTPServer to the normal HTTPServer because sometimes it would fork too many servers and hit the process or system-wide process limit.
Though not noted elsewhere, when I reproduce this issue by connecting many clients using the tool siege to a server where I firewalled the nova API connections, I can see that all of those processes are defunct and not actually alive. This is most likely because the process is blocked and the calls to waitpid which would reap them never happen.
Since we are not using the eventlet version of http.server.HTTPServer, without the forked model, we now block anytime we are handling a server request.
Additionally, anytime the DataGatherer green thread calls out through the OpenStack API libraries, it uses non-patched versions of socket/requests/urllib3 and also blocks the HTTPServer which is now inside the same process.
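The defunct processes seen under the forking model can be reproduced in isolation. The following POSIX-only sketch (function names are hypothetical) forks a child that exits immediately and shows that its process table entry persists until the parent calls waitpid. It illustrates the suspected mechanism and is not taken from the exporter's code.

```python
import os
import time


def process_entry_exists(pid):
    """Return True while the kernel still holds a process table entry for pid."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    return True


def zombie_demo():
    """Fork a child that exits at once; report its entry before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping cleanup handlers
    time.sleep(0.5)  # parent: give the child time to exit, but do not wait on it
    before = process_entry_exists(pid)  # defunct child still occupies a slot
    os.waitpid(pid, 0)  # reap the child, releasing the slot
    after = process_entry_exists(pid)
    return before, after
```

A parent that is stuck in a blocking call never reaches the waitpid step, so each forked handler would linger as a defunct entry and count against the process limit.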