Connection reset during max-requests auto-restart with gthread #3038
I've found a related issue. For example, with gunicorn --worker-class gthread --max-requests 4 --threads 4 myapp:app, we can reproduce a consistent connection reset with only five HTTP requests.
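The myapp module in that command isn't shown in the thread; any trivial WSGI application reproduces it, for example this hypothetical sketch:

```python
# myapp.py -- hypothetical minimal WSGI app for the reproduction command above.
def app(environ, start_response):
    # Respond 200 to every request; the reset happens regardless of the body.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]
```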
|
We are seeing this when making API calls to NetBox as well. The gunicorn version is 21.2.0. NGINX logs randomly show an upstream error "104: Connection reset by peer", which we correlate with "Autorestarting worker after current request." in the gunicorn logs. |
@MANT5149, I'm in the same boat. After upgrading NetBox to 3.5.7 (which includes gunicorn 21.2.0), we're seeing the same issue when the autorestart happens:
Nginx error log:
Downgrading Gunicorn to 20.1.0 fixes this issue. |
Have y'all tried the patch in #3039? |
Thanks for the report. It's expected to have the worker restart there. If a request has already landed in the worker while it's closing, that may happen. Maybe the latest change, which accepts requests faster, triggers it. Can you give an idea of the number of concurrent requests gunicorn receives in the usual case? Besides that, why do you set max-requests so low? This option should only be used as a workaround for temporary memory issues in the application. |
The worker restart is expected, but after 0ebb73a (#2918) every request runs through the
This means that during a max-requests restart, we call
Before the change, if you sent two requests around the same time you'd see:
After the change, this becomes:
This bug only requires two concurrent requests to a worker, but I'll often have ~4 concurrent requests per worker, and in that case one request completes and the rest have their connections reset.
That's a minimal reproducible example, not the real application + configuration I'm using in production. I'm using a range from 1,000 to 1,000,000 depending on the exact application / deployment, and completely agree that |
@christianbundy, I've tried it and it appears to fix this issue for me |
#3039 fixes this issue BUT for me the idle load goes up from 0.x to 5! |
Are these CPU utilization percentages? I.e. does CPU load increase from <1% to 5%? This makes sense, because previously we were blocking and waiting on IO for two seconds while waiting for new connections, but now we're looping to check for either 'new connections' or 'new data on a connection'. |
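For intuition about why that earlier approach raised idle load, here is a generic illustration (not gunicorn's actual worker loop): a select with a timeout sleeps in the kernel until something is ready, while a zero-timeout poll loop keeps re-checking the same sockets on the CPU.

```python
# Illustrative comparison only; not gunicorn's code.
import selectors
import socket
import time

sel = selectors.DefaultSelector()
listener = socket.create_server(("127.0.0.1", 0))
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)

# Blocking style: the thread sleeps for up to 2 seconds waiting for readiness,
# so idle CPU usage stays near zero.
events = sel.select(timeout=2.0)

# Busy-polling style: the loop spins re-checking the selector, which shows up
# as elevated idle load like the numbers reported above.
deadline = time.monotonic() + 2.0
while time.monotonic() < deadline:
    if sel.select(timeout=0):
        break

sel.close()
listener.close()
```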
Thanks for the context @r-lindner -- admittedly I don't think I understand the difference between the 'fix from #3038' and the 'fix from #3039', but I can see that the CPU utilization is significantly higher. I've just pushed a commit to my PR branch that resolves the issue on my machine; can you confirm whether it works for you? It's a significantly different approach, which I haven't tested as thoroughly, but it seems to resolve the connection resets and also keeps the CPU utilization low.
Before: idles at ~10% CPU
After: idles at ~0.01% CPU |
This is the 'fix from #3038', and the 'fix from #3039' was the pull request without the changes from Aug 26. I am now using the updated #3039 without CPU issues. Due to changes I made a week ago I cannot test whether the original bug is fixed, but I guess you already tested this :-) So this looks good. |
When fronting gunicorn 20.1.0 with nginx 1.23.3, we observe "connection reset by peer" errors in Nginx that correlate with gthread worker auto-restarts. #1236 seems related; it describes an issue specifically with keepalive connections. That issue is older and I am unsure of its current state, but this comment implies an ongoing issue. Note that the original reproduction steps in this issue, 3038, have keepalive enabled by default. When we disable keepalives in gunicorn, we observe a latency regression, but it does stop the connection reset errors. Should there be documented guidance, for now, not to use
As far as I can see, options for consumers are:
|
We face this issue exactly as described. Thanks for reporting it and for the ongoing work on this. Is there an ETA for the publication of a fix? |
Same issue here. I swapped to gthread workers from sync, and randomly, my server just stopped taking requests. Reverted back to sync for now. |
We are also running into this issue after a NetBox upgrade. Downgrading gunicorn to 20.1.0 fixes it for the moment but a proper fix would be appreciated. |
We are also running into this problem after upgrading Netbox from 3.4.8 to 3.6.9, which moves gunicorn from 20.1.0 to 21.2.0. One of the heavier scripts works flawlessly on Netbox 3.4.8 (gunicorn 20.1.0), but on 3.6.9 (gunicorn 21.2.0) it fails with the message below, and it has not failed at the exact same place twice:
/var/log/nginx/error.log:
gunicorn log:
Versions: Linux vmnb02-test 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Is there a release underway to fix this, or should we still refrain from upgrading? The current state of gunicorn dictates that it is not production-worthy. :( |
I have been testing a few things and this is my finding. Example:
If I set max_requests to 0, disabling it, my scripts work without error. But is this preferable to having gunicorn processes restart regularly? I suppose it would start consuming memory, if the application has memory leaks, that is. Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting the gunicorn worker processes) once a day would do the trick? |
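For reference, a sketch of how that would look in a gunicorn config file (the file name and the other values are illustrative, not the poster's actual configuration):

```python
# gunicorn.conf.py -- illustrative values only
bind = "127.0.0.1:8000"
workers = 4
worker_class = "gthread"
threads = 4
max_requests = 0  # 0 disables max-requests, so no auto-restart (and no reset)
```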
I have come to the conclusion that, rather than downgrade gunicorn and maybe lose some necessary features, I will go ahead with max_requests set to 0, and if memory usage becomes an issue on the server I will set up a scheduled job that restarts the worker processes with this command:
ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2 )}'
|
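If that route is taken, the scheduled job could simply be a cron entry wrapping the same command; a sketch (the run time is arbitrary):

```sh
# Illustrative crontab line: once a day, HUP the gunicorn master so workers
# are replaced gracefully.
0 4 * * * ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2) }'
```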
Just don't pass the max_requests option? I never use it myself. It's only for when I have a temporary memory leak, and never in production.
|
For the record, there exists a variety of situations where memory leaks are difficult to address:
|
We stayed on a lower release version to avoid this issue. However, we have to upgrade due to the HTTP Request Smuggling vulnerability (CVE-2024-1135). Has anyone been able to successfully work around this issue (short of turning off max-requests)? |
@rajivramanathan don't use max-requests? Max requests is there for the worst case, when your application leaks. |
Benoît, I think we understand your advice, but many apps may find themselves in the "application leaks and we can't do much about it" place, hence the usefulness of max-requests. |
We have NGINX in front of Gunicorn, so we addressed it by setting up multiple instances of Gunicorn running upstream, listening on different ports, and using the http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream configuration to try the next upstream if we encounter a 502 error. |
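A rough sketch of that kind of nginx setup (ports, upstream name, and retry limit here are illustrative, not the poster's actual configuration):

```nginx
# Two gunicorn instances behind one upstream; retry the other instance when a
# worker restart surfaces as a connection error or a 502.
upstream gunicorn_pool {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;

    location / {
        proxy_pass http://gunicorn_pool;
        proxy_next_upstream error http_502;
        proxy_next_upstream_tries 2;
    }
}
```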
Hello, |
Just for the record, there exists a situation where the Python default memory allocator produces (under specific circumstances) very fragmented arenas which leads to the interpreter not giving back unused memory - and this might be our case. Using jemalloc (see https://zapier.com/engineering/celery-python-jemalloc/) may alleviate the issue. We are considering this, too. |
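For anyone exploring that route, jemalloc is typically swapped in via LD_PRELOAD when starting gunicorn. A sketch, assuming the Debian/Ubuntu libjemalloc2 package path and a placeholder myapp:app:

```sh
# Preload jemalloc so the process allocates through it instead of the default
# allocator; adjust the .so path for your distribution.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
    gunicorn --workers 4 --worker-class gthread myapp:app
```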
I've run into this in production, so I would love to see the patch merged. I've pulled it in and have seen a reduction in 503s from the connection resets. |
@christianbundy, thank you for your explanation and patch #3039. I faced the same problem, and I've been thinking about this situation for a long time. On the one hand, there are projects where Gunicorn is used without a reverse proxy; their users have faced the blocking problem in #2917. On the other hand, there are problems with connection resets after restart: in projects that use Gunicorn together with a reverse proxy, the behavior before change #2918 is the expected one. I think we need a compromise in the form of an additional setting. By default, the worker behavior would be the same as before change #2918, but if it is necessary to handle many speculative connections, users could enable the option.
@benoitc, what do you think about this? |
We're observing intermittent HTTP 502s in production, which seem to be correlated with the "autorestarting worker after current request" log line, and are less frequent as we increase max_requests. I've reproduced this on 21.2.0 and 20.1.0, but it doesn't seem to happen in 20.0.4. I've produced a minimal reproduction case following the gunicorn.org example as closely as possible, but please let me know if there are other changes you'd recommend:
Application
Configuration
Reproduction
Quickstart
For convenience, I've packaged this into a single command that consistently reproduces the problem on my machine. If you have Docker installed, this should Just Work™️.
Example
Logs
Expected
I'd expect to receive an HTTP 200 for each request, regardless of the max-requests restart. We should see [DEBUG] GET /11 when the worker handles the 11th request.
Actual
The reproduction script sends GET /11, but the worker never sees it, and we see a connection reset instead. The repro script reports a status code of 000, but that's just a quirk of libcurl. I've used tcpdump and can confirm the RST.
In case it's useful, I've also seen curl: (52) Empty reply from server, but it happens less frequently and I'm not 100% sure that it's the same problem.
Workaround
Increasing max-requests makes this happen less frequently, but the only way to resolve it is to disable max-requests (or maybe switch to a different worker type?). Increasing the number of workers or threads doesn't seem to resolve the problem either, from what I've seen.