Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection reset during max-requests auto-restart with gthread #3038

Open
christianbundy opened this issue Jul 21, 2023 · 32 comments · May be fixed by #3340 or #3039
Open

Connection reset during max-requests auto-restart with gthread #3038

christianbundy opened this issue Jul 21, 2023 · 32 comments · May be fixed by #3340 or #3039
Assignees

Comments

@christianbundy
Copy link

christianbundy commented Jul 21, 2023

We're observing intermittent HTTP 502s in production, which seem to be correlated with the "autorestarting worker after current request" log line, and are less frequent as we increase max_requests. I've reproduced this on 21.2.0 and 20.1.0, but it doesn't seem to happen in 20.0.4.

I've produced a minimal reproduction case following the gunicorn.org example as closely as possible, but please let me know if there are other changes you'd recommend:

Application

def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])

Configuration

gunicorn --max-requests 10 --worker-class gthread myapp:app

Reproduction

#!/usr/bin/env bash

set -euo pipefail

# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &

# wait for gunicorn to be ready
sleep 5

# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
    curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /${i}\n" "127.0.0.1:8000/${i}" || true
done

Quickstart

For convenience, I've packaged this into a single command that consistently reproduces the problem on my machine. If you have Docker installed, this should Just Work™️.

cat <<DOCKER_EOF | docker build -t gunicorn-repro-max-requests-rst - && docker run --rm gunicorn-repro-max-requests-rst
FROM python:3.11

RUN pip install --disable-pip-version-check --no-cache gunicorn

RUN cat <<EOF > myapp.py
def app(environ, start_response):
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data)))
    ])
    return iter([data])
EOF

RUN cat <<EOF > repro.sh
#!/usr/bin/env bash

set -euo pipefail

# start gunicorn
gunicorn --max-requests 10 --log-level debug --worker-class gthread myapp:app &

# wait for gunicorn to be ready
sleep 5

# send 20 sequential HTTP requests with minimal output
for i in {1..20}; do
    curl -S -s -o /dev/null -w "[REPRO]: Received HTTP %{http_code} from GET /\\\${i}\n" "127.0.0.1:8000/\\\${i}" || true
done
EOF

RUN chmod +x repro.sh

CMD ["./repro.sh"]

DOCKER_EOF

Example

Logs

[2023-07-21 00:20:10 +0000] [6] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['127.0.0.1:8000']
  backlog: 2048
  workers: 1
  worker_class: gthread
  threads: 1
  worker_connections: 1000
  max_requests: 10
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  logconfig_json: None
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: myapp:app
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0xffff935672e0>
  on_reload: <function OnReload.on_reload at 0xffff93567420>
  when_ready: <function WhenReady.when_ready at 0xffff93567560>
  pre_fork: <function Prefork.pre_fork at 0xffff935676a0>
  post_fork: <function Postfork.post_fork at 0xffff935677e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0xffff93567920>
  worker_int: <function WorkerInt.worker_int at 0xffff93567a60>
  worker_abort: <function WorkerAbort.worker_abort at 0xffff93567ba0>
  pre_exec: <function PreExec.pre_exec at 0xffff93567ce0>
  pre_request: <function PreRequest.pre_request at 0xffff93567e20>
  post_request: <function PostRequest.post_request at 0xffff93567ec0>
  child_exit: <function ChildExit.child_exit at 0xffff92db8040>
  worker_exit: <function WorkerExit.worker_exit at 0xffff92db8180>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0xffff92db82c0>
  on_exit: <function OnExit.on_exit at 0xffff92db8400>
  ssl_context: <function NewSSLContext.ssl_context at 0xffff92db85e0>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2023-07-21 00:20:10 +0000] [6] [INFO] Starting gunicorn 21.2.0
[2023-07-21 00:20:10 +0000] [6] [DEBUG] Arbiter booted
[2023-07-21 00:20:10 +0000] [6] [INFO] Listening at: http://127.0.0.1:8000 (6)
[2023-07-21 00:20:10 +0000] [6] [INFO] Using worker: gthread
[2023-07-21 00:20:10 +0000] [8] [INFO] Booting worker with pid: 8
[2023-07-21 00:20:10 +0000] [6] [DEBUG] 1 workers
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /1
[REPRO]: Received HTTP 200 from GET /1
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /2
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /2
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /3
[REPRO]: Received HTTP 200 from GET /3
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /4
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /4
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /5
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /5
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /6
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /6
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /7
[REPRO]: Received HTTP 200 from GET /7
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /8
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /8
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /9
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /9
[2023-07-21 00:20:15 +0000] [8] [DEBUG] GET /10
[2023-07-21 00:20:15 +0000] [8] [INFO] Autorestarting worker after current request.
[2023-07-21 00:20:15 +0000] [8] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /10
curl: (56) Recv failure: Connection reset by peer
[2023-07-21 00:20:15 +0000] [8] [INFO] Worker exiting (pid: 8)
[REPRO]: Received HTTP 000 from GET /11
[2023-07-21 00:20:15 +0000] [22] [INFO] Booting worker with pid: 22
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /12
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /12
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /13
[REPRO]: Received HTTP 200 from GET /13
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /14
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /14
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /15
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /15
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /16
[REPRO]: Received HTTP 200 from GET /16
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /17
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /17
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /18
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /18
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /19
[REPRO]: Received HTTP 200 from GET /19
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[2023-07-21 00:20:15 +0000] [22] [DEBUG] GET /20
[2023-07-21 00:20:15 +0000] [22] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /20

Expected

I'd expect to receive an HTTP 200 for each request, regardless of the max-requests restart. We should see [DEBUG] GET /11 when the worker handles the 11th request.

Actual

The reproduction script sends GET /11, but the worker never sees it, and we see a connection reset instead. The repro script reports a status code of 000, but that's just a quirk of libcurl. I've used tcpdump and can confirm the RST.

In case it's useful, I've also seen curl: (52) Empty reply from server, but it happens less frequently and I'm not 100% sure that it's the same problem.

Workaround

Increasing max-requests makes this happen less frequently, but the only way to resolve it is to disable max-requests (or maybe switch to a different worker type?). Increasing the number of workers or threads doesn't seem to resolve the problem either, from what I've seen.

@christianbundy
Copy link
Author

Maybe useful, here's a hacky fix I put together locally: edit workers/gthread.py so that accept() doesn't process accept new requests from the socket after force_close() is called on the worker:

image

(Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)

New output:

[2023-07-21 02:51:31 +0000] [641] [INFO] Starting gunicorn 21.2.0
[2023-07-21 02:51:31 +0000] [641] [DEBUG] Arbiter booted
[2023-07-21 02:51:31 +0000] [641] [INFO] Listening at: http://127.0.0.1:8000 (641)
[2023-07-21 02:51:31 +0000] [641] [INFO] Using worker: gthread
[2023-07-21 02:51:31 +0000] [643] [INFO] Booting worker with pid: 643
[2023-07-21 02:51:31 +0000] [641] [DEBUG] 1 workers
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /1
[REPRO]: Received HTTP 200 from GET /1
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /2
[REPRO]: Received HTTP 200 from GET /2
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /3
[REPRO]: Received HTTP 200 from GET /3
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /4
[REPRO]: Received HTTP 200 from GET /4
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /5
[REPRO]: Received HTTP 200 from GET /5
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /6
[REPRO]: Received HTTP 200 from GET /6
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /7
[REPRO]: Received HTTP 200 from GET /7
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /8
[REPRO]: Received HTTP 200 from GET /8
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /9
[REPRO]: Received HTTP 200 from GET /9
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [643] [DEBUG] GET /10
[2023-07-21 02:51:36 +0000] [643] [INFO] Autorestarting worker after current request.
[2023-07-21 02:51:36 +0000] [643] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /10
[2023-07-21 02:51:36 +0000] [643] [DEBUG] accept() called when alive=False
[2023-07-21 02:51:36 +0000] [643] [INFO] Worker exiting (pid: 643)
[2023-07-21 02:51:36 +0000] [656] [INFO] Booting worker with pid: 656
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /11
[REPRO]: Received HTTP 200 from GET /11
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /12
[REPRO]: Received HTTP 200 from GET /12
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /13
[REPRO]: Received HTTP 200 from GET /13
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /14
[REPRO]: Received HTTP 200 from GET /14
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /15
[REPRO]: Received HTTP 200 from GET /15
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connectio
n. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /16
[REPRO]: Received HTTP 200 from GET /16
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /17
[REPRO]: Received HTTP 200 from GET /17
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /18
[REPRO]: Received HTTP 200 from GET /18
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /19
[REPRO]: Received HTTP 200 from GET /19
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection. 
[2023-07-21 02:51:36 +0000] [656] [DEBUG] GET /20
[2023-07-21 02:51:36 +0000] [656] [INFO] Autorestarting worker after current request.
[2023-07-21 02:51:36 +0000] [656] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /20

@christianbundy
Copy link
Author

christianbundy commented Jul 21, 2023

I've found a related issue, which is not resolved by #3039, although it'd be unlikely to be deployed in the wild: when max_requests >= threads we see the same connection reset error.

For example:

gunicorn --worker-class gthread --max-requests 4 --threads 4 myapp:app

With this config, we can reproduce a consistent connection reset with only five HTTP requests.

[2023-07-21 15:59:17 +0000] [7] [DEBUG] Current configuration:
  config: ./gunicorn.conf.py
  wsgi_app: None
  bind: ['127.0.0.1:8000']
  backlog: 2048
  workers: 1
  worker_class: gthread
  threads: 4
  worker_connections: 1000
  max_requests: 4
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  print_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 0
  group: 0
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  logconfig_json: None
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: myapp:app
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0xffff9b1c72e0>
  on_reload: <function OnReload.on_reload at 0xffff9b1c7420>
  when_ready: <function WhenReady.when_ready at 0xffff9b1c7560>
  pre_fork: <function Prefork.pre_fork at 0xffff9b1c76a0>
  post_fork: <function Postfork.post_fork at 0xffff9b1c77e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0xffff9b1c7920>
  worker_int: <function WorkerInt.worker_int at 0xffff9b1c7a60>
  worker_abort: <function WorkerAbort.worker_abort at 0xffff9b1c7ba0>
  pre_exec: <function PreExec.pre_exec at 0xffff9b1c7ce0>
  pre_request: <function PreRequest.pre_request at 0xffff9b1c7e20>
  post_request: <function PostRequest.post_request at 0xffff9b1c7ec0>
  child_exit: <function ChildExit.child_exit at 0xffff9aa18040>
  worker_exit: <function WorkerExit.worker_exit at 0xffff9aa18180>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0xffff9aa182c0>
  on_exit: <function OnExit.on_exit at 0xffff9aa18400>
  ssl_context: <function NewSSLContext.ssl_context at 0xffff9aa185e0>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2023-07-21 15:59:17 +0000] [7] [INFO] Starting gunicorn 21.2.0
[2023-07-21 15:59:17 +0000] [7] [DEBUG] Arbiter booted
[2023-07-21 15:59:17 +0000] [7] [INFO] Listening at: http://127.0.0.1:8000 (7)
[2023-07-21 15:59:17 +0000] [7] [INFO] Using worker: gthread
[2023-07-21 15:59:17 +0000] [9] [INFO] Booting worker with pid: 9
[2023-07-21 15:59:17 +0000] [7] [DEBUG] 1 workers
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /1
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /1
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /2
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /2
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /3
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection. 
[REPRO]: Received HTTP 200 from GET /3
[2023-07-21 15:59:22 +0000] [9] [DEBUG] GET /4
[2023-07-21 15:59:22 +0000] [9] [INFO] Autorestarting worker after current request.
[2023-07-21 15:59:22 +0000] [9] [DEBUG] Closing connection.
[REPRO]: Received HTTP 200 from GET /4
curl: (56) Recv failure: Connection reset by peer
[2023-07-21 15:59:22 +0000] [9] [INFO] Worker exiting (pid: 9)
[REPRO]: Received HTTP 000 from GET /5

@benoitc benoitc self-assigned this Jul 23, 2023
@christianbundy
Copy link
Author

After a git bisect, it looks like the culprit is 0ebb73a (#2918).

@MANT5149
Copy link

MANT5149 commented Aug 8, 2023

We are seeing this when making API calls to NetBox as well. gunicorn version is 21.2.0. NGINX logs randomly show an upstream error "104: Connection reset by peer" which we correlate with "Autorestarting worker after current request." in gunicorn logs.

@karlism
Copy link

karlism commented Aug 10, 2023

@MANT5149, I'm in the same boat as well, after NetBox upgrade to 3.5.7 (which includes gunicorn 21.2.0) we're seeing same issue when autorestart is happening:

Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Autorestarting worker after current request.
Aug 10 11:40:43 hostname gunicorn[595183]: [2023-08-10 11:40:43 +0000] [595183] [INFO] Worker exiting (pid: 595183)
Aug 10 11:40:44 hostname gunicorn[613565]: [2023-08-10 11:40:44 +0000] [613565] [INFO] Booting worker with pid: 613565
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Autorestarting worker after current request.
Aug 10 12:33:36 hostname gunicorn[622501]: [2023-08-10 12:33:36 +0000] [622501] [INFO] Worker exiting (pid: 622501)
Aug 10 12:33:36 hostname gunicorn[639160]: [2023-08-10 12:33:36 +0000] [639160] [INFO] Booting worker with pid: 639160
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Autorestarting worker after current request.
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Worker exiting (pid: 579373)
Aug 10 13:00:07 hostname gunicorn[648814]: [2023-08-10 13:00:07 +0000] [648814] [INFO] Booting worker with pid: 648814

Nginx error log:

2023/08/10 11:40:43 [error] 1092#0: *744453 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"
2023/08/10 12:33:36 [error] 1092#0: *776456 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.xxx.xxx, server: netbox.example.com, request: "GET /api/virtualization/clusters/46/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/virtualization/clusters/46/", host: "netbox-api.example.com"
2023/08/10 13:00:06 [error] 1092#0: *787694 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.xxx.xxx, server: netbox.example.com, request: "GET /api/tenancy/tenants/19/ HTTP/1.1", upstream: "http://127.0.0.1:8001/api/tenancy/tenants/19/", host: "netbox-api.example.com"

Downgrading Gunicorn to 20.1.0 fixes this issue.

@christianbundy
Copy link
Author

Have y'all tried the patch in #3039?

@benoitc
Copy link
Owner

benoitc commented Aug 12, 2023

thanks for the report. I'ts expected to have the worker restart there. If a request already land in the worker when its closing that may happen. Maybe latest change accepting the requests faster trigger it. Can you give an idee of the number of concurrent request gunicorn is receiving also in usial case?

Beside that why do you put a max request so low? This function should only be used as a work around some temporarymemory issues in the application.

@christianbundy
Copy link
Author

christianbundy commented Aug 12, 2023

The worker restart is expected, but after 0ebb73a (#2918) every request runs through the while self.alive loop twice:

  1. the first time through the loop it calls accept() on the socket to create the connection
  2. the second time through the loop it calls recv() on the connection to handle the request and provide a response

This means that during a max-requests restart, we call accept() on requests but never recv(), which means that they aren't added to self.futures to be awaited during the graceful timeout.

Before the change, if you sent two requests around the same time you'd see:

  1. self.alive is True so we:
    1. accept() request A
    2. recv() request A, which sets self.alive = False in handle_request
    3. worker restart
  2. self.alive is False, so we exit the loop and restart the worker
  3. self.alive is True, so we:
    1. accept() request B
    2. recv() request B

After the change, this becomes:

  1. self.alive is True so we accept() request A
  2. self.alive is True so we:
    1. recv() request A, which sets self.alive = False in handle_request
    2. accept() request B
  3. self.alive is False, so we exit the loop and restart the worker (‼️ without handling request B ‼️)

Can you give an idee of the number of concurrent request gunicorn is receiving also in usial case?

This bug only requires two concurrent requests to a worker, but often I'll often have ~4 concurrent requests per worker and this bug means that 1 will be completed and the rest have their connections reset. To highlight the impact, this means that setting max-requests at 10k and averaging 4 concurrent requests will cause 0.03% of HTTP requests to fail. EDIT: I've included a benchmark in #3039 instead of speculating.

Beside that why do you put a max request so low? This function should only be used as a work around some temporarymemory issues in the application.

That's a minimal reproducible example, not the real application + configuration I'm using in production. I'm using a range from 1,000 to 1,000,000 depending on the exact application / deployment, and completely agree that --max-requests 10 is inappropriately low for most scenarios.

@christianbundy
Copy link
Author

@benoitc I just added a benchmark to #3039 showing an error rate of 0.09% when we restart every 1024 requests, in case that's helpful.

@karlism
Copy link

karlism commented Aug 14, 2023

Have y'all tried the patch in #3039?

@christianbundy, I've tried it and it appears to fix this issue for me

@r-lindner
Copy link

#3039 fixes this issue BUT for me the idle load goes up from 0.x to 5!

@christianbundy
Copy link
Author

0.x to 5

Are these CPU utilization percentages? i.e. CPU load goes increases from <1% to 5%? This makes sense because previously we were blocking and waiting on IO for two seconds while waiting for new connections, but now we're looping to check for either 'new connections' or 'new data on a connection'.

@r-lindner
Copy link

no, 5 means 5 cpu cores are 100% utilized (500%), I have 5 workers configured. The arrows point to bumps where I ran a script that fires many requests against a gunicorn server.

netbox_load

@christianbundy
Copy link
Author

Thanks for the context @r-lindner -- admittedly I don't think I understand the difference between 'fix from #3038' and 'fix from #3039', but I can see that the CPU utilization is significantly higher.

I've just pushed a commit to my PR branch that resolves the issue on my machine, can you confirm whether it works for you? It's a significantly different approach, which I haven't tested as thoroughly, but it seems to resolve the connection resets and also keeps the CPU utilization low.

Before: idles at ~10% CPU

Screenshot 2023-08-26 at 1 28 38 PM

After: idles at ~0.01% CPU

Screenshot 2023-08-26 at 1 28 54 PM

@r-lindner
Copy link

Maybe useful, here's a hacky fix I put together locally: edit workers/gthread.py so that accept() doesn't process accept new requests from the socket after force_close() is called on the worker:
image

(Apologies for a crappy vim screenshot, I wanted to test this out before sitting down for dinner.)

New output:

this is the 'fix from #3038' and the 'fix from #3039' was the pull request with out the changes from Aug 26. I am now using the updated #3039 without CPU issues. Due to changes I made a week ago I cannot test if the original bug is fixed but I guess you already tested this :-) So this looks good.

@zdicesare
Copy link

zdicesare commented Nov 4, 2023

When fronting gunicorn 20.1.0 with nginx 1.23.3, we observe "connection reset by peer" errors in Nginx that correlate with gthread worker auto restarts.

#1236 seems related, which describes an issue specifically with keepalive connections. That issue is older and I am unsure of the current state, but this comment implies an ongoing issue. Note the original reproduction steps in this issue, 3038, have keepalive enabled by default.

When we disable keepalives in gunicorn, we observe a latency regression but it does stop the connection reset errors.

Should there be documented guidance, for now, not to use --max-requests with keepalive + gthread workers?

As far as I can see, options for consumers are:

  • Disable keepalive if using gthread workers + with max requests
  • Possibly use gevent workers instead

@zogzog
Copy link

zogzog commented Nov 16, 2023

We face this issue exactly as described. Thanks for the reporting and ongoing work on this. Is there an ETA for the publication of a fix ?

@jessielw
Copy link

jessielw commented Jan 4, 2024

Same issue here. I swapped to gthread workers from sync, and randomly, my server just stopped taking requests.

Reverted back to sync for now.

@v0tti
Copy link

v0tti commented Jan 4, 2024

We are also running into this issue after a NetBox upgrade. Downgrading gunicorn to 20.1.0 fixes it for the moment but a proper fix would be appreciated.

@nomad-cyanide
Copy link

nomad-cyanide commented Jan 9, 2024

We are also running into this problem after upgrading Netbox from 3.4.8 to 3.6.9, which makes gunicorn go from 20.1.0 to 21.2.0.

One of the heavier scripts works flawlessly on Netbox 3.4.8 (gunicorn 20.1.0), but on 3.6.9 (gunicorn 21.2.0) it fails with the below message and it has not failed at the exact same place twice:

Traceback (most recent call last):
File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 394, in
main()
File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 309, in main
service_entry = nb_update_vs(nb_vs,sat_gw_ip,sat_gw_name,community_name,community_type,interop_hack_ip)
File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 124, in nb_update_vs
nb_service_entry_ip = str(nb_service_entry.ipaddresses[0]).split('/')[0]
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 327, in str
getattr(self, "name", None)
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 303, in getattr
if self.full_details():
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 459, in full_details
self._parse_values(next(req.get()))
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 291, in get
req = self._make_call(add_params=add_params)
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 258, in _make_call
raise RequestError(req)
pynetbox.core.query.RequestError: The request failed with code 502 Bad Gateway but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

/var/log/nginx/error.log:

2024/01/08 14:27:35 [error] 919#919: *110908 recv() failed (104: Unknown error) while reading response header from upstream, client: 10.10.10.74, server: netbox-test.domain.test, request: "GET /api/ipam/ip-addresses/24202/ HTTP/1.1", upstream: "http://10.10.10.5:8001/api/ipam/ip-addresses/24202/", host: "netbox-test.domain.test"

gunicorn log:

Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Autorestarting worker after current request.
Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Worker exiting (pid: 1129991)
Jan 08 14:39:26 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:39:26 +0100] [1139845] [INFO] Booting worker with pid: 1139845
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Autorestarting worker after current request.
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Worker exiting (pid: 1129962)
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1139926]: [2024-01-08 14:44:11 +0100] [1139926] [INFO] Booting worker with pid: 1139926

Versions:
Netbox: 3.6.9
Python: 3.10.12
Redis version=6.0.16
nginx version: nginx/1.18.0 (Ubuntu)
psql (PostgreSQL) 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1)
gunicorn (version 21.2.0)
pynetbox: 7.3.3

Linux vmnb02-test 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Is there a release underway to fix this or should we still refrain from upgrading? Current state of gunicorn dictates that it is not production-worthy. :(

@nomad-cyanide
Copy link

nomad-cyanide commented Jan 9, 2024

We are also running into this problem after upgrading Netbox from 3.4.8 to 3.6.9, which makes gunicorn go from 20.1.0 to 21.2.0.

One of the heavier scripts works flawlessly on Netbox 3.4.8 (gunicorn 20.1.0), but on 3.6.9 (gunicorn 21.2.0) it fails with the below message and it has not failed at the exact same place twice:

Traceback (most recent call last):
File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 394, in
main()
File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 309, in main
service_entry = nb_update_vs(nb_vs,sat_gw_ip,sat_gw_name,community_name,community_type,interop_hack_ip)
File "/prod/scripts/getfacts/ts_s2s_vpn_facts.py", line 124, in nb_update_vs
nb_service_entry_ip = str(nb_service_entry.ipaddresses[0]).split('/')[0]
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 327, in str
getattr(self, "name", None)
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 303, in getattr
if self.full_details():
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/response.py", line 459, in full_details
self._parse_values(next(req.get()))
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 291, in get
req = self._make_call(add_params=add_params)
File "/usr/local/lib/python3.10/dist-packages/pynetbox/core/query.py", line 258, in _make_call
raise RequestError(req)
pynetbox.core.query.RequestError: The request failed with code 502 Bad Gateway but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

/var/log/nginx/error.log:

2024/01/08 14:27:35 [error] 919#919: *110908 recv() failed (104: Unknown error) while reading response header from upstream, client: 10.10.10.74, server: netbox-test.domain.test, request: "GET /api/ipam/ip-addresses/24202/ HTTP/1.1", upstream: "http://10.10.10.5:8001/api/ipam/ip-addresses/24202/", host: "netbox-test.domain.test"

gunicorn log:

Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Autorestarting worker after current request.
Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Worker exiting (pid: 1129991)
Jan 08 14:39:26 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:39:26 +0100] [1139845] [INFO] Booting worker with pid: 1139845
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Autorestarting worker after current request.
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Worker exiting (pid: 1129962)
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1139926]: [2024-01-08 14:44:11 +0100] [1139926] [INFO] Booting worker with pid: 1139926

Versions: Netbox: 3.6.9 Python: 3.10.12 Redis version=6.0.16 nginx version: nginx/1.18.0 (Ubuntu) psql (PostgreSQL) 14.10 (Ubuntu 14.10-0ubuntu0.22.04.1) gunicorn (version 21.2.0) pynetbox: 7.3.3

Linux vmnb02-test 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Is there a release underway to fix this or should we still refrain from upgrading? Current state of gunicorn dictates that it is not production-worthy. :(

I have been testing a few things and this is my finding.
It doesn't help if I decrease the max_requests setting. It just fails sooner.
it doesn't help if I increase the max_requests setting either. It still fails at some point.
When it fails a log-entry about restarting shows up simultaneously:

Example:

an 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Autorestarting worker after current request.
Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Worker exiting (pid: 106082)
Jan 09 13:20:08 30001vmnb02-test gunicorn[106256]: [2024-01-09 13:20:08 +0100] [106256] [INFO] Booting worker with pid: 106256

If I set the max_requests to 0, disabling it, my scripts work. without error. But is this preferable to having gunicorn processes restart regularly. I suppose it would start consuming memory, if it has memory-leak errors that is.

Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting gunicorn worker processes) once a day would do the trick?

@nomad-cyanide
Copy link

If I set the max_requests to 0, disabling it, my scripts work. without error. But is this preferable to having gunicorn processes restart regularly. I suppose it would start consuming memory, if it has memory-leak errors that is.

Perhaps a scheduled restart of the Netbox and netbox-rq services (thereby restarting gunicorn worker processes) once a day would do the trick?

I have come to the conclusion, that rather than downgrade gunicorn of maybe loose some necessary features, I will go ahead with max_requests set i 0 and if memory usage becomes an issue on the server I will set up a scheduled job that restarts the worker processes with this command:

ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2 )}'

@benoitc
Copy link
Owner

benoitc commented Jan 11, 2024 via email

@zogzog
Copy link

zogzog commented Feb 8, 2024

For the record, there exists a variety of situation where the memory leaks are difficult to address:

  • they're costly to track and fix (in our code bases, provided it comes from there)
  • they may be in third-party libs

@rajivramanathan
Copy link

rajivramanathan commented May 4, 2024

We stayed on a lower release version to avoid this issue. However, we have to upgrade due to HTTP Request Smuggling (CVE-2024-1135) vulnerability. Is there anyone able to successfully workaround this issue (short of turning off max-requests)?

@benoitc
Copy link
Owner

benoitc commented May 4, 2024

@rajivramanathan don't use max-requests? Max requests is there for the worst casse when your application leaks.

@zogzog
Copy link

zogzog commented May 4, 2024

Benoît, I think we understand your advice, but many apps may find themselves in the "application leaks and we can't do much about it" place, hence the usefulness of max-requests.

@hansukyang
Copy link

We stayed on a lower release version to avoid this issue. However, we have to upgrade due to HTTP Request Smuggling (CVE-2024-1135) vulnerability. Is there anyone able to successfully workaround this issue (short of turning off max-requests)?

We have NGINX in front of Gunicorn so we addressed it by setting up multiple instances of Gunicorn running upstream listening to different ports and using http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream configuration to try next upstream if we encounter 502 error.

@martinkirch-regmind
Copy link

Hello,
In our setting max-request is very helpful too, and we ran into the same issue when upgrading to gunicorn>=21.
I wanted to mention that #3157 fixes it !

@zogzog
Copy link

zogzog commented Sep 4, 2024

Just for the record, there exists a situation where the Python default memory allocator produces (under specific circumstances) very fragmented arenas which leads to the interpreter not giving back unused memory - and this might be our case. Using jemalloc (see https://zapier.com/engineering/celery-python-jemalloc/) may alleviate the issue. We are considering this, too.
However if #3157 is green, we will be happy to keep using max-requests.

@charlesoconor
Copy link

Have run into this in my production so would love to see the patch merged. Have pulled it in and have seen a reduction in 503s from the connection reset.

@aszubarev
Copy link

@christianbundy, thank you for your explanation and patch #3039. I faced the same problem.

I've been thinking about this situation for a long time.

On the one hand, there are projects where Gunicorn is used without a reverse proxy. Users have faced the problem of #2917 blocking.

On the other hand, there are problems with connection resets after restart. In projects that use Gunicorn together with a reverse proxy, the behavior before the change #2918 is expected.

I think we need a compromise in the form of an additional setting. By default, so that the worker behavior is the same as before the change #2918. But if it is necessary to handle many speculative connections, then users can enable the option --worker-connections-enqueue-async #3340.

@benoitc, what do you think about this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment