Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queue congestion (public queue) #1446

Closed
ihsaan-ullah opened this issue May 19, 2024 · 10 comments · Fixed by #1688
Closed

Queue congestion (public queue) #1446

ihsaan-ullah opened this issue May 19, 2024 · 10 comments · Fixed by #1688
Labels
Bug Enhancement Feature suggestions and improvements P1 High priority, but NOT a current blocker

Comments

@ihsaan-ullah
Copy link
Collaborator

We frequently notice that submissions get stuck in the queue. This happens with RabbitMQ when incoming messages rate is greater then outgoing messages.

How can we solve this issue?

  • Can we have multiple Rabbit Queues (one for each Codabench Queue)
  • Can we increase the rate limits so that we don't get this problem
  • Can we have a UI to monitor the queue and interact with the queue?
@Didayolo
Copy link
Member

Didayolo commented May 23, 2024

TODO P1 development:

  • If congestion in the queue detected (e.g. 10+ submissions in "Submitted"), add a warning and prevent new submissions.

The "Submitted" count depends on the queue of the competition (including for default queue).

Problem: sometimes it's stuck on "Submitted", even after service restart, so it may be considered congested indefinitely.


Suggestions:

  • Similarly to Limit CPU usage on the default queue #1141, we could limit the "daily submissions limit" for competitions using the default queue
  • Having more workers would be beneficial now that Codabench is widely adopted
  • Should we optimize the queueing / ordering algorithm? Right now it is a simple first-in-first-out, so a single competition can block a whole queue.

Bug report: all queues blocked

Apparently, when one queue is congested, then all queues get stuck. @ihsaan-ullah Can we confirm the existence of this bug?

@Didayolo Didayolo added Discussion Needs to be discussed before we can come up with specifications and begin the issue Enhancement Feature suggestions and improvements Bug P1 High priority, but NOT a current blocker labels May 23, 2024
@Didayolo
Copy link
Member

There is actually a bug, not simply overload of submissions.

When the bug happens, the workers stopped being able to communicate their status with the main server because of a 500 error.

@ObadaS Do you have some logs to complete this diagnosis?

@Didayolo
Copy link
Member

Didayolo commented May 28, 2024

On going: fix server time (5 minutes late)

To check:

timedatectl status

Set time:

sudo ntpdate ntp.u-psud.fr

@ObadaS
Copy link
Collaborator

ObadaS commented May 28, 2024

[2024-05-28 07:52:15,862: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Preparing', 'status_details': None, 'secret': ''}
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Failed', 'status_details': 'Failure updating submission data.', 'secret': ''}
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Destroying submission temp dir: /codabench/tmpuz0or1fz
[2024-05-28 07:52:16,122: ERROR/ForkPoolWorker-155] Task compute_worker_run[...] raised unexpected: SubmissionException('Failure updating submission data.')
Traceback (most recent call last):
  File "/compute_worker.py", line 115, in run_wrapper
    run.prepare()
  File "/compute_worker.py", line 751, in prepare
    self._update_status(STATUS_PREPARING)
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/compute_worker.py", line 123, in run_wrapper
    run._update_status(STATUS_FAILED, str(e))
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.

I removed the secret in the logs, so them being empty is normal.
Also removed the id that was in the [] after compute_worker_run

@Didayolo
Copy link
Member

Conjointly, users reported that their custom workers stopped receiving submissions. It worked again after we cured the production server (#1455).

@Didayolo Didayolo removed Discussion Needs to be discussed before we can come up with specifications and begin the issue Production labels May 29, 2024
@ObadaS
Copy link
Collaborator

ObadaS commented Jun 7, 2024

It happened again, a Django container restart fixed the problem as expected.
I have more logs to share now:

[2024-06-06 20:34:28 +0000] [13] [INFO] ('172.31.0.5', 55760) - "WebSocket /submission_output/" [accepted]
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server

This is the first submission that is having trouble updating its Status from the Worker. After this error starts, all future updates bugs out with the same error, even tasks that were launched hours before the bug happened. While it seems to be a reddis issue, the logs on the reddis container does not show any abnormality.

While looking at the Django logs, I also found some new errors (not sure if they are related to this issue so I'll add them for now and remove them later if they end up not being relevent)
This error starts well before the Django bug related to this issue, which is why I'm not sure if they are related. The log only goes back a few days so I'm not sure when exactly this error starts appearing.
It does happen multiple times in the span of days though.

[2024-06-06 20:32:05 +0000] [12] [INFO] ('172.31.0.5', 52770) - "WebSocket /submission_output/" [accepted]
Error in opening handshake
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 130, in handler
    path = await self.handshake(
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 584, in handshake
    self.write_http_response(http.HTTPStatus.SWITCHING_PROTOCOLS, response_headers)
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 264, in write_http_response
    self.transport.write(response.encode())
  File "uvloop/handles/stream.pyx", line 674, in uvloop.loop.UVStream.write
  File "uvloop/handles/handle.pyx", line 159, in uvloop.loop.UVHandle._ensure_alive
RuntimeError: unable to perform operation on <TCPTransport closed=True reading=False [...] >; the handler is closed
[2024-06-06 20:32:05 +0000] [12] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 157, in run_asgi
    result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/asgi2.py", line 7, in __call__
    await instance(receive, send)
  File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 183, in __call__
    return await self.inner(receive, self.send)
  File "/usr/local/lib/python3.8/site-packages/channels/middleware.py", line 41, in coroutine_call
    await inner_instance(receive, send)
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 58, in __call__
    await await_many_dispatch(
  File "/usr/local/lib/python3.8/site-packages/channels/utils.py", line 51, in await_many_dispatch
    await dispatch(result)
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 73, in dispatch
    await handler(message)
  File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 240, in websocket_disconnect
    await self.disconnect(message["code"])
  File "/app/src/apps/competitions/consumers.py", line 65, in disconnect
    await self.close()
  File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 226, in close
    await super().send({"type": "websocket.close"})
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 81, in send
    await self.base_send(message)
  File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 236, in send
    return await self.real_send(message)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 222, in asgi_send
    await self.close(code)
  File "/usr/local/lib/python3.8/site-packages/websockets/protocol.py", line 697, in close
    await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 475, in wait_for
    fut = ensure_future(fut, loop=loop)
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 684, in ensure_future
    raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required

The [...] replaces a hexadecimal number which I'm not sure should be kept private or not so I removed it for now

@Didayolo
Copy link
Member

Didayolo commented Jun 7, 2024

Nice investigation. I wonder why after a first failure then the next submissions fail too.

@ObadaS
Copy link
Collaborator

ObadaS commented Jun 7, 2024

And the internal server error also only affects submissions. It seems like every other part of the website works correctly except for this when it happens.

Is it possible to get more information from Django in the log concerning the "Internal Server Error" ? I feel like the real error is being hidden, or at least that it's not clear enough with the logs we have.

@ObadaS
Copy link
Collaborator

ObadaS commented Nov 14, 2024

Hello, I think I have finally found what is causing this issue.

I managed to get some more logs after activating the DEBUG log level on Django and waiting for the problem to appear again.

When things works well, this appears in the log every time a worker connects to Django to update a submission. (I removed the submission_id value)

Creating tcp connection to ('redis', 6379)
server > Frame(fin=True, opcode=1, data=b'{"type": "message", "submission_id": , "data": {"kind": "status_update", "status": "Running"}}', rsv1=False, rsv2=False, rsv3=False)

When the problem happens, this changes a bit, becoming:

Creating tcp connection to ('redis', 6379)
Cancelling waiter (<Future pending cb=[<TaskWakeupMethWrapper object at 0x72656d5ec580>()]>, [None, None])
Internal Server Error: /api/submissions/

After some research, I found this issue : django/channels_redis#279 (comment)
This is for a package that we are actually using in the project.

Basically, the connection to Redis does not get closed properly. Eventually, no more connection can be created.
The comment says the issue appeared in aioredis 1.3.x (we are using aioredis 1.3.1 through channels-redis 3.2.0).

So this will happen more quickly the more submissions we have, having to restart Django to temporarily reset the connections number.

Solution

Update the channels-redis to version 4.0.0 minimum, which has the aioredis 2.0 that fixes the issue.

However, this does not seem to be as easy. I tried to update the packages manually on my PC, but it seems like one of the required package, django-redis-cache is not maintained anymore, but is needed when trying to upgrade our packages to use channels-redis 4.0.0. It probably has been migrated to another name or absorbed by another package, so we will need to investigate a bit to successfully upgrade the package and fix the problem once and for all (hopefully)

@bbearce
Copy link
Collaborator

bbearce commented Dec 2, 2024

django-redis-cache has a requirement for aioredis but we don't use it and channels-redis doesn't need aioredis anymore (once 4.0.0). So I removed django-redis-cache locally and this automatically removed aioredis too.

aioredis got absorbed by redis-py...this is the general redis package these days. Redis was updated to 5.2.0 locally to test.

channels-redis dependency list:

(codabench_3.9.20) bbearce@pop-os:~/Documents/codabench$ poetry show channels-redis
 name         : channels-redis                                 
 version      : 4.0.0                                          
 description  : Redis-backed ASGI channel layer implementation 

dependencies
 - asgiref >=3.2.10,<4
 - channels *
 - msgpack >=1.0,<2.0
 - redis >=4.2.0

Also, this is a hunch but I don't think we were using django-redis-cache. Reason: CACHES in codabench/src/settings/base.py shows a backend of django_redis which is "django-redis = "4.12.1" from our pyproject.toml. Also there is this post I found:

https://stackoverflow.com/questions/21932097/difference-between-django-redis-cache-and-django-redis-for-redis-caching-with-dj

Changes captured in PR: 1688

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Enhancement Feature suggestions and improvements P1 High priority, but NOT a current blocker
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants