Queue congestion (public queue) #1446
TODO P1 development:
The "Submitted" count depends on the queue of the competition (including for default queue). Problem: sometimes it's stuck on "Submitted", even after service restart, so it may be considered congested indefinitely. Suggestions:
Bug report: all queues blocked
Apparently, when one queue is congested, all queues get stuck. @ihsaan-ullah Can we confirm the existence of this bug?
There is actually a bug, not simply an overload of submissions. When the bug happens, the workers are no longer able to report their status to the main server because of a 500 error. @ObadaS Do you have some logs to complete this diagnosis?
Ongoing: fix the server time (5 minutes late).
To check:
Set time:
[2024-05-28 07:52:15,862: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Preparing', 'status_details': None, 'secret': ''}
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response =
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Failed', 'status_details': 'Failure updating submission data.', 'secret': ''}
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response =
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Destroying submission temp dir: /codabench/tmpuz0or1fz
[2024-05-28 07:52:16,122: ERROR/ForkPoolWorker-155] Task compute_worker_run[...] raised unexpected: SubmissionException('Failure updating submission data.')
Traceback (most recent call last):
File "/compute_worker.py", line 115, in run_wrapper
run.prepare()
File "/compute_worker.py", line 751, in prepare
self._update_status(STATUS_PREPARING)
File "/compute_worker.py", line 359, in _update_status
self._update_submission(data)
File "/compute_worker.py", line 342, in _update_submission
raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
return self.run(*args, **kwargs)
File "/compute_worker.py", line 123, in run_wrapper
run._update_status(STATUS_FAILED, str(e))
File "/compute_worker.py", line 359, in _update_status
self._update_submission(data)
File "/compute_worker.py", line 342, in _update_submission
raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.
I removed the
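The traceback above shows the worker raising `SubmissionException` as soon as a status update returns a 500. As a rough sketch only (not the actual `compute_worker.py` code), a retry with exponential backoff could look like the following. `send_patch` is a hypothetical callable that sends the update and returns the HTTP status code, and the `retries`, `base_delay` and `sleep` parameters are assumptions made for illustration.

```python
import time


class SubmissionException(Exception):
    """Raised when a status update cannot be delivered (name mirrors the traceback)."""


def update_with_retry(send_patch, data, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send_patch(data) until it returns a non-5xx status or retries run out.

    send_patch is a hypothetical callable that returns an HTTP status code.
    """
    for attempt in range(retries + 1):
        status = send_patch(data)
        if status < 500:
            return status
        if attempt < retries:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    raise SubmissionException("Failure updating submission data.")
```

Retrying only helps with transient failures. In the incident described in this thread, the 500 errors persisted until the Django container was restarted, so a retry would delay the failure rather than prevent it.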
At the same time, users reported that their custom workers stopped receiving submissions. It worked again after we fixed the production server (#1455).
It happened again; a Django container restart fixed the problem, as expected.
[2024-06-06 20:34:28 +0000] [13] [INFO] ('172.31.0.5', 55760) - "WebSocket /submission_output/" [accepted]
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
This is the first submission that had trouble updating its status from the worker. Once this error starts, all later updates fail with the same error, even for tasks that were launched hours before the bug happened. While it seems to be a Redis issue, the logs of the Redis container do not show any abnormality.
While looking at the Django logs, I also found some new errors (I'm not sure they are related to this issue, so I'll add them for now and remove them later if they turn out not to be relevant):
[2024-06-06 20:32:05 +0000] [12] [INFO] ('172.31.0.5', 52770) - "WebSocket /submission_output/" [accepted]
Error in opening handshake
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 130, in handler
path = await self.handshake(
File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 584, in handshake
self.write_http_response(http.HTTPStatus.SWITCHING_PROTOCOLS, response_headers)
File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 264, in write_http_response
self.transport.write(response.encode())
File "uvloop/handles/stream.pyx", line 674, in uvloop.loop.UVStream.write
File "uvloop/handles/handle.pyx", line 159, in uvloop.loop.UVHandle._ensure_alive
RuntimeError: unable to perform operation on <TCPTransport closed=True reading=False [...] >; the handler is closed
[2024-06-06 20:32:05 +0000] [12] [ERROR] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 157, in run_asgi
result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/asgi2.py", line 7, in __call__
await instance(receive, send)
File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 183, in __call__
return await self.inner(receive, self.send)
File "/usr/local/lib/python3.8/site-packages/channels/middleware.py", line 41, in coroutine_call
await inner_instance(receive, send)
File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 58, in __call__
await await_many_dispatch(
File "/usr/local/lib/python3.8/site-packages/channels/utils.py", line 51, in await_many_dispatch
await dispatch(result)
File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 73, in dispatch
await handler(message)
File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 240, in websocket_disconnect
await self.disconnect(message["code"])
File "/app/src/apps/competitions/consumers.py", line 65, in disconnect
await self.close()
File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 226, in close
await super().send({"type": "websocket.close"})
File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 81, in send
await self.base_send(message)
File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 236, in send
return await self.real_send(message)
File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 222, in asgi_send
await self.close(code)
File "/usr/local/lib/python3.8/site-packages/websockets/protocol.py", line 697, in close
await asyncio.wait_for(
File "/usr/local/lib/python3.8/asyncio/tasks.py", line 475, in wait_for
fut = ensure_future(fut, loop=loop)
File "/usr/local/lib/python3.8/asyncio/tasks.py", line 684, in ensure_future
raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required
The [...] replaces a hexadecimal number. I'm not sure whether it should be kept private, so I removed it for now.
Nice investigation. I wonder why, after a first failure, the following submissions fail too.
And the internal server error only affects submissions. When it happens, every other part of the website seems to work correctly. Is it possible to get more information from Django in the log about the "Internal Server Error"? I feel like the real error is being hidden, or at least that the logs we have are not clear enough.
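One way to get more detail out of Django is to lower the log level. Django's `LOGGING` setting follows the standard `logging.config.dictConfig` schema, so a minimal sketch could look like this. The formatter and handler names are arbitrary choices, and this is not the project's actual settings file.

```python
import logging.config

# Hypothetical settings fragment. Django reads a dict named LOGGING from the
# settings module and applies it with logging.config.dictConfig.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "verbose": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "verbose"},
    },
    # Root logger at DEBUG so that messages from third-party libraries
    # are emitted as well, not only Django's own.
    "root": {"handlers": ["console"], "level": "DEBUG"},
}

# Outside Django, the same dict can be applied by hand:
logging.config.dictConfig(LOGGING)
```

Setting the root logger to DEBUG is very verbose, so it is probably best kept on only long enough to capture the failure.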
Hello, I think I have finally found what is causing this issue. I managed to get more logs after enabling the DEBUG log level on Django and waiting for the problem to appear again.
When things work well, this appears in the log every time a worker connects to Django to update a submission. (I removed the
Creating tcp connection to ('redis', 6379)
server > Frame(fin=True, opcode=1, data=b'{"type": "message", "submission_id": , "data": {"kind": "status_update", "status": "Running"}}', rsv1=False, rsv2=False, rsv3=False)
When the problem happens, this changes a bit, becoming:
Creating tcp connection to ('redis', 6379)
Cancelling waiter (<Future pending cb=[<TaskWakeupMethWrapper object at 0x72656d5ec580>()]>, [None, None])
Internal Server Error: /api/submissions/
After some research, I found this issue: django/channels_redis#279 (comment)
Basically, the connection to Redis does not get closed properly. Eventually, no more connections can be created. So the more submissions we have, the sooner this happens, and we have to restart Django to temporarily reset the number of connections.
Solution
Update the
However, this does not seem to be as easy. I tried to update the packages manually on my PC, but it seems like one of the required packages,
Also, this is a hunch, but I don't think we were using
Changes captured in PR: 1688
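To illustrate the failure mode described in this comment (connections that are opened but never closed until none can be created), here is a toy model. It does not use Redis or `channels_redis`: `FakeServer` and its connection limit are invented for illustration, and the error message is only borrowed from the traceback above.

```python
import asyncio


class FakeServer:
    """Toy stand-in for a server with a connection limit (not Redis itself)."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open_connections = 0

    async def connect(self):
        if self.open_connections >= self.max_connections:
            raise ConnectionError("Connection has been closed by server")
        self.open_connections += 1
        return self

    async def close(self):
        self.open_connections -= 1


async def send_leaky(server):
    # Opens a connection and never closes it.
    await server.connect()


async def send_clean(server):
    # Always releases the connection, even if the send fails.
    conn = await server.connect()
    try:
        await asyncio.sleep(0)  # placeholder for the actual send
    finally:
        await conn.close()


async def count_successes(send, server, attempts):
    """Return how many of `attempts` calls to send(server) succeeded."""
    ok = 0
    for _ in range(attempts):
        try:
            await send(server)
            ok += 1
        except ConnectionError:
            pass
    return ok
```

With a limit of 3 connections, `asyncio.run(count_successes(send_leaky, FakeServer(3), 10))` succeeds only for the first 3 calls, while the `send_clean` variant succeeds for all 10. This matches the observation that every update fails once the limit is reached and that a restart, which resets the count, fixes it temporarily.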
We frequently notice that submissions get stuck in the queue. This happens with RabbitMQ when the rate of incoming messages is greater than the rate of outgoing messages.
How can we solve this issue?
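The rate argument above can be made concrete with a small calculation. This is a simplified model that assumes constant rates, and the numbers in the usage note are made up for illustration.

```python
def backlog_after(seconds, incoming_per_s, outgoing_per_s, initial=0.0):
    """Queue length after `seconds` with constant rates; never below zero."""
    return max(0.0, initial + (incoming_per_s - outgoing_per_s) * seconds)
```

For example, with 5 messages per second coming in and 3 per second going out, `backlog_after(60, 5, 3)` gives a backlog of 120 messages after one minute. The queue only drains when the outgoing rate exceeds the incoming one.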