Remote workers not receiving submissions (private queue) #1473
Comments
The public queue seems to be working, but I cannot use it for our competition because of the run time limit.
I am now getting the following output logs:
Hi, it seems to be a general problem. I am not able to compute submissions on my custom queue either. The difference is that I just don't have any logs in my case.
@ObadaS Do you think this problem has been happening since the Caddy upgrade? It is a very concerning issue; we need to solve it ASAP.
Caddy is not used for workers since it only supports the HTTP(S) protocol and not the AMQP protocol that RabbitMQ uses, so I don't think it's connected to the Caddy upgrade.
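For context, a minimal sketch of why Caddy is out of the picture here; the environment variable and URL below are illustrative, not taken from this thread or from Codabench's actual configuration. The worker speaks AMQP to RabbitMQ directly, so the HTTP(S) reverse proxy never sits on that path.

```python
# Illustrative sketch only (variable name and broker URL are hypothetical):
# a compute worker connects to RabbitMQ directly over AMQP, so Caddy, which
# only reverse-proxies HTTP(S), is not involved in job delivery.
import os
from celery import Celery

# e.g. pyamqp://<user>:<password>@<server-host>:5672/<vhost-uuid>
broker_url = os.environ.get("BROKER_URL", "pyamqp://guest:guest@localhost:5672//")

worker_app = Celery("compute-worker", broker=broker_url)
```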
@johanneskruse Hello, have you tried creating a new worker on a fresh machine/virtual machine? Also, just to make sure: are you using this command to launch the worker, without changing anything?
@ObadaS I attached the worker wk8 to a custom queue named "TMP", and I confirm this is not working. I added you as an admin of the queue if you want to do some tests.
@johanneskruse The problem should be fixed. @Didayolo I got some logs that we can analyze later for this weird problem. I tried restarting the RabbitMQ container alone, but it did not fix the problem. It only got fixed after I restarted all the containers (but I suspect the problem might be coming from the site_worker container; we can try restarting only that one if the problem happens again in the future).
@ObadaS - I can confirm that it is indeed working on my end. Thank you both for your swift actions!
The remote workers are now receiving jobs and running; however, a job previously finished within 3-4 hours, whereas now jobs are running for more than 8 hours and some haven't even stopped yet. Nothing has changed on my end. Is there any explanation, or something that might fix this?
That is weird indeed. I haven't noticed any difference on my side (another queue and competition). It seems that the problem with receiving submissions happened again. I'm restarting the service again.
Systems are still down on my end.
Hi all, we also have something similar to what is described here as "long running time": the status is not updated for a long time after scoring is finished and the submission is displayed as "running" (but this is not a consistent issue and would need a more in-depth look on our end).
Coming back from the weekend, I'll try to get some logs to make progress on the issue, then restart the service.
@ObadaS Do you think these logs have something to do with the problem? Could the new version of Caddy be problematic?
codabench-caddy-1 | 2024/06/17 07:25:35.888 ERROR http.log.access.log0 handled request {"request": {"remote_ip": "129.175.8.21", "remote_port": "57760", "client_ip": "129.175.8.21", "proto": "HTTP/1.1", "method": "GET", "host": "www.codabench.org", "uri": "/submission_output/", "headers": {"Origin": ["https://www.codabench.org"], "X-Forwarded-Proto": ["wss"], "Connection": ["Upgrade"], "Cookie": [], "Sec-Websocket-Version": ["13"], "X-Forwarded-Server": ["02dfc7840419"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"], "Accept-Encoding": ["gzip, deflate, br, zstd"], "Cache-Control": ["no-cache"], "Sec-Websocket-Extensions": ["permessage-deflate; client_max_window_bits"], "X-Forwarded-For": ["154.178.176.10"], "X-Forwarded-Port": ["443"], "X-Real-Ip": ["154.178.176.10"], "X-Forwarded-Host": ["www.codabench.org"], "Accept-Language": ["en-US,en;q=0.9"], "Pragma": ["no-cache"], "Sec-Websocket-Key": ["l88AD5IfsEn+5eAS4XyPEA=="], "Upgrade": ["websocket"]}}, "bytes_read": 0, "user_id": "", "duration": 0.007179748, "size": 0, "status": 403, "resp_headers": {"Content-Type": ["text/plain"], "Date": ["Mon, 17 Jun 2024 07:25:35 GMT"], "Content-Length": ["0"]}}
codabench-caddy-1 | 2024/06/17 07:25:11.905 ERROR http.log.access.log0 handled request {"request": {"remote_ip": "129.175.8.21", "remote_port": "57760", "client_ip": "129.175.8.21", "proto": "HTTP/1.1", "method": "GET", "host": "www.codabench.org", "uri": "/submission_output/", "headers": {"Cookie": [], "X-Real-Ip": ["154.178.176.10"], "Accept-Encoding": ["gzip, deflate, br, zstd"], "Sec-Websocket-Version": ["13"], "X-Forwarded-For": ["154.178.176.10"], "X-Forwarded-Host": ["www.codabench.org"], "Pragma": ["no-cache"], "Accept-Language": ["en-US,en;q=0.9"], "Sec-Websocket-Extensions": ["permessage-deflate; client_max_window_bits"], "Upgrade": ["websocket"], "X-Forwarded-Port": ["443"], "Cache-Control": ["no-cache"], "Connection": ["Upgrade"], "Origin": ["https://www.codabench.org"], "Sec-Websocket-Key": ["wkZYbRNKsIevR63IG0R+iQ=="], "X-Forwarded-Proto": ["wss"], "X-Forwarded-Server": ["02dfc7840419"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]}}, "bytes_read": 0, "user_id": "", "duration": 0.009992954, "size": 0, "status": 403, "resp_headers": {"Content-Length": ["0"], "Content-Type": ["text/plain"], "Date": ["Mon, 17 Jun 2024 07:25:11 GMT"]}}
codabench-django-1 | [2024-06-17 07:25:11 +0000] [11] [ERROR] Exception in ASGI application
codabench-django-1 | Traceback (most recent call last):
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 157, in run_asgi
codabench-django-1 | result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
codabench-django-1 | return await self.app(scope, receive, send)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/asgi2.py", line 7, in __call__
codabench-django-1 | await instance(receive, send)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 183, in __call__
codabench-django-1 | return await self.inner(receive, self.send)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/middleware.py", line 41, in coroutine_call
codabench-django-1 | await inner_instance(receive, send)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 58, in __call__
codabench-django-1 | await await_many_dispatch(
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/utils.py", line 51, in await_many_dispatch
codabench-django-1 | await dispatch(result)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 73, in dispatch
codabench-django-1 | await handler(message)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 240, in websocket_disconnect
codabench-django-1 | await self.disconnect(message["code"])
codabench-django-1 | File "/app/src/apps/competitions/consumers.py", line 65, in disconnect
codabench-django-1 | await self.close()
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 226, in close
codabench-django-1 | await super().send({"type": "websocket.close"})
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 81, in send
codabench-django-1 | await self.base_send(message)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 236, in send
codabench-django-1 | return await self.real_send(message)
codabench-django-1 | File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 234, in asgi_send
codabench-django-1 | raise RuntimeError(msg % message_type)
codabench-django-1 | RuntimeError: Unexpected ASGI message 'websocket.close', after sending 'websocket.close'.
codabench-django-1 | [2024-06-17 07:25:12 +0000] [12] [INFO] ('192.168.48.7', 54656) - "WebSocket /submission_output/" 403
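As an aside, the RuntimeError in that traceback is the typical symptom of a consumer calling self.close() from its disconnect() handler after the socket has already been closed. A hypothetical sketch of the pattern that avoids it; this is not the actual apps/competitions/consumers.py, and the class and group names are illustrative.

```python
# Hypothetical sketch, not the actual Codabench consumer: by the time
# disconnect() runs, the "websocket.close" message has already been handled,
# so sending another close via self.close() raises the RuntimeError above.
# Doing only cleanup here avoids the double close.
from channels.generic.websocket import AsyncWebsocketConsumer


class SubmissionOutputConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # "submission-output" is an illustrative group name.
        await self.channel_layer.group_add("submission-output", self.channel_name)
        await self.accept()

    async def disconnect(self, close_code):
        # Release per-connection resources; do NOT call self.close() here,
        # because the socket is already closed when disconnect() runs.
        await self.channel_layer.group_discard("submission-output", self.channel_name)
```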
@Didayolo I don't think this is a Caddy problem. Those errors don't seem to have anything to do with this problem. I think the problem comes from the site_worker container.
codabench-site_worker-1 | [2024-06-17 05:36:48,866: ERROR/ForkPoolWorker-2] Task competitions.tasks._run_submission[...] raised unexpected: PreconditionFailed(406, "PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '988f0756-7c60-40e6-9e6b-40d830125d83': received none but current is the value '10' of type 'signedint'", (50, 10), 'Queue.declare')
codabench-site_worker-1 | Traceback (most recent call last):
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 382, in trace_task
codabench-site_worker-1 | R = retval = fun(*args, **kwargs)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 641, in __protected_call__
codabench-site_worker-1 | return self.run(*args, **kwargs)
codabench-site_worker-1 | File "/app/src/apps/competitions/tasks.py", line 318, in _run_submission
codabench-site_worker-1 | _send_to_compute_worker(submission, is_scoring)
codabench-site_worker-1 | File "/app/src/apps/competitions/tasks.py", line 199, in _send_to_compute_worker
codabench-site_worker-1 | task = celery_app.send_task(
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/celery/app/base.py", line 745, in send_task
codabench-site_worker-1 | amqp.send_task_message(P, name, message, **options)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/celery/app/amqp.py", line 543, in send_task_message
codabench-site_worker-1 | ret = producer.publish(
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 178, in publish
codabench-site_worker-1 | return _publish(
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/connection.py", line 533, in _ensured
codabench-site_worker-1 | return fun(*args, **kwargs)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 194, in _publish
codabench-site_worker-1 | [maybe_declare(entity) for entity in declare]
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 194, in <listcomp>
codabench-site_worker-1 | [maybe_declare(entity) for entity in declare]
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 102, in maybe_declare
codabench-site_worker-1 | return maybe_declare(entity, self.channel, retry, **retry_policy)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/common.py", line 121, in maybe_declare
codabench-site_worker-1 | return _maybe_declare(entity, channel)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/common.py", line 161, in _maybe_declare
codabench-site_worker-1 | entity.declare(channel=channel)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 611, in declare
codabench-site_worker-1 | self._create_queue(nowait=nowait, channel=channel)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 620, in _create_queue
codabench-site_worker-1 | self.queue_declare(nowait=nowait, passive=False, channel=channel)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 648, in queue_declare
codabench-site_worker-1 | ret = channel.queue_declare(
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1148, in queue_declare
codabench-site_worker-1 | return queue_declare_ok_t(*self.wait(
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 88, in wait
codabench-site_worker-1 | self.connection.drain_events(timeout=timeout)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
codabench-site_worker-1 | while not self.blocking_read(timeout):
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
codabench-site_worker-1 | return self.on_inbound_frame(frame)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
codabench-site_worker-1 | callback(channel, method_sig, buf, None)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
codabench-site_worker-1 | return self.channels[channel_id].dispatch_method(
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
codabench-site_worker-1 | listener(*args)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
codabench-site_worker-1 | raise error_for_code(
codabench-site_worker-1 | amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '988f0756-7c60-40e6-9e6b-40d830125d83': received none but current is the value '10' of type 'signedint'
Caddy does not redirect the requests to RabbitMQ since it does not support protocols other than HTTP(S).
Ah yes, that is what we wanted to try. I couldn't remember and did not want to leave the queue blocked for too long. Thank you very much for your input. So we can see that the error message is consistent.
It is running again after the reset - thank you!
@johanneskruse May I ask how many workers you have on your private queue? Apart from the bug reported above, it seems that your queue does not have enough resources to handle the load of your competition.
@Didayolo - the competition is coming to an end and we are experiencing a high level of submissions. At the moment we have 20 workers, but I can see that a few of them have crashed and I have to reset them. Is there a limit on the number of workers?
According to the issue amqp-node/amqplib#165, the error:
codabench-site_worker-1 | [2024-06-17 05:36:48,866: ERROR/ForkPoolWorker-2] Task competitions.tasks._run_submission[...]
...
codabench-site_worker-1 | listener(*args)
codabench-site_worker-1 | File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
codabench-site_worker-1 | raise error_for_code(
codabench-site_worker-1 | amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '988f0756-7c60-40e6-9e6b-40d830125d83': received none but current is the value '10' of type 'signedint'
from @Didayolo's logs is due to a change in the queue configuration. It seems that at some point the queue is created with 'x-max-priority=10', but at some later moment it is declared without this maximum priority value. Does that make sense to you?
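For illustration, here is a minimal sketch of the mechanism described above; the broker URL is a placeholder and this is not Codabench code. RabbitMQ fixes a queue's optional arguments at its first declaration, and any later declaration with different arguments is rejected with 406 PRECONDITION_FAILED, exactly as in the logs.

```python
# Minimal reproduction sketch (illustrative broker URL, not Codabench code):
# redeclaring an existing queue with different arguments makes RabbitMQ
# close the channel with 406 PRECONDITION_FAILED.
from kombu import Connection, Exchange, Queue

with Connection("amqp://guest:guest@localhost:5672//") as conn:
    channel = conn.channel()
    exchange = Exchange("compute-worker", type="direct")

    # First declaration: the queue is created with x-max-priority = 10.
    Queue("compute-worker", exchange, routing_key="compute-worker",
          queue_arguments={"x-max-priority": 10}).bind(channel).declare()

    # A later declaration without that argument ("received none but current
    # is the value '10'") raises amqp.exceptions.PreconditionFailed.
    Queue("compute-worker", exchange, routing_key="compute-worker").bind(channel).declare()
```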
@johanneskruse: No, there is no limit.
@xbaro: Yes, I've seen this explanation. However, everywhere in the code we set this value to 10; I don't know how exactly it sometimes ends up receiving no value.
codabench/src/celery_config.py, line 15, at commit 9c5a545
Maybe it is declared somewhere else?
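For reference, a hedged sketch of the two common ways a Celery app pins that priority so every declarer uses identical queue arguments; the queue and exchange names below are illustrative, and this is not the actual contents of codabench/src/celery_config.py.

```python
# Hedged sketch only, not codabench/src/celery_config.py. Two common ways to
# ensure every declaration of the queue carries x-max-priority = 10:
from celery import Celery
from kombu import Exchange, Queue

app = Celery("codabench")

# Option 1: a global default applied to every queue Celery declares.
app.conf.task_queue_max_priority = 10

# Option 2: set the argument explicitly on the queue definition, so any
# component declaring "compute-worker" passes the same arguments.
app.conf.task_queues = (
    Queue(
        "compute-worker",
        Exchange("compute-worker"),
        routing_key="compute-worker",
        queue_arguments={"x-max-priority": 10},
    ),
)
```

If any producer or worker declares the same queue without these arguments, RabbitMQ rejects the declaration, which matches the "received none but current is the value '10'" error above.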
@johanneskruse Done. An interesting finding from investigating this problem: it happens after cancelling a submission.
@Didayolo - workers seem to be disconnected.
@johanneskruse Hello, I fixed it for now.
@johanneskruse Hi, the good news is that we found a fix! We will deploy it to production soon.
Hi,
I'm running a competition and my remote workers have stopped receiving any submissions.
I don't have any error logs from them. It might be related to #1446.
Best,
Johannes