
Remote workers not receiving submissions (private queue) #1473

Closed
johanneskruse opened this issue Jun 9, 2024 · 30 comments · Fixed by #1503
Labels: Bug, P1 (High priority, but NOT a current blocker)

Comments

@johanneskruse

Hi,

I'm running a competition and my remote workers have stopped receiving any submissions.

[screenshot]

I don't have any error logs from them. It might be related to #1446.

Best,
Johannes

@johanneskruse
Author

The public queue seems to be working, but I cannot use it for our competition because of the run time limit.

@johanneskruse
Author

I am now getting the following output logs:

$ docker logs -f compute_worker
/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(
 
 -------------- compute-worker@efe5103be5d4 v4.4.0 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-5.15.0-1062-aws-x86_64-with-glibc2.34 2024-06-09 19:02:33
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x7f7a65017a30
- ** ---------- .> transport:   amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/4d4c54a3-6e7f-418a-8d6b-36345ea8b37c
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> compute-worker   exchange=compute-worker(direct) key=compute-worker
                

[tasks]
  . compute_worker_run

[2024-06-09 19:02:33,405: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/4d4c54a3-6e7f-418a-8d6b-36345ea8b37c
[2024-06-09 19:02:33,523: INFO/MainProcess] mingle: searching for neighbors
[2024-06-09 19:02:34,879: INFO/MainProcess] mingle: all alone
[2024-06-09 19:02:35,079: CRITICAL/MainProcess] Unrecoverable error: PreconditionFailed(406, "PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '4d4c54a3-6e7f-418a-8d6b-36345ea8b37c': received the value '10' of type 'signedint' but current is none", (50, 10), 'Queue.declare')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/tasks.py", line 40, in start
    c.task_consumer = c.app.amqp.TaskConsumer(
  File "/usr/local/lib/python3.8/site-packages/celery/app/amqp.py", line 301, in TaskConsumer
    return self.Consumer(
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 386, in __init__
    self.revive(self.channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 408, in revive
    self.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 421, in declare
    queue.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 611, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 620, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 648, in queue_declare
    ret = channel.queue_declare(
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1148, in queue_declare
    return queue_declare_ok_t(*self.wait(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 88, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '4d4c54a3-6e7f-418a-8d6b-36345ea8b37c': received the value '10' of type 'signedint' but current is none

@Didayolo
Member

Hi,

It seems to be a general problem. I am not able to compute submissions on my custom queue.

The difference is that I just have no logs in my case.

Didayolo added the Bug and P1 (High priority, but NOT a current blocker) labels on Jun 11, 2024
@Didayolo
Member

@ObadaS Do you think this problem is happening since the upgrade of Caddy?

It is a very concerning issue; we need to solve it ASAP.

@ObadaS
Collaborator

ObadaS commented Jun 11, 2024

> @ObadaS Do you think this problem is happening since the upgrade of Caddy?

Caddy is not used for the workers since it only supports HTTP(S) and not the AMQP protocol that RabbitMQ uses, so I don't think it's connected to the Caddy upgrade.

@ObadaS
Collaborator

ObadaS commented Jun 11, 2024

@johanneskruse Hello, have you tried creating a new worker on a fresh machine/virtual machine?

Also, just to make sure: are you using this command to launch the worker, without changing anything?

docker run \
    -v /codabench:/codabench \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -d \
    --env-file .env \
    --name compute_worker \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v2-compute-worker:latest 
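
For reference, the .env file passed with --env-file normally just needs the broker connection string copied from the queue's page on Codabench; a minimal sketch (variable name assumed from the standard compute worker setup, values are placeholders):

# .env (sketch) -- the broker URL is copied from the queue detail page on Codabench
BROKER_URL=pyamqp://<queue-user>:<password>@www.codabench.org:5672/<vhost-id>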

@Didayolo
Member

Didayolo commented Jun 11, 2024

@ObadaS I attached the worker wk8 to a custom queue named "TMP", and I confirm this is not working.

I added you as an admin of the queue if you want to do some tests.

@ObadaS
Collaborator

ObadaS commented Jun 11, 2024

@johanneskruse The problem should be fixed.

@Didayolo I got some logs that we can analyze later for this weird problem. I tried restarting the RabbitMQ container alone, but it did not fix the problem. It only got fixed after I restarted all the containers (but I suspect that the problem might be coming from the site_worker container; we can try restarting only that one if the problem happens again in the future).

Didayolo removed the Blocker label on Jun 11, 2024
@johanneskruse
Author

@ObadaS - I can confirm that it is indeed working on my end.

Thank you both for your swift actions!

@johanneskruse
Author

Hi @Didayolo and @ObadaS.

The remote workers are now receiving jobs and running; however, a job previously finished within 3-4 hours, whereas now jobs are running for more than 8 hours, and some haven't even stopped yet. Nothing has changed on my end. Is there any explanation or something that might fix this?

@Didayolo
Member

Didayolo commented Jun 12, 2024

That is weird indeed. I haven't noticed any difference on my side (another queue and competition).

It seems that the problem with receiving submissions happened again. I'm restarting the service again.

@johanneskruse
Author

johanneskruse commented Jun 12, 2024

Yes, it is very strange. I've previously been able to run four remote workers per instance, but I've had to lower the number as they have started to consume more CPU.

Furthermore, some submissions complete without outputting anything, some run forever, and some fail due to:

[Screenshot 2024-06-12 at 08:52:12]

And some "Failed" but still completed:
[screenshot]

But some also finish normally; it's hard to tell what's going on, but something seems to be somewhat unstable.

Didayolo changed the title from "Remote workers not receiving submissions" to "Remote workers not receiving submissions (private queue)" on Jun 13, 2024
@johanneskruse
Author

Hi @Didayolo and @ObadaS,

The remote workers have again stopped receiving jobs. This is the third week in a row that it has happened around the same time of the week. I'm not sure if it is a coincidence or if there is something more systematic about the timing of the issue.

[Screenshot 2024-06-14 at 22:35:01]

@johanneskruse
Author

johanneskruse commented Jun 17, 2024

Systems are still down on my end.

@jpbrunetsx

Hi all,
Same problem since about Friday 5 pm (CET) on our competition/private workers.

We also see something similar to what is described here as "long running time", where the status is not updated for a long time after scoring has finished and the submission is still displayed as "running" (but this is not a consistent issue and would need a more in-depth look on our end).

@Didayolo
Member

> Systems are still down on my end.

Coming back from the weekend, I'll try to get some logs to make progress on the issue, then restart the service.

@Didayolo
Member

Didayolo commented Jun 17, 2024

@ObadaS Do you think these logs have something to do with the problem? Could the new version of Caddy be problematic?

codabench-caddy-1          | 2024/06/17 07:25:35.888	ERROR	http.log.access.log0	handled request	{"request": {"remote_ip": "129.175.8.21", "remote_port": "57760", "client_ip": "129.175.8.21", "proto": "HTTP/1.1", "method": "GET", "host": "www.codabench.org", "uri": "/submission_output/", "headers": {"Origin": ["https://www.codabench.org"], "X-Forwarded-Proto": ["wss"], "Connection": ["Upgrade"], "Cookie": [], "Sec-Websocket-Version": ["13"], "X-Forwarded-Server": ["02dfc7840419"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"], "Accept-Encoding": ["gzip, deflate, br, zstd"], "Cache-Control": ["no-cache"], "Sec-Websocket-Extensions": ["permessage-deflate; client_max_window_bits"], "X-Forwarded-For": ["154.178.176.10"], "X-Forwarded-Port": ["443"], "X-Real-Ip": ["154.178.176.10"], "X-Forwarded-Host": ["www.codabench.org"], "Accept-Language": ["en-US,en;q=0.9"], "Pragma": ["no-cache"], "Sec-Websocket-Key": ["l88AD5IfsEn+5eAS4XyPEA=="], "Upgrade": ["websocket"]}}, "bytes_read": 0, "user_id": "", "duration": 0.007179748, "size": 0, "status": 403, "resp_headers": {"Content-Type": ["text/plain"], "Date": ["Mon, 17 Jun 2024 07:25:35 GMT"], "Content-Length": ["0"]}}
codabench-caddy-1          | 2024/06/17 07:25:11.905	ERROR	http.log.access.log0	handled request	{"request": {"remote_ip": "129.175.8.21", "remote_port": "57760", "client_ip": "129.175.8.21", "proto": "HTTP/1.1", "method": "GET", "host": "www.codabench.org", "uri": "/submission_output/", "headers": {"Cookie": [], "X-Real-Ip": ["154.178.176.10"], "Accept-Encoding": ["gzip, deflate, br, zstd"], "Sec-Websocket-Version": ["13"], "X-Forwarded-For": ["154.178.176.10"], "X-Forwarded-Host": ["www.codabench.org"], "Pragma": ["no-cache"], "Accept-Language": ["en-US,en;q=0.9"], "Sec-Websocket-Extensions": ["permessage-deflate; client_max_window_bits"], "Upgrade": ["websocket"], "X-Forwarded-Port": ["443"], "Cache-Control": ["no-cache"], "Connection": ["Upgrade"], "Origin": ["https://www.codabench.org"], "Sec-Websocket-Key": ["wkZYbRNKsIevR63IG0R+iQ=="], "X-Forwarded-Proto": ["wss"], "X-Forwarded-Server": ["02dfc7840419"], "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"]}}, "bytes_read": 0, "user_id": "", "duration": 0.009992954, "size": 0, "status": 403, "resp_headers": {"Content-Length": ["0"], "Content-Type": ["text/plain"], "Date": ["Mon, 17 Jun 2024 07:25:11 GMT"]}}
codabench-django-1         | [2024-06-17 07:25:11 +0000] [11] [ERROR] Exception in ASGI application
codabench-django-1         | Traceback (most recent call last):
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 157, in run_asgi
codabench-django-1         |     result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
codabench-django-1         |     return await self.app(scope, receive, send)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/asgi2.py", line 7, in __call__
codabench-django-1         |     await instance(receive, send)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 183, in __call__
codabench-django-1         |     return await self.inner(receive, self.send)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/middleware.py", line 41, in coroutine_call
codabench-django-1         |     await inner_instance(receive, send)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 58, in __call__
codabench-django-1         |     await await_many_dispatch(
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/utils.py", line 51, in await_many_dispatch
codabench-django-1         |     await dispatch(result)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 73, in dispatch
codabench-django-1         |     await handler(message)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 240, in websocket_disconnect
codabench-django-1         |     await self.disconnect(message["code"])
codabench-django-1         |   File "/app/src/apps/competitions/consumers.py", line 65, in disconnect
codabench-django-1         |     await self.close()
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 226, in close
codabench-django-1         |     await super().send({"type": "websocket.close"})
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 81, in send
codabench-django-1         |     await self.base_send(message)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 236, in send
codabench-django-1         |     return await self.real_send(message)
codabench-django-1         |   File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 234, in asgi_send
codabench-django-1         |     raise RuntimeError(msg % message_type)
codabench-django-1         | RuntimeError: Unexpected ASGI message 'websocket.close', after sending 'websocket.close'.
codabench-django-1         | [2024-06-17 07:25:12 +0000] [12] [INFO] ('192.168.48.7', 54656) - "WebSocket /submission_output/" 403

@ObadaS
Collaborator

ObadaS commented Jun 17, 2024

@Didayolo I don't think this is a Caddy problem. Those errors don't seem to have anything to do with this problem. I think the problem comes from the site_worker container.

codabench-site_worker-1  | [2024-06-17 05:36:48,866: ERROR/ForkPoolWorker-2] Task competitions.tasks._run_submission[...] raised unexpected: PreconditionFailed(406, "PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '988f0756-7c60-40e6-9e6b-40d830125d83': received none but current is the value '10' of type 'signedint'", (50, 10), 'Queue.declare')
codabench-site_worker-1  | Traceback (most recent call last):
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 382, in trace_task
codabench-site_worker-1  |     R = retval = fun(*args, **kwargs)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 641, in __protected_call__
codabench-site_worker-1  |     return self.run(*args, **kwargs)
codabench-site_worker-1  |   File "/app/src/apps/competitions/tasks.py", line 318, in _run_submission
codabench-site_worker-1  |     _send_to_compute_worker(submission, is_scoring)
codabench-site_worker-1  |   File "/app/src/apps/competitions/tasks.py", line 199, in _send_to_compute_worker
codabench-site_worker-1  |     task = celery_app.send_task(
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/app/base.py", line 745, in send_task
codabench-site_worker-1  |     amqp.send_task_message(P, name, message, **options)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/app/amqp.py", line 543, in send_task_message
codabench-site_worker-1  |     ret = producer.publish(
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 178, in publish
codabench-site_worker-1  |     return _publish(
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/connection.py", line 533, in _ensured
codabench-site_worker-1  |     return fun(*args, **kwargs)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 194, in _publish
codabench-site_worker-1  |     [maybe_declare(entity) for entity in declare]
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 194, in <listcomp>
codabench-site_worker-1  |     [maybe_declare(entity) for entity in declare]
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 102, in maybe_declare
codabench-site_worker-1  |     return maybe_declare(entity, self.channel, retry, **retry_policy)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/common.py", line 121, in maybe_declare
codabench-site_worker-1  |     return _maybe_declare(entity, channel)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/common.py", line 161, in _maybe_declare
codabench-site_worker-1  |     entity.declare(channel=channel)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 611, in declare
codabench-site_worker-1  |     self._create_queue(nowait=nowait, channel=channel)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 620, in _create_queue
codabench-site_worker-1  |     self.queue_declare(nowait=nowait, passive=False, channel=channel)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 648, in queue_declare
codabench-site_worker-1  |     ret = channel.queue_declare(
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1148, in queue_declare
codabench-site_worker-1  |     return queue_declare_ok_t(*self.wait(
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 88, in wait
codabench-site_worker-1  |     self.connection.drain_events(timeout=timeout)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
codabench-site_worker-1  |     while not self.blocking_read(timeout):
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
codabench-site_worker-1  |     return self.on_inbound_frame(frame)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
codabench-site_worker-1  |     callback(channel, method_sig, buf, None)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
codabench-site_worker-1  |     return self.channels[channel_id].dispatch_method(
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
codabench-site_worker-1  |     listener(*args)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
codabench-site_worker-1  |     raise error_for_code(
codabench-site_worker-1  | amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '988f0756-7c60-40e6-9e6b-40d830125d83': received none but current is the value '10' of type 'signedint'

Caddy does not redirect requests to RabbitMQ since it does not support protocols other than HTTP(S).
It also seems like restarting the site_worker might fix the issue temporarily, which makes it even less likely that the problem comes from Caddy.
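
For what it's worth, the queue's currently declared arguments can be checked directly on the broker, which would show whether 'x-max-priority' is set. A sketch, using the vhost id from the error above (the RabbitMQ container name "rabbit" is an assumption about this deployment):

docker exec rabbit rabbitmqctl list_queues -p 988f0756-7c60-40e6-9e6b-40d830125d83 name arguments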

@Didayolo
Member

Didayolo commented Jun 17, 2024

Ah yes, that is what we wanted to try. I couldn't remember, and did not want to leave the queue blocked for too long. Thank you very much for your input.

So we can see that the error message is consistent.

@johanneskruse
Author

It is running again after the reset - thank you!

@Didayolo
Member

@johanneskruse May I ask how many workers you have on your private queue?

Apart from the bug reported above, it seems that your queue does not have enough resources to handle the load of your competition.

@johanneskruse
Author

@Didayolo - the competition is coming to an end and we are experiencing a high level of submissions.

At the moment we have 20 workers, but I can see a few of them have crashed and I have to reset them. Is there a limit on the number of workers?

@xbaro

xbaro commented Jun 19, 2024

According to this issue amqp-node/amqplib#165, the error:

codabench-site_worker-1  | [2024-06-17 05:36:48,866: ERROR/ForkPoolWorker-2] Task competitions.tasks._run_submission[...] 
...
codabench-site_worker-1  |     listener(*args)
codabench-site_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
codabench-site_worker-1  |     raise error_for_code(
codabench-site_worker-1  | amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost '988f0756-7c60-40e6-9e6b-40d830125d83': received none but current is the value '10' of type 'signedint'

from @Didayolo's logs is due to a change in the queue configuration.

It seems that at some point the queue is created with 'x-max-priority=10', but at some later moment it is declared without this maximum priority value.

Does that make sense to you?
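
For what it's worth, a minimal sketch of that scenario with kombu reproduces the same 406 error (assuming a throwaway local RabbitMQ with default guest credentials; the real queue of course lives on the Codabench broker). RabbitMQ requires every declaration of an existing queue to use identical arguments, so whichever side declares the queue first wins and the other side fails:

from kombu import Connection, Exchange, Queue

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    channel = conn.channel()

    # First declaration: the queue is created with a max priority of 10.
    Queue('compute-worker', Exchange('compute-worker'),
          routing_key='compute-worker',
          queue_arguments={'x-max-priority': 10}).declare(channel=channel)

    # Re-declaring the same queue *without* the argument is rejected with
    # PRECONDITION_FAILED (406, "inequivalent arg 'x-max-priority'"),
    # matching the errors in the logs above.
    Queue('compute-worker', Exchange('compute-worker'),
          routing_key='compute-worker').declare(channel=channel)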

@Didayolo
Member

Didayolo commented Jun 19, 2024

> Is there a limit on the number of workers?

@johanneskruse: No, there is no limit.

> It seems that at some point the queue is created with 'x-max-priority=10', but at some later moment it is declared without this maximum priority value.

@xbaro: Yes, I've seen this explanation. However, everywhere in the code we set this value to 10; I don't know exactly how it sometimes ends up being declared with a value of None.

Queue('compute-worker', Exchange('compute-worker'), routing_key='compute-worker', queue_arguments={'x-max-priority': 10}),

Queue('compute-worker', Exchange('compute-worker'), routing_key='compute-worker', queue_arguments={'x-max-priority': 10}),

Maybe it is declared somewhere else?
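
A quick way to check for other declarations would be a plain text search over both repositories for the priority argument and the queue name, e.g. (paths are just examples):

grep -rn "x-max-priority" src/
grep -rn "Queue('compute-worker'" src/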

@johanneskruse
Author

johanneskruse commented Jun 22, 2024

@Didayolo, @ObadaS - The remote workers seem to have lost connection again; could you reset them? Thanks!

@Didayolo
Member

Didayolo commented Jun 22, 2024

@johanneskruse Done.

Interesting investigation of this problem here:

#1352 (comment)

The problem happens after cancelling a submission.

Didayolo mentioned this issue on Jun 25, 2024
Didayolo linked a pull request on Jun 25, 2024 that will close this issue
@johanneskruse
Author

@Didayolo - workers seem to be disconnected.

@ObadaS
Collaborator

ObadaS commented Jun 26, 2024

@johanneskruse Hello, I fixed it for now.

@Didayolo
Member

@johanneskruse Hi, the good news is that we found a fix! We will deploy it soon in production.

@johanneskruse
Author

@ObadaS - Thanks! I can confirm that they are running.
@Didayolo - Exciting, great job all of you and thanks for your continued support and improvements!
