
Remote workers don't receive submissions #1455

Closed

johanneskruse opened this issue May 24, 2024 · 35 comments

Comments
@johanneskruse

johanneskruse commented May 24, 2024

Dear Codabench team,

My remote workers have stopped receiving any traffic. Is there an explanation (a recent update, etc.)? From one day to the next, submissions are not being processed: everything was fine on May 23, 2024, but the workers stopped working on May 24, 2024.

I have multiple remote workers, and I do see that they are connected when turning on/off:

[2024-05-24 19:50:31,650: INFO/MainProcess] missed heartbeat from compute-worker@94d6bfcb71122
[2024-05-24 19:55:00,922: INFO/MainProcess] sync with compute-worker@6e369caec6052

When using the default CPU queue, I can run submissions; however, due to the 20-minute limit, I have to use my own remote workers.

Link to Ekstra Bladet News Recommendation Competition

Best,
Johannes

@ihsaan-ullah
Collaborator

ihsaan-ullah commented May 25, 2024

Hi @johanneskruse

This is strange. We haven't changed anything that should break this. Can you please confirm the following:

  1. Do your submissions work on the default queue?
  2. Are your workers listening, and is the broker URL of your queue configured there?
  3. Do you have GPU or CPU workers?
  4. Are you rerunning the submission or resubmitting?

@johanneskruse
Author

johanneskruse commented May 25, 2024 via email

@ihsaan-ullah
Collaborator

Where are your workers hosted? Are you using google cloud or another service?

@johanneskruse
Author

I am using Amazon Web Services (AWS) to run my remote workers on t3.xlarge instances (https://aws.amazon.com/ec2/instance-types/)

@ihsaan-ullah
Collaborator

Can you do the following to see if this helps:

  1. Create another queue on Codabench.
  2. Stop your compute worker.
  3. Change the broker URL in your worker's .env file.
  4. Remove the compute worker image on your worker machine.
  5. Create the worker again by following the instructions.
  6. Submit a new submission.
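The steps above might look like this in the shell. The broker URL below is a placeholder for the URL shown on the new queue's page, and the docker commands are shown as comments since they need the Docker daemon:

```shell
# Sketch of steps 2-5 above. The URL below is a placeholder, not a real queue.
NEW_URL='pyamqp://user:pass@www.codabench.org:5672/new-vhost-id'

# Steps 2/4 (need Docker, shown for reference):
#   docker stop compute_worker && docker rm compute_worker
#   docker rmi codalab/competitions-v2-compute-worker

# Step 3: point BROKER_URL in the worker's .env at the new queue.
ENV_FILE=.env
[ -f "$ENV_FILE" ] || printf 'BROKER_URL=old-queue-url\n' > "$ENV_FILE"  # stand-in file
sed -i "s|^BROKER_URL=.*|BROKER_URL=${NEW_URL}|" "$ENV_FILE"
grep '^BROKER_URL=' "$ENV_FILE"

# Step 5 (needs Docker): recreate the worker as in the setup wiki:
#   docker run -d --env-file .env --name compute_worker \
#     -v /codabench:/codabench -v /var/run/docker.sock:/var/run/docker.sock \
#     --restart unless-stopped codalab/competitions-v2-compute-worker
```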

@ihsaan-ullah
Collaborator

If this does not work, create a compute worker on Google Cloud and see if that works.

@johanneskruse
Author

I have tried to follow your steps; unfortunately, it didn't change anything. I assume Codabench is run on Google Cloud. What are you running?

@ihsaan-ullah
Collaborator

Have you tried a Google Cloud worker?

@johanneskruse
Author

So far I've just been working with AWS and their cloud workers

@ihsaan-ullah
Collaborator

Please try Google Cloud; that should work. I am not sure what the problem with AWS is. Or contact AWS support; maybe they can help you.

@johanneskruse
Author

johanneskruse commented May 26, 2024

Just to check if there is anything unusual. These are the output logs when starting my remote worker:

~$ docker logs -f compute_worker

/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(
 
 -------------- compute-worker@1f29b478b104 v4.4.0 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-5.15.0-1057-aws-x86_64-with-glibc2.34 2024-05-26 06:40:33
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x7fcc8a308a30
- ** ---------- .> transport:   amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> compute-worker   exchange=compute-worker(direct) key=compute-worker
                

[tasks]
  . compute_worker_run

[2024-05-26 06:40:34,247: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
[2024-05-26 06:40:34,399: INFO/MainProcess] mingle: searching for neighbors
[2024-05-26 06:40:35,774: INFO/MainProcess] mingle: all alone
[2024-05-26 06:40:36,142: INFO/MainProcess] compute-worker@1f29b478b104 ready.

@johanneskruse
Author

Also, have you made a guide for setting this up using Google Cloud Compute?

@johanneskruse
Author

And just for reference, this happened for all my remote workers:

[Screenshot from 2024-05-26 omitted]

@ihsaan-ullah
Collaborator

Logs look good.

For Google Cloud the guidelines are the same. Go to the Google Cloud console, select VM instances, and click Create Instance (choose the storage, memory, and location of the VM, and allow HTTP/HTTPS traffic). You can access the VM through the shell provided in the console. The rest of the setup is the same.
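(For reference, the console steps above map to a single `gcloud` command. The instance name, zone, machine type, and image below are placeholder assumptions, not Codabench requirements; the command is echoed so this sketch is safe to run without a GCP project.)

```shell
# Hypothetical gcloud equivalent of the console steps: create a VM,
# choose size/location, and allow HTTP/HTTPS traffic via network tags.
CMD="gcloud compute instances create codabench-worker \
  --zone=europe-west1-b \
  --machine-type=e2-standard-4 \
  --image-family=ubuntu-2204-lts --image-project=ubuntu-os-cloud \
  --tags=http-server,https-server"
echo "$CMD"   # drop the echo (and run the command directly) to execute for real
```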

@johanneskruse
Author

I've set up a remote worker using Google Cloud; unfortunately, this hasn't changed anything.

@ihsaan-ullah
Collaborator

Please repeat the Google Cloud setup with a new queue.

@johanneskruse
Author

johanneskruse commented May 27, 2024

With a new queue I get the following:

[2024-05-27 05:03:13,245: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/ff428499-5eab-44cf-8c25-6f2680a0a956
[2024-05-27 05:03:13,966: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 05:03:17,346: INFO/MainProcess] mingle: all alone
[2024-05-27 05:03:17,916: CRITICAL/MainProcess] Unrecoverable error: PreconditionFailed(406, "PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none", (50, 10), 'Queue.declare')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/tasks.py", line 40, in start
    c.task_consumer = c.app.amqp.TaskConsumer(
  File "/usr/local/lib/python3.8/site-packages/celery/app/amqp.py", line 301, in TaskConsumer
    return self.Consumer(
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 386, in __init__
    self.revive(self.channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 408, in revive
    self.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 421, in declare
    queue.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 611, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 620, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 648, in queue_declare
    ret = channel.queue_declare(
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1148, in queue_declare
    return queue_declare_ok_t(*self.wait(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 88, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none

Do you have any suggestions? I'm not sure whether the problem is on Codabench's side or Google Cloud's.
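(For context: this 406 is RabbitMQ refusing a queue redeclare. The vhost already holds a 'compute-worker' queue declared without the x-max-priority argument, while the Celery worker declares it with x-max-priority=10, and RabbitMQ requires the arguments to be equivalent. One way out, sketched below under the assumption that someone has admin access to the Codabench broker — the credentials are placeholders, and ordinary organizers likely do not have this access — is to delete the stale queue so the worker can recreate it with matching arguments:)

```shell
# Hypothetical cleanup via rabbitmqadmin, assuming broker admin access.
# The vhost is taken from the traceback above; user/pass are placeholders.
VHOST='ff428499-5eab-44cf-8c25-6f2680a0a956'
CLEANUP="rabbitmqadmin --host=www.codabench.org --vhost=$VHOST \
  --username=<admin-user> --password=<admin-pass> \
  delete queue name=compute-worker"
echo "$CLEANUP"   # echoed for safety; remove the echo to actually run it
```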

@ihsaan-ullah
Collaborator

Not sure what is happening there.

Can you please list the steps you follow to set up a worker and link it to your queue? @Didayolo do you have any idea?

By the way, I was recently using Google Cloud workers for a competition and I haven't faced any issue like this.

@johanneskruse
Author

I follow the guide you have provided step-by-step, on both AWS and Google Cloud.

  1. Create an instance
  2. docker pull codalab/competitions-v2-compute-worker
  3. Make a .env file
  4. Get the broker URL and copy it into the .env file
  5. Run the docker run \ (...) command

Everything had been working perfectly until Friday.

@Didayolo
Member

Everything had been working perfectly until Friday.

You mean that you were able to process submissions on your workers in the past, and it stopped working?

@UppalAnshuk

Yes, that's what has happened at our end. Taking over from @johanneskruse .

@UppalAnshuk

Adding a few details: I've set up a new, smaller version of the competition. Submissions for this dummy competition run on the default queue without any issues, but when I attach the new queue to a GCP worker, the submissions are stuck on "Submitting", the worker gets no traffic, and the logs remain unchanged.

The setup of the GCP worker is the same as @ihsaan-ullah provided above:

Logs look good.

For Google cloud the guidelines are the same. You have to go to google cloud console. Select VM instances. Click Create Instance (select storage, memory, location of VM, allow http/https traffic). You can access the VM through a shell provided on the console.

We are generally following the steps given here for CPU workers - https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup

Thanks

@julianevanneeleman

I'm running into this issue too, which can be reproduced by creating a competition using the example in https://github.com/codalab/competition-examples/tree/master/codabench/wheat_seeds.

  • Add a private queue in the Queue management tab
  • Add a queue field to the competition.yaml file in the example folder, with the value set to the Vhost of the queue
  • Run zip -r comp.zip . inside the example folder
  • Upload comp.zip in https://www.codabench.org/competitions/upload/

This creates a benchmark, with the private queue configured as indicated by the GUI.

Following the steps for running a compute worker (https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup), create a .env like this:

# Queue URL
BROKER_URL=<REDACTED>

# Location to store submissions/cache -- absolute path!
HOST_DIRECTORY=/codabench

# If SSL isn't enabled, then comment or remove the following line
BROKER_USE_SSL=True

And spin up a worker locally:

docker run \                                          
    -v /codabench:/codabench \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -d \
    --env-file .env \
    --name compute_worker \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v2-compute-worker:latest

This container runs with the following logs:

/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(
 
 -------------- compute-worker@11bfd8fffc9e v4.4.0 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-6.5.0-1023-oem-x86_64-with-glibc2.34 2024-05-27 10:48:23
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x76d8c5837a30
- ** ---------- .> transport:   amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> compute-worker   exchange=compute-worker(direct) key=compute-worker
                

[tasks]
  . compute_worker_run

[2024-05-27 10:48:23,877: INFO/MainProcess] Connected to amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
[2024-05-27 10:48:24,315: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 10:48:25,795: INFO/MainProcess] mingle: all alone
[2024-05-27 10:48:26,267: INFO/MainProcess] compute-worker@11bfd8fffc9e ready.

Then, after submitting the sample submission sample_code_submission.zip, it shows up in https://www.codabench.org/server_status:

[Screenshot of server_status omitted]

It hangs from here.

@julianevanneeleman

julianevanneeleman commented May 28, 2024

After reading #1457 and running the steps in my previous post without setting up a custom queue, I can confirm that even the default queue does not process jobs anymore.

Are all competitions currently halted?

@Didayolo
Member

Didayolo commented May 28, 2024

@julianevanneeleman

Indeed, we have some issues with the default queue. Right now it is working again, but we are actively investigating the problem to avoid it happening again. You can follow this second problem here: #1446

@Didayolo
Member

On my side, I am not able to reproduce the problem. I tried several custom queues and several workers, and they receive and compute submissions without problems.

Can you retry? Maybe it was linked to other problems on the platform (queue congestion, etc.).

@johanneskruse
Author

My workers are now receiving submissions again! I haven't changed or done anything on my end.

Thank you for the swift action; I hope it stays stable now. I will follow #1446 closely.

@Didayolo
Member

Everything is probably linked. I'm closing this issue and keeping the other one open.

@johanneskruse
Author

johanneskruse commented Jun 8, 2024

Hi @Didayolo, it is happening again. None of my workers are receiving any submissions; they are all in limbo. No error logs.

Is there an explanation?

[Screenshot omitted]

@Didayolo
Member

Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status

I don't know the reason. I'll investigate this.

@johanneskruse
Author

Hi @Didayolo, thank you for getting back!

Is there anything we can do in the meantime, or any way we can help debug this issue?

@johanneskruse
Author

johanneskruse commented Jun 10, 2024

I started a new issue, #1473, with more error logs.

@johanneskruse
Author

johanneskruse commented Jun 10, 2024

Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status

I don't know the reason. I'll investigate this.

I'm now able to access the server_status; however, it doesn't register the newly assigned queue, and when using RecSys24_competition_v4, I get the error from #1473.

[Screenshot omitted]

@johanneskruse
Author

I tried to remove RecSys24_competition_v4. Now the queue is always *.

@johanneskruse
Author

It's now showing the new queue, but there is still no connection to the worker:

[Screenshot omitted]
