
Remote workers don't receive submissions #1455

Closed

johanneskruse opened this issue May 24, 2024 · 35 comments

Comments
@johanneskruse

johanneskruse commented May 24, 2024

Dear Codabench team,

My remote workers have stopped receiving any traffic. Is there an explanation (a recent update, etc.)? From one day to the next, submissions are not being processed: everything was fine on May 23, 2024, but the workers stopped working on May 24, 2024.

I have multiple remote workers, and I do see that they are connected when turning on/off:

[2024-05-24 19:50:31,650: INFO/MainProcess] missed heartbeat from compute-worker@94d6bfcb71122
[2024-05-24 19:55:00,922: INFO/MainProcess] sync with compute-worker@6e369caec6052

When using the default CPU queue, I can run submissions; however, due to the 20-minute limit, I have to use my own remote workers.

Link to Ekstra Bladet News Recommendation Competition

Best,
Johannes

@ihsaan-ullah
Collaborator

ihsaan-ullah commented May 25, 2024

Hi @johanneskruse

This is strange. We haven't changed anything that should break this. Can you please confirm the following:

  1. Do your submissions work on the default queue?
  2. Are your workers listening, and is the broker URL of your queue configured there?
  3. Do you have GPU or CPU workers?
  4. Are you rerunning the submission or resubmitting?

@johanneskruse
Author

johanneskruse commented May 25, 2024 via email

@ihsaan-ullah
Collaborator

Where are your workers hosted? Are you using google cloud or another service?

@johanneskruse
Author

I am using Amazon Web Services (AWS) to run my remote workers on t3.xlarge instances (https://aws.amazon.com/ec2/instance-types/)

@ihsaan-ullah
Collaborator

Can you do the following to see if this helps:

  1. Create another queue on Codabench.
  2. Stop your compute worker.
  3. Change the broker URL in your worker's .env file.
  4. Remove the compute worker image on your worker machine.
  5. Create the worker again by following the instructions.
  6. Submit a new submission.
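The steps above might look like this in the shell. The broker URL below is a placeholder for the URL shown on the new queue's page, and the docker commands are shown as comments since they need the Docker daemon:

```shell
# Sketch of steps 2-5 above. The URL below is a placeholder, not a real queue.
NEW_URL='pyamqp://user:pass@www.codabench.org:5672/new-vhost-id'

# Steps 2/4 (need Docker, shown for reference):
#   docker stop compute_worker && docker rm compute_worker
#   docker rmi codalab/competitions-v2-compute-worker

# Step 3: point BROKER_URL in the worker's .env at the new queue.
ENV_FILE=.env
[ -f "$ENV_FILE" ] || printf 'BROKER_URL=old-queue-url\n' > "$ENV_FILE"  # stand-in file
sed -i "s|^BROKER_URL=.*|BROKER_URL=${NEW_URL}|" "$ENV_FILE"
grep '^BROKER_URL=' "$ENV_FILE"

# Step 5 (needs Docker): recreate the worker as in the setup wiki:
#   docker run -d --env-file .env --name compute_worker \
#     -v /codabench:/codabench -v /var/run/docker.sock:/var/run/docker.sock \
#     --restart unless-stopped codalab/competitions-v2-compute-worker
```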

@ihsaan-ullah
Collaborator

If this does not work, create a compute worker on Google Cloud and see if that works.

@johanneskruse
Author

I have tried to follow your steps; unfortunately, it didn't change anything. I assume Codabench is run on Google Cloud. What are you running?

@ihsaan-ullah
Collaborator

Have you tried a Google Cloud worker?

@johanneskruse
Author

So far I've just been working with AWS and their cloud workers

@ihsaan-ullah
Collaborator

Please try Google Cloud; that should work. I am not sure what the problem with AWS is. Or contact AWS support; maybe they can help you.

@johanneskruse
Author

johanneskruse commented May 26, 2024

Just to check if there is anything unusual. These are the output logs when starting my remote worker:

~$ docker logs -f compute_worker

/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(
 
 -------------- compute-worker@1f29b478b104 v4.4.0 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-5.15.0-1057-aws-x86_64-with-glibc2.34 2024-05-26 06:40:33
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x7fcc8a308a30
- ** ---------- .> transport:   amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> compute-worker   exchange=compute-worker(direct) key=compute-worker
                

[tasks]
  . compute_worker_run

[2024-05-26 06:40:34,247: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
[2024-05-26 06:40:34,399: INFO/MainProcess] mingle: searching for neighbors
[2024-05-26 06:40:35,774: INFO/MainProcess] mingle: all alone
[2024-05-26 06:40:36,142: INFO/MainProcess] compute-worker@1f29b478b104 ready.

@johanneskruse
Author

Also, have you made a guide for setting this up using Google Cloud Compute?

@johanneskruse
Author

And just for reference, this happened for all my remote workers:

[Screenshot from 2024-05-26 omitted]

@ihsaan-ullah
Collaborator

Logs look good.

For Google Cloud the guidelines are the same. Go to the Google Cloud console, select VM instances, and click Create Instance (choose the storage, memory, and location of the VM, and allow HTTP/HTTPS traffic). You can access the VM through the shell provided in the console. The rest of the setup is the same.
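(For reference, the console steps above map to a single `gcloud` command. The instance name, zone, machine type, and image below are placeholder assumptions, not Codabench requirements; the command is echoed so this sketch is safe to run without a GCP project.)

```shell
# Hypothetical gcloud equivalent of the console steps: create a VM,
# choose size/location, and allow HTTP/HTTPS traffic via network tags.
CMD="gcloud compute instances create codabench-worker \
  --zone=europe-west1-b \
  --machine-type=e2-standard-4 \
  --image-family=ubuntu-2204-lts --image-project=ubuntu-os-cloud \
  --tags=http-server,https-server"
echo "$CMD"   # drop the echo (and run the command directly) to execute for real
```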

@johanneskruse
Author

I've set up a remote worker using Google Cloud; unfortunately, this hasn't changed anything.

@ihsaan-ullah
Collaborator

Please repeat the Google Cloud setup with a new queue.

@johanneskruse
Author

johanneskruse commented May 27, 2024

With a new queue I get the following:

[2024-05-27 05:03:13,245: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/ff428499-5eab-44cf-8c25-6f2680a0a956
[2024-05-27 05:03:13,966: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 05:03:17,346: INFO/MainProcess] mingle: all alone
[2024-05-27 05:03:17,916: CRITICAL/MainProcess] Unrecoverable error: PreconditionFailed(406, "PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none", (50, 10), 'Queue.declare')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/tasks.py", line 40, in start
    c.task_consumer = c.app.amqp.TaskConsumer(
  File "/usr/local/lib/python3.8/site-packages/celery/app/amqp.py", line 301, in TaskConsumer
    return self.Consumer(
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 386, in __init__
    self.revive(self.channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 408, in revive
    self.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 421, in declare
    queue.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 611, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 620, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 648, in queue_declare
    ret = channel.queue_declare(
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1148, in queue_declare
    return queue_declare_ok_t(*self.wait(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 88, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none

Do you have any suggestions? I'm not sure whether the problem is on Codabench's side or Google Cloud's.
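(For context: this 406 is RabbitMQ refusing a queue redeclare. The vhost already holds a 'compute-worker' queue declared without the x-max-priority argument, while the Celery worker declares it with x-max-priority=10, and RabbitMQ requires the arguments to be equivalent. One way out, sketched below under the assumption that someone has admin access to the Codabench broker — the credentials are placeholders, and ordinary organizers likely do not have this access — is to delete the stale queue so the worker can recreate it with matching arguments:)

```shell
# Hypothetical cleanup via rabbitmqadmin, assuming broker admin access.
# The vhost is taken from the traceback above; user/pass are placeholders.
VHOST='ff428499-5eab-44cf-8c25-6f2680a0a956'
CLEANUP="rabbitmqadmin --host=www.codabench.org --vhost=$VHOST \
  --username=<admin-user> --password=<admin-pass> \
  delete queue name=compute-worker"
echo "$CLEANUP"   # echoed for safety; remove the echo to actually run it
```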

@ihsaan-ullah
Collaborator

Not sure what is happening there.

Can you please list the steps you follow to set up a worker and link it to your queue? @Didayolo do you have any idea?

By the way, I was recently using Google Cloud workers for a competition and I haven't faced any issue like this.

@johanneskruse
Author

I follow the guide you have provided step-by-step, on both AWS and Google Cloud.

  1. Create an instance
  2. docker pull codalab/competitions-v2-compute-worker
  3. Make a .env file
  4. Get the broker URL and copy it into the .env file
  5. Run the docker run \ (...) command

Everything had been working perfectly until Friday.

@Didayolo
Member

Everything had been working perfectly until Friday.

You mean that you were able to process submissions on your workers in the past, and it stopped working?

@UppalAnshuk

Yes, that's what has happened at our end. Taking over from @johanneskruse .

@UppalAnshuk

Adding a few details: I've set up a new, smaller version of the competition. Submissions for this dummy competition run on the default queue without any issues, but when I attach the new queue to a GCP worker, the submissions are stuck on "Submitting", the worker gets no traffic, and the logs remain unchanged.

The setup of the GCP worker is the same as @ihsaan-ullah provided above:

Logs look good.

For Google cloud the guidelines are the same. You have to go to google cloud console. Select VM instances. Click Create Instance (select storage, memory, location of VM, allow http/https traffic). You can access the VM through a shell provided on the console.

We are generally following the steps given here for CPU workers - https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup

Thanks

@julianevanneeleman

I'm running into this issue too, which can be reproduced by creating a competition using the example in https://github.com/codalab/competition-examples/tree/master/codabench/wheat_seeds.

  • Add a private queue in the Queue management tab
  • Add a queue field to the competition.yaml file in the example folder, with the value set to the Vhost of the queue
  • Run zip -r comp.zip . inside the example folder
  • Upload comp.zip in https://www.codabench.org/competitions/upload/

This creates a benchmark, with the private queue configured as indicated by the GUI.

Following the steps for running a compute worker (https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup), create a .env like this:

# Queue URL
BROKER_URL=<REDACTED>

# Location to store submissions/cache -- absolute path!
HOST_DIRECTORY=/codabench

# If SSL isn't enabled, then comment or remove the following line
BROKER_USE_SSL=True

And spin up a worker locally:

docker run \                                          
    -v /codabench:/codabench \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -d \
    --env-file .env \
    --name compute_worker \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v2-compute-worker:latest

This container runs with the following logs:

/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(
 
 -------------- compute-worker@11bfd8fffc9e v4.4.0 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-6.5.0-1023-oem-x86_64-with-glibc2.34 2024-05-27 10:48:23
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x76d8c5837a30
- ** ---------- .> transport:   amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> compute-worker   exchange=compute-worker(direct) key=compute-worker
                

[tasks]
  . compute_worker_run

[2024-05-27 10:48:23,877: INFO/MainProcess] Connected to amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
[2024-05-27 10:48:24,315: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 10:48:25,795: INFO/MainProcess] mingle: all alone
[2024-05-27 10:48:26,267: INFO/MainProcess] compute-worker@11bfd8fffc9e ready.

Then, after submitting the sample submission sample_code_submission.zip, it shows up in https://www.codabench.org/server_status:

[Screenshot of server_status omitted]

It hangs from here.

@julianevanneeleman

julianevanneeleman commented May 28, 2024

After reading #1457 and running the steps in my previous post without setting up a custom queue, I can confirm that even the default queue does not process jobs anymore.

Are all competitions currently halted?

@Didayolo
Member

Didayolo commented May 28, 2024

@julianevanneeleman

Indeed, we have some issues with the default queue. Right now it is working again, but we are actively investigating the problem to avoid it happening again. You can follow this second problem here: #1446

@Didayolo
Member

On my side, I am not able to reproduce the problem. I tried several custom queues and several workers, and they receive and compute submissions without problems.

Can you retry? Maybe it was linked to other problems on the platform (queue congestion, etc.).

@johanneskruse
Author

My workers are now receiving submissions again! I haven't changed or done anything on my end.

Thank you for the swift action; I hope it stays stable now. I will follow #1446 closely.

@Didayolo
Member

Everything is probably linked. I'm closing this issue and keeping the other one open.

@johanneskruse
Author

johanneskruse commented Jun 8, 2024

Hi @Didayolo, it is happening again. None of my workers are receiving any submissions; they are all in limbo. No error logs.

Is there an explanation?

[Screenshot omitted]

@Didayolo
Member

Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status

I don't know the reason. I'll investigate this.

@johanneskruse
Author

Hi @Didayolo, thank you for getting back!

Is there anything we can do in the meantime, or any way we can help debug this issue?

@johanneskruse
Author

johanneskruse commented Jun 10, 2024

I started a new issue, #1473, with more error logs.

@johanneskruse
Author

johanneskruse commented Jun 10, 2024

Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status

I don't know the reason. I'll investigate this.

I'm now able to access the server_status; however, it doesn't register the newly assigned queue, and when using RecSys24_competition_v4, I get the error from #1473.

[Screenshot omitted]

@johanneskruse
Author

I tried to remove RecSys24_competition_v4. Now the queue is always *.

@johanneskruse
Author

It's now showing the new queue, but there is still no connection to the worker:

[Screenshot omitted]
