
Backend worker did not respond in given time #537

Open
payamahmadvand-stemcell opened this issue Jan 15, 2025 · 2 comments

@payamahmadvand-stemcell

I encountered an issue while attempting to segment multiple images in a loop. The error I am receiving is:

"Backend worker did not respond in the given time."

org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

It appears that GPU memory is not being cleared properly: the amount of free GPU memory shrinks with each image processed.
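
To make the growth visible, I print CUDA memory after each image with a small helper along these lines (the helper below is illustrative, not the exact code from my pipeline); the allocated number keeps climbing from one image to the next:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Report how much CUDA memory PyTorch currently holds, so growth
    # across loop iterations is easy to spot.
    alloc = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag}: {alloc:.1f} MiB allocated, {reserved:.1f} MiB reserved")

# called as log_gpu_memory(f"after image {i}") at the end of each loop iteration
```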

Interestingly, the code was functioning correctly before the latest commits made on December 12, 2024, and December 15, 2024. These recent changes might have introduced the problem.

Endpoint log (log stream AllTraffic/i-098f59f854f0cab5b):

2025-01-15T18:16:06,223 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2025-01-15T18:16:06,223 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:247) [model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]

@payamahv

I have already done all of the following (roughly as in the sketch after this list) and am still getting the error after processing a few large images:

Explicitly clear GPU memory: Use torch.cuda.empty_cache() to clear the GPU memory after processing each image.

Delete variables: Delete any variables holding large tensors with del once they are no longer needed.

Use with torch.no_grad(): Wrap your inference code with torch.no_grad() to prevent PyTorch from storing intermediate values for backpropagation, which can save memory.
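
For reference, the per-image cleanup I am doing looks roughly like this (model, preprocess, and image_paths are placeholders for the actual segmentation pipeline, not the real code):

```python
import gc
import torch

for path in image_paths:                  # image_paths: list of files to segment (placeholder)
    with torch.no_grad():                 # don't keep autograd buffers during inference
        output = model(preprocess(path))  # model/preprocess stand in for the real pipeline
        result = output.detach().cpu()    # move what I need off the GPU
    del output                            # drop references to GPU tensors
    gc.collect()
    torch.cuda.empty_cache()              # hand cached blocks back to the driver
```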

@payamahmadvand-stemcell
Author

It seems this has been reported here before:
#258
