
Backend worker did not respond in given time #537

Open
payamahmadvand-stemcell opened this issue Jan 15, 2025 · 2 comments

@payamahmadvand-stemcell

I encountered an issue while attempting to segment multiple images in a loop. The error I am receiving is:

"Backend worker did not respond in the given time."

org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

It appears that GPU memory is not being cleared properly: the amount of free GPU memory shrinks with each image processed.
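
To make the growth visible, I print CUDA memory after each image with a small helper along these lines (the helper below is illustrative, not the exact code from my pipeline); the allocated number keeps climbing from one image to the next:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Report how much CUDA memory PyTorch currently holds, so growth
    # across loop iterations is easy to spot.
    alloc = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag}: {alloc:.1f} MiB allocated, {reserved:.1f} MiB reserved")

# called as log_gpu_memory(f"after image {i}") at the end of each loop iteration
```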

Interestingly, the code was functioning correctly before the latest commits made on December 12, 2024, and December 15, 2024. These recent changes might have introduced the problem.

Endpoint log (log stream AllTraffic/i-098f59f854f0cab5b):

2025-01-15T18:16:06,223 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Number or consecutive unsuccessful inference 1
2025-01-15T18:16:06,223 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker did not respond in given time
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:247) [model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:840) [?:?]

@payamahv

I have already done all of the following (roughly as in the sketch after this list) and am still getting the error after processing a few large images:

Explicitly clear GPU memory: Use torch.cuda.empty_cache() to clear the GPU memory after processing each image.

Delete variables: Delete any variables holding large tensors with del once they are no longer needed.

Use with torch.no_grad(): Wrap your inference code with torch.no_grad() to prevent PyTorch from storing intermediate values for backpropagation, which can save memory.
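
For reference, the per-image cleanup I am doing looks roughly like this (model, preprocess, and image_paths are placeholders for the actual segmentation pipeline, not the real code):

```python
import gc
import torch

for path in image_paths:                  # image_paths: list of files to segment (placeholder)
    with torch.no_grad():                 # don't keep autograd buffers during inference
        output = model(preprocess(path))  # model/preprocess stand in for the real pipeline
        result = output.detach().cpu()    # move what I need off the GPU
    del output                            # drop references to GPU tensors
    gc.collect()
    torch.cuda.empty_cache()              # hand cached blocks back to the driver
```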

@payamahmadvand-stemcell
Author

It seems this has been reported here before:
#258
