Describe the issue
I am running a Swin Transformer backbone with onnxruntime in Python. Inference latency is normal in the default sequential execution mode, but after I switch the execution mode to ORT_PARALLEL, inference becomes much slower than before.
Profiling shows that no operations actually run in parallel. Instead, operations are spread across different threads, with large amounts of idle time inserted between them.
Does anyone know what is causing this problem?
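For context, ONNX Runtime's built-in profiler can produce the kind of trace described above; this is a minimal sketch of enabling it (enable_profiling and end_profiling are the standard onnxruntime Python APIs, and the output path is whatever ORT generates):

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
sess_options.enable_profiling = True  # emit a JSON trace viewable in chrome://tracing

session = ort.InferenceSession("backbone.onnx", sess_options, providers=["CUDAExecutionProvider"])
# ... run a few inferences as in the repro below ...
trace_path = session.end_profiling()  # returns the path of the generated trace file
print(trace_path)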
To reproduce
import time

import onnxruntime as ort
from mmengine.config import Config
from mmengine.runner import Runner

model_path = "backbone.onnx"

sess_options = ort.SessionOptions()
sess_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
session = ort.InferenceSession(model_path, sess_options, providers=["CUDAExecutionProvider"])

# Build the MMEngine runner only to reuse its validation dataloader and preprocessor.
cfg = Config.fromfile("maskrcnn.py")
runner = Runner.from_cfg(cfg)

output_name = session.get_outputs()[0].name
data_iter = iter(runner.val_dataloader)

latencies = []
for i in range(200):
    print(f"{i + 1}/200")
    batch = next(data_iter)
    data = runner.model.data_preprocessor(batch, False)
    # Time only the ONNX Runtime forward pass.
    start = time.perf_counter()
    outputs = session.run([output_name], {"input": data["inputs"].cpu().numpy()})
    latencies.append(time.perf_counter() - start)

print("mean latency (s):", sum(latencies) / len(latencies))
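For comparison, the sequential configuration that runs at normal speed differs only in the execution mode. The thread-pool settings below are the SessionOptions knobs relevant to ORT_PARALLEL; the counts shown are illustrative values, not ones confirmed to matter here:

# Sequential baseline: identical setup except for the execution mode (this is the default).
seq_options = ort.SessionOptions()
seq_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
seq_session = ort.InferenceSession(model_path, seq_options, providers=["CUDAExecutionProvider"])

# Thread-pool knobs that apply under ORT_PARALLEL; counts here are illustrative.
par_options = ort.SessionOptions()
par_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
par_options.inter_op_num_threads = 4  # threads for running independent nodes concurrently
par_options.intra_op_num_threads = 8  # threads used within a single node's kernel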
Urgency
No response
Platform
Linux
OS Version
Red Hat Enterprise Linux release 8.10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8
Model File
No response
Is this a quantized model?
No