From ea6ed389a09c275d4b5c7e9bbb304ff276197ee8 Mon Sep 17 00:00:00 2001
From: Gregory Shtrasberg
Date: Mon, 25 Mar 2024 20:03:54 +0000
Subject: [PATCH] Overview of the optional performance features that are yet to be upstreamed

---
 ROCm_performance.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 ROCm_performance.md

diff --git a/ROCm_performance.md b/ROCm_performance.md
new file mode 100644
index 0000000000000..04f3d0fdf4932
--- /dev/null
+++ b/ROCm_performance.md
@@ -0,0 +1,22 @@
+# Overview of the optional performance features unique to https://github.com/ROCm/vllm
+## Multi-GPU torchrun
+On ROCm the default multi-GPU executor is `torchrun`, as opposed to `ray` on NVIDIA.
+This can be overridden with the `--worker-use-ray` flag to vLLM or its benchmarks.
+To utilize torchrun parallelism, the run command should be modified from
+`python <command>`
+to
+`torchrun --standalone --nnodes=1 --nproc-per-node=<parallel processes> <command>`
+## Triton attention
+The default attention on ROCm uses a Triton kernel. To fall back to the https://github.com/ROCm/flash-attention implementation, set the following environment variable:
+`VLLM_USE_FLASH_ATTN_TRITON=False`
+## Tunable ops
+PyTorch tunable ops are supported.
+Define the environment variable `PYTORCH_TUNABLEOP_ENABLED=1` to enable both runtime tuning and the subsequent use of tuned results. To only use the tuned results without tuning any newly encountered shapes, additionally define `PYTORCH_TUNABLEOP_TUNING=0`.
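+## Example invocations
+The commands below are illustrative sketches rather than fixed recipes; the benchmark script, model name, and GPU count are placeholders. For instance, to run the bundled latency benchmark across 8 GPUs with the torchrun executor (one process per tensor-parallel rank):
+`torchrun --standalone --nnodes=1 --nproc-per-node=8 benchmarks/benchmark_latency.py --model <model> -tp 8`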
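+To force the flash-attention backend for a single run, prepend the environment variable to the same placeholder benchmark:
+`VLLM_USE_FLASH_ATTN_TRITON=False python benchmarks/benchmark_latency.py --model <model>`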
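+For tunable ops, one possible workflow is to tune on a first run and reuse the results on subsequent runs; `PYTORCH_TUNABLEOP_FILENAME` is PyTorch's standard variable for where tuned results are stored, shown here with its default value:
+`PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv python benchmarks/benchmark_latency.py --model <model>`
+`PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv python benchmarks/benchmark_latency.py --model <model>`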