From ea6ed389a09c275d4b5c7e9bbb304ff276197ee8 Mon Sep 17 00:00:00 2001
From: Gregory Shtrasberg
Date: Mon, 25 Mar 2024 20:03:54 +0000
Subject: [PATCH] Overview of the optional performance features that are yet to be upstreamed

---
 ROCm_performance.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 ROCm_performance.md

diff --git a/ROCm_performance.md b/ROCm_performance.md
new file mode 100644
index 0000000000000..04f3d0fdf4932
--- /dev/null
+++ b/ROCm_performance.md
@@ -0,0 +1,22 @@
+# Overview of the optional performance features unique to https://github.com/ROCm/vllm
+## Multi-GPU torchrun
+On ROCm the default multi-GPU executor is `torchrun`, as opposed to `ray` on NVIDIA.
+This can be overridden with the `--worker-use-ray` flag to vLLM or its benchmarks.
+To utilize torchrun parallelism, the run command should be modified from
+`python <command>`
+to
+`torchrun --standalone --nnodes=1 --nproc-per-node=<parallel processes> <command>`
+## Triton attention
+The default attention on ROCm uses a Triton kernel. To fall back to the https://github.com/ROCm/flash-attention implementation, set the following environment variable:
+`VLLM_USE_FLASH_ATTN_TRITON=False`
+## Tunable ops
+PyTorch tunable ops are supported.
+Define the environment variable `PYTORCH_TUNABLEOP_ENABLED=1` to enable both runtime tuning and the subsequent use of tuned results. To only use the tuned results without tuning any newly encountered shapes, additionally define `PYTORCH_TUNABLEOP_TUNING=0`.
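+## Example invocations
+The commands below are illustrative sketches rather than fixed recipes; the benchmark script, model name, and GPU count are placeholders. For instance, to run the bundled latency benchmark across 8 GPUs with the torchrun executor (one process per tensor-parallel rank):
+`torchrun --standalone --nnodes=1 --nproc-per-node=8 benchmarks/benchmark_latency.py --model <model> -tp 8`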
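+To force the flash-attention backend for a single run, prepend the environment variable to the same placeholder benchmark:
+`VLLM_USE_FLASH_ATTN_TRITON=False python benchmarks/benchmark_latency.py --model <model>`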
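+For tunable ops, one possible workflow is to tune on a first run and reuse the results on subsequent runs; `PYTORCH_TUNABLEOP_FILENAME` is PyTorch's standard variable for where tuned results are stored, shown here with its default value:
+`PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv python benchmarks/benchmark_latency.py --model <model>`
+`PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv python benchmarks/benchmark_latency.py --model <model>`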