Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft of ORT GPU build #5622

Open
wants to merge 24 commits into
base: master
Choose a base branch
from
Open

Conversation

ChSonnabend
Copy link
Collaborator

This is a draft PR to discuss possible changes to onnxruntime.sh for GPU builds on the EPN's and potentially CUDA (to be tested)

@ChSonnabend
Copy link
Collaborator Author

Ping @davidrohr

Copy link
Contributor

@davidrohr davidrohr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Du solltest ein paar Umgebungsvariable, die wir in o2.sh nutzen, auch mitaufnehmen und analog behandeln: https://github.com/alisw/alidist/blob/1916f6d88d42959097998d9481b517dc1c1ea84d/o2.sh#L191C9-L191C30

  • ALIBUILD_O2_FORCE_GPU
  • DISABLE_GPU
  • ALIBUILD_ENABLE_CUDA
  • ALIBUILD_ENABLE_HIP
  • ALIBUILD_O2_OVERRIDE_HIP_ARCHS
  • ALIBUILD_O2_OVERRIDE_CUDA_ARCHS

Wenn ENABLE_CUDA oder ENABLE_HIP gesetzt ist, sollte der build fehlschlagen, wenn er CUDA/HIP nicht bauen kann.

onnxruntime.sh Outdated
"
elif command -v nvcc >/dev/null 2>&1; then
CUDA_VERSION=$(nvcc --version | grep "release" | awk '{print $NF}' | cut -d. -f1)
if [[ "$CUDA_VERSION" == "V11" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glaube CUDA 11 kannst du weglassen, und nur >=12 annehmen

onnxruntime.sh Outdated
ORT_BUILD_FLAGS=""
case $ARCHITECTURE in
osx_*)
if [[ $ARCHITECTURE == *_x86-64 ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solche printouts würde ich weglassen, das ist ja hauptsächlich für debugging

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ja, aber ich nehme an das es auch einen macOS build gibt der die Mac GPU anspricht. Da muss ich nochmal ein bisschen rumsuchen, dann könnte man den if-Block nämlich nehmen um da die build flags rein zu packen. Aber ja, die print-outs nehm ich am Ende natürlich noch raus

onnxruntime.sh Outdated
fi
;;
*)
if command -v rocminfo >/dev/null 2>&1; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • rocm version check fehlt
  • Es ist nicht klar, ob rocminfo im Pfad liegt. Du solltest zumindest /opt/rocm/bin/rocminfo testen. Und dann ist migraphx ein separates ROCm paket. Sprich, wenn rocminfo vorhanden ist, heist das noch nicht, das migraphx vorhanden ist. Du solltest explicit auf migraphx testen.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, das check ich nochmal

onnxruntime.sh Outdated
ORT_BUILD_FLAGS=" -Donnxruntime_USE_CUDA=ON \
-DCUDA_TOOLKIT_ROOT_DIR=$CUDA_ROOT \
-Donnxruntime_USE_CUDA_NHWC_OPS=ON \
-Donnxruntime_CUDA_USE_TENSORRT=ON \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wenn du tensorrt nutzt, musst du dann prüfen, ob das explicit installiert ist? Oder ist das immer beim CUDA SDK dabei?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheint nicht automatisch mitzukommen (https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html)... Ok da bau ich auch noch einen Check mit ein

onnxruntime.sh Outdated
-Donnxruntime_USE_CUDA_NHWC_OPS=ON \
-Donnxruntime_CUDA_USE_TENSORRT=ON \
"
elif [[ "$CUDA_VERSION" == "V12" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was ist wenn ROCm und CUDA beides vorhanden ist? Können wir dann nicht beides bauen?

Copy link
Collaborator Author

@ChSonnabend ChSonnabend Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktf
Copy link
Member

ktf commented Sep 17, 2024

Gneau...

…adding env-variables for GPU enabling during code execution. For al9_gpu container and simultaneous CUDA & ROCm build, this requires ChSonnabend/onnxruntime@6ffc40c
…e build with CUDA and ROCm fails due to a ROCm internal check for THRUST and CUB libraries, which are not in sync (file: /opt/rocm/include/thrust/system/cuda/config.h)
onnxruntime.sh Outdated
export ORT_MIGRAPHX_BUILD=0
fi
### TensorRT
if [ "$ORT_CUDA_BUILD" -eq 1 ] && [ $(find /opt/rocm* -name "libnvinfer*" -print -quit | wc -l 2>&1) -eq 1 ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you search for tensort in /opt/rocm?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅

onnxruntime.sh Outdated
-DCMAKE_CXX_FLAGS="$CXXFLAGS -Wno-unknown-warning -Wno-unknown-warning-option -Wno-error=unused-but-set-variable -Wno-error=deprecated" \
-DCMAKE_C_FLAGS="$CFLAGS -Wno-unknown-warning -Wno-unknown-warning-option -Wno-error=unused-but-set-variable -Wno-error=deprecated"
# Check ROCm build conditions
if { [ "$ALIBUILD_O2_FORCE_GPU" -ne 0 ] || [ "$ALIBUILD_ENABLE_HIP" -ne 0 ] || command -v rocminfo >/dev/null 2>&1; } && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it guaranteed that rocminfo is in the path? Perhaps, try rocminfo or /opt/rocm/bin/rocminfo?
Also, perhaps it needs HIP, not rocminfo? So perhaps check for /opt/rocm/bin/hipcc?

-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
-D__HIP_PLATFORM_AMD__=1 \
-DCMAKE_HIP_ARCHITECTURES=gfx906,gfx908 \
${ALIBUILD_O2_OVERRIDE_HIP_ARCHS:+-DCMAKE_HIP_ARCHITECTURES=${ALIBUILD_O2_OVERRIDE_HIP_ARCHS}} \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is also an OVERRIDE_CUDA_ARCH, could you check and use that as well?

onnxruntime.sh Outdated
-Donnxruntime_CUDA_HOME=/usr/local/cuda \
-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
-D__HIP_PLATFORM_AMD__=1 \
-DCMAKE_HIP_ARCHITECTURES=gfx906,gfx908 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. You first set CMAKE_HIP_ARCHITECTURES, and then you possibly override it in the next line? Why don't you expand ALIBUILD_O2_OVERRIDE_HIP_ARCHS to the defaults if empty? ${...:-default}?

onnxruntime.sh Outdated
# Check CUDA build conditions
if { [ "$ALIBUILD_O2_FORCE_GPU" -ne 0 ] || [ "$ALIBUILD_ENABLE_CUDA" -ne 0 ] || command -v nvcc >/dev/null 2>&1; } && \
{ [ -z "$DISABLE_GPU" ] || [ "$DISABLE_GPU" -eq 0 ]; }; then
export ORT_CUDA_BUILD=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also set a default for ALIBUILD_O2_OVERRIDE_CUDA_ARCHS to sm_86 or sm_89 architecture for now

@ChSonnabend ChSonnabend marked this pull request as ready for review November 22, 2024 21:43
@ChSonnabend ChSonnabend requested a review from a team as a code owner November 22, 2024 21:43
Copy link
Contributor

@davidrohr davidrohr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if [ "$ALIBUILD_O2_FORCE_GPU" -eq 1 ]
will not work if the variable is not defined.
You can either do [ "0$FOO" == "01" ] or use the bash [[ syntax. Please try with all variables undefined.

@ChSonnabend ChSonnabend requested a review from a team as a code owner November 28, 2024 13:29
ktf
ktf previously approved these changes Nov 29, 2024
@ktf
Copy link
Member

ktf commented Nov 29, 2024

Approving to start the CI.

ktf
ktf previously approved these changes Dec 1, 2024
@ChSonnabend
Copy link
Collaborator Author

ChSonnabend commented Dec 2, 2024

macOS failure seem unrelated to this PR. It can be merged from my side if there are no objections (@ktf ).

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib
else
export ORT_ROCM_BUILD=0
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fi
mkdir -p $INSTALLROOT/etc
cat << EOF > $INSTALLROOT/etc/ort-init.sh
export ORT_ROCM_BUILD=$ORT_ROCM_BUILD
EOF

@@ -45,7 +117,6 @@ mkdir -p "$INSTALLROOT/etc/modulefiles"
MODULEFILE="$INSTALLROOT/etc/modulefiles/$PKGNAME"
alibuild-generate-module --lib > "$MODULEFILE"
cat >> "$MODULEFILE" <<EoF

# Our environment
set ${PKGNAME}_ROOT \$::env(BASEDIR)/$PKGNAME/\$version
prepend-path ROOT_INCLUDE_PATH \$${PKGNAME}_ROOT/include/onnxruntime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
prepend-path ROOT_INCLUDE_PATH \$${PKGNAME}_ROOT/include/onnxruntime
prepend-path ROOT_INCLUDE_PATH \$${PKGNAME}_ROOT/include/onnxruntime
append-path LD_LIBRARY_PATH /opt/rocm/lib

@ChSonnabend
Copy link
Collaborator Author

PR is ready from my side. GPU build should be disabled on EPN's due to missing ROCm packages and ALMA linux major version mismatch (8 instead of 9).

@singiamtel
Copy link
Collaborator

@pzhristov just FYI this would leave our onnx version outside the tf2onnx compatibility range https://github.com/onnx/tensorflow-onnx?tab=readme-ov-file#tf2onnx---convert-tensorflow-keras-tensorflowjs-and-tflite-models-to-onnx

@ChSonnabend
Copy link
Collaborator Author

@pzhristov just FYI this would leave our onnx version outside the tf2onnx compatibility range https://github.com/onnx/tensorflow-onnx?tab=readme-ov-file#tf2onnx---convert-tensorflow-keras-tensorflowjs-and-tflite-models-to-onnx

Why actually? If you refer to the opset number:I think the ONNX opset number does not necessarily correlate with the ONNX version (I might be wrong here). Does it actually break something for the converter?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants