[PoC] Run GPU workload in Gardener cluster and provide concept how to enable GPU in Kyma Runtime #18771

Open
pbochynski opened this issue Jan 17, 2025 · 2 comments
@pbochynski

Users want to run their applications on GPUs. Code that requires a GPU needs the proper drivers installed on the node. Investigate what is needed and propose a concept for automating this process. These are the aspects to cover:

  • How to build the NVIDIA drivers?
  • Where to push the installer, and how to deploy it to Kyma Runtimes?
  • How to install the NVIDIA driver on all GPU nodes?
    • Label GPU nodes so that a node selector can target them
    • Run a DaemonSet on the labeled nodes
  • How to handle Garden Linux upgrades?
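
The label-and-DaemonSet step above can be sketched as follows. This is only an illustration, not the manifest used in this PoC: the label key `example.com/gpu`, the DaemonSet name, the namespace, and the image reference are all hypothetical, and the privileged security context is an assumption about what loading kernel modules on the host would require.

```python
import json

# Hypothetical label; the issue does not name a label key or value.
GPU_LABEL_KEY = "example.com/gpu"
GPU_LABEL_VALUE = "true"


def build_installer_daemonset(image: str, namespace: str = "kube-system") -> dict:
    """Return a DaemonSet manifest that is scheduled only onto labeled GPU nodes."""
    labels = {"app": "nvidia-driver-installer"}  # hypothetical pod label
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "nvidia-driver-installer", "namespace": namespace},
        "spec": {
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Restrict scheduling to nodes carrying the GPU label.
                    "nodeSelector": {GPU_LABEL_KEY: GPU_LABEL_VALUE},
                    "containers": [
                        {
                            "name": "installer",
                            "image": image,
                            # Assumption: installing a kernel module needs a privileged container.
                            "securityContext": {"privileged": True},
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image reference, for illustration only.
    manifest = build_installer_daemonset("registry.example/nvidia-installer:dev")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied like any other manifest. Nodes would first be labeled, for example with `kubectl label node <node-name> example.com/gpu=true`.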
@a-thaler a-thaler mentioned this issue Jan 16, 2025
9 tasks
@pbochynski pbochynski self-assigned this Jan 17, 2025
@pbochynski

Progress update

I was able to build and run the NVIDIA drivers using a fork of https://github.com/gardenlinux/gardenlinux-nvidia-installer.
Fork link: https://github.com/pbochynski/gardenlinux-nvidia-installer
Changes:

  • Added a workflow to build the installer and push it to ghcr.
  • Modified the sample values to use that image and a node affinity based on machine type.
  • Updated the README to reflect the newest Garden Linux version with the matching NVIDIA driver, and to describe how to set up the image pull secret.
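
As an illustration of a node affinity based on machine type, here is a minimal sketch. It assumes the nodes carry the well-known `node.kubernetes.io/instance-type` label; the fork may use a different key. The machine type names in the example are placeholders.

```python
import json


def machine_type_affinity(machine_types: list[str]) -> dict:
    """Build a pod `affinity` stanza that only admits nodes of the given machine types."""
    if not machine_types:
        raise ValueError("at least one machine type is required")
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                # Assumption: machine type is exposed via this well-known label.
                                "key": "node.kubernetes.io/instance-type",
                                "operator": "In",
                                # De-duplicate and sort for a stable manifest.
                                "values": sorted(set(machine_types)),
                            }
                        ]
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Placeholder machine types, for illustration only.
    print(json.dumps(machine_type_affinity(["g4dn.xlarge", "g5.xlarge"]), indent=2))
```

Compared with a custom label, this approach needs no manual node labeling, but the list of GPU machine types has to be maintained.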

License analysis

The drivers are not distributed with Garden Linux because of the NVIDIA license. The license clearly states that
NVIDIA grants you a non-exclusive, revocable, non-transferable and non-sublicensable license to deploy, for your own use, the SOFTWARE on infrastructure you own or lease and you may not sell, rent, sublicense, distribute or transfer the SOFTWARE or provide commercial hosting services with the SOFTWARE

Given that, I would rather avoid distributing the driver in Docker images. We can protect the images with a pull secret, but our users have access to that secret, so we cannot fully control who can download the image. That approach is therefore suitable only for our own teams. We cannot redistribute the drivers to external customers.

Recommendation

I suggest building a Kyma module that downloads, compiles, and installs the driver when needed. The DaemonSet can be created from the Garden Linux Docker image that contains all kernel header files required for compilation.
To mitigate unavailability of the NVIDIA servers and to speed up node startup, we can use S3 (BTP Object Store) for caching. The cache would be provided by the cluster owner, so we would not redistribute the software to other entities.
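
The cache-first flow could look roughly like the sketch below. It is not an implementation of the proposed module. The `cache_get`, `cache_put`, and `download_and_compile` callables are hypothetical stand-ins for the object store client and the build step, and caching the compiled artifact (rather than the downloaded installer) is an assumption. Because the cache key includes the kernel version, a Garden Linux upgrade that changes the kernel would miss the cache and trigger a fresh build.

```python
from typing import Callable, Optional


def cache_key(kernel_version: str, driver_version: str) -> str:
    """Key the artifact by driver and kernel version (hypothetical layout)."""
    return f"nvidia/{driver_version}/{kernel_version}/modules.tar.gz"


def obtain_driver(
    kernel_version: str,
    driver_version: str,
    cache_get: Callable[[str], Optional[bytes]],
    cache_put: Callable[[str, bytes], None],
    download_and_compile: Callable[[str, str], bytes],
) -> tuple[bytes, bool]:
    """Return (artifact, from_cache), building and caching only on a cache miss."""
    key = cache_key(kernel_version, driver_version)
    cached = cache_get(key)
    if cached is not None:
        return cached, True
    # Cache miss: fetch from upstream and compile against this kernel.
    artifact = download_and_compile(kernel_version, driver_version)
    cache_put(key, artifact)
    return artifact, False


if __name__ == "__main__":
    store: dict[str, bytes] = {}  # in-memory stand-in for the object store

    def fake_build(kernel: str, driver: str) -> bytes:
        return f"{driver}@{kernel}".encode()

    # Placeholder version strings, for illustration only.
    for _ in range(2):
        _, hit = obtain_driver("k1", "d1", store.get, store.__setitem__, fake_build)
        print("cache hit" if hit else "built and cached")
```

Passing the store and the build step as callables keeps the logic testable without network access or a real bucket.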

@pbochynski

Another idea from @a-thaler:
Verify whether we can provide GPU usage metrics.
See this blog post: https://blog.kubecost.com/blog/nvidia-gpu-usage/
