After injecting a fault via a CUDA program, we found that the fault is not isolated: pod A runs a PyTorch example training program, while pod B, sharing the same GPU, runs the fault-injection program. Once the fault is injected, the training program in pod A crashes (core dumps).
Pod A logs are as follows:
Pod B fault-injection program is as follows:
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// CUDA kernel executed on the GPU
__global__ void kernel(int i) {
    // Trigger a trap if the input integer is greater than 0
    if(i > 0) {
        asm("trap;");
    }
    // Print the value of i to the console
    ::printf("%d\n", i);
}

// Check for CUDA errors
inline void error_check(cudaError_t err, const char* file, int line) {
    // If an error occurred, print it to stderr
    if(err != cudaSuccess) {
        ::fprintf(stderr, "CUDA ERROR at %s[%d] : %s\n", file, line, cudaGetErrorString(err));
    }
}

// Macro to simplify error checking after CUDA calls
#define CUDA_CHECK(err) do { error_check(err, __FILE__, __LINE__); } while(0)

int main() {
    // Launch the kernel with one block and a single thread, passing 1 so it traps immediately
    kernel<<<1, 1>>>(1);
    // Check for errors after the kernel launch; cudaGetLastError() would also reset the error
    // code to success, but for a sticky error neither call clears the state
    CUDA_CHECK(cudaPeekAtLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    sleep(180);
}
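For context, here is a minimal CUDA sketch of the kind of long-running workload pod A stands in for (a hypothetical stand-in, not the actual PyTorch example training script): it repeatedly launches a harmless kernel and synchronizes, so if the trap injected from pod B leaks across the shared GPU, the synchronize call in this process starts failing and the iteration at which it happens is reported.

#include <cuda_runtime.h>
#include <cstdio>
#include <unistd.h>

// Benign kernel: increments every element of the buffer.
__global__ void step(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] += 1.0f;
    }
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        ::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Launch the benign kernel in a loop; if a sticky error from another
    // process on the same GPU leaks into this context (i.e. faults are not
    // isolated), cudaDeviceSynchronize() starts failing and we report it.
    for (int iter = 0; ; ++iter) {
        step<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            ::fprintf(stderr, "iteration %d: %s\n", iter, cudaGetErrorString(err));
            return 1;
        }
        sleep(1);
    }
}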