Faults are not isolated #25

Open
weapons97 opened this issue Oct 21, 2024 · 0 comments


weapons97 commented Oct 21, 2024

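Injecting a fault with a CUDA program shows that faults are not isolated.
Pod A runs the PyTorch example training program. Pod B uses the same GPU and runs a fault-injection program. After the fault is injected, the training program in pod A crashes (dumps).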

The log from pod A is as follows:
[image]

The fault-injection program run in pod B is as follows:

#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// CUDA kernel, executed on the GPU
__global__ void kernel(int i) {
    // If the input integer is greater than 0, trigger a trap
    if(i > 0) {
        asm("trap;");
    }

    // Print the value of i to the console (only reached when the thread did not trap)
    ::printf("%d\n", i);
}

// Check for CUDA errors
inline void error_check(cudaError_t err, const char* file, int line) {
    // If an error occurred, print the error message to stderr (the process keeps running)
    if(err != cudaSuccess) {
        ::fprintf(stderr, "CUDA ERROR at %s[%d] : %s\n", file, line, cudaGetErrorString(err));
    }
}

// Macro that simplifies error checking after CUDA calls
#define CUDA_CHECK(err) do { error_check(err, __FILE__, __LINE__); } while(0)

int main() {
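    // Launch the kernel with 1 block and a single thread, passing 1 as the argument so that it faults immediately
    kernel<<<1, 1>>>(1);
    // Check for errors after the kernel launch. cudaGetLastError() would also reset the error code to success, but neither call clears a sticky error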
    CUDA_CHECK(cudaPeekAtLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    sleep(180);
}
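
To narrow the report down without PyTorch, a minimal stand-in for the workload in pod A could look like the sketch below. It is not the training program used in this report: the names `heartbeat` and `cuda_ok`, the one-second interval, and the 300-iteration limit are hypothetical choices made for illustration. The idea is to start it in pod A first and then run the injector in pod B on the same GPU. If this loop starts reporting errors after the trap, the fault is crossing the process boundary.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <unistd.h>

// Trivial kernel: increments a counter stored in device memory.
__global__ void heartbeat(int* counter) {
    *counter += 1;
}

// Returns true when err is cudaSuccess; otherwise logs the error and returns false.
static bool cuda_ok(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    int* d_counter = nullptr;
    if (!cuda_ok(cudaMalloc(&d_counter, sizeof(int)), "cudaMalloc")) return 1;
    if (!cuda_ok(cudaMemset(d_counter, 0, sizeof(int)), "cudaMemset")) return 1;

    // Assumed duration: 300 one-second iterations, long enough to overlap
    // the injector's sleep(180).
    for (int i = 0; i < 300; ++i) {
        heartbeat<<<1, 1>>>(d_counter);
        if (!cuda_ok(cudaGetLastError(), "kernel launch")) return 1;
        if (!cuda_ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;

        int h_counter = 0;
        if (!cuda_ok(cudaMemcpy(&h_counter, d_counter, sizeof(int),
                                cudaMemcpyDeviceToHost), "cudaMemcpy")) return 1;

        std::printf("heartbeat %d ok\n", h_counter);
        std::fflush(stdout);
        sleep(1);
    }

    cudaFree(d_counter);
    return 0;
}
```

Built with `nvcc`, the program exits with status 1 at the first failing CUDA call, so the last `heartbeat` line in the pod log marks roughly when the fault reached this process.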

weapons97 changed the title from "故障不隔离隔离" to "故障不隔离" on Oct 21, 2024