Distributed GEMM is an experimental pipelined Tensor Parallelism implementation that uses existing CUTLASS kernels and CUDA runtime features, and can hide most of the communication behind computation.
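The overlap works like a software pipeline: while the GEMM for one tensor-parallel stage consumes an operand shard that has already arrived, the next stage's peer-to-peer transfer is in flight on a separate stream. The sketch below is not the Distributed GEMM API; it is a plain CUDA illustration of that pipelining idea, in which `gemm_stage`, the shard layout, and the single-peer topology are all hypothetical stand-ins.

```cpp
// Illustration of the pipelining idea only -- NOT the Distributed GEMM API.
// gemm_stage, the shard layout, and the single-peer topology are hypothetical.
#include <cuda_runtime.h>
#include <vector>

__global__ void gemm_stage(const float* a, const float* b, float* c, int n) {
  // Placeholder for a real CUTLASS GEMM on one tensor-parallel shard.
}

void pipelined_tp_gemm(std::vector<float*> const& recv_bufs,   // one buffer per stage
                       const float* remote_a, int peer_device, // shards live on the peer GPU
                       const float* local_b, float* c, int n, int num_stages) {
  cudaStream_t compute_stream, comm_stream;
  cudaStreamCreate(&compute_stream);
  cudaStreamCreate(&comm_stream);

  std::vector<cudaEvent_t> arrived(num_stages);
  for (auto& e : arrived) cudaEventCreate(&e);

  // Prologue: start fetching the first shard before any compute is queued.
  cudaMemcpyPeerAsync(recv_bufs[0], /*dstDevice=*/0, remote_a, peer_device,
                      n * sizeof(float), comm_stream);
  cudaEventRecord(arrived[0], comm_stream);

  for (int s = 0; s < num_stages; ++s) {
    // Issue the next shard's transfer so it overlaps this stage's GEMM.
    if (s + 1 < num_stages) {
      cudaMemcpyPeerAsync(recv_bufs[s + 1], 0, remote_a + (s + 1) * n,
                          peer_device, n * sizeof(float), comm_stream);
      cudaEventRecord(arrived[s + 1], comm_stream);
    }
    // Compute for stage s may only begin once its shard has landed.
    cudaStreamWaitEvent(compute_stream, arrived[s], 0);
    gemm_stage<<<128, 256, 0, compute_stream>>>(recv_bufs[s], local_b, c, n);
  }

  cudaStreamSynchronize(compute_stream);
  for (auto& e : arrived) cudaEventDestroy(e);
  cudaStreamDestroy(compute_stream);
  cudaStreamDestroy(comm_stream);
}
```

In steady state, the copy for stage s + 1 runs concurrently with the GEMM for stage s, so only the first transfer is exposed on the critical path.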
Improved persistent grid launch for Hopper kernels with large cluster sizes (cluster size >= 4) using the new make_kernel_hardware_info API, as shown in example 48.
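A minimal sketch of that pattern, assuming a fully-defined Hopper kernel type is supplied as the `GemmKernel` template parameter (built with the collective builders, as in example 48); the factory signature shown here is inferred from that example's usage:

```cpp
// Sketch assuming GemmKernel is a complete Hopper kernel type as defined in
// example 48; the make_kernel_hardware_info signature follows that example.
#include "cutlass/kernel_hardware_info.h"

template <class GemmKernel>
cutlass::KernelHardwareInfo query_hw_info(int device_id) {
  // Older pattern: only the SM count is queried. With large clusters
  // (size >= 4), occupancy limits can leave the persistent grid over-sized.
  cutlass::KernelHardwareInfo hw_info;
  hw_info.device_id = device_id;
  hw_info.sm_count =
      cutlass::KernelHardwareInfo::query_device_multiprocessor_count(device_id);

  // Newer pattern: additionally queries the maximum number of clusters of
  // this specific kernel that can be active at once, so the persistent grid
  // is sized to what the hardware can actually schedule.
  return cutlass::KernelHardwareInfo::make_kernel_hardware_info<GemmKernel>(device_id);
}
```

As in example 48, the resulting `hw_info` is then passed through the GEMM's `Arguments`, the same way the earlier sm_count-only value was.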
Enabled high-precision accumulation for Hopper FP8 Sparse GEMM.