CUTLASS 3.7.0

@hwu36 hwu36 released this 18 Jan 15:07
b78588d
  • A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
  • Distributed GEMM, an experimental pipelined Tensor Parallelism implementation that uses existing CUTLASS kernels and CUDA runtime features and can hide most of the communication behind computation.
  • Improved persistent grid launch for Hopper kernels with large cluster sizes (cluster size >= 4) using the new make_kernel_hardware_info API, as shown in example 48.
  • Enabled high-precision accumulation for Hopper FP8 Sparse GEMM.
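
For readers unfamiliar with blockwise scaling, the arithmetic behind the first item can be sketched on the CPU. This is a minimal illustration and not the CUTLASS kernel or its API: the function name `blockwise_scaled_gemm`, the square block size `blk`, the row-major layouts, and the one-scale-per-block granularity are all assumptions made for this sketch, and plain `float` stands in for FP8 operands. Each K-block's partial product is multiplied by the scale factors of the A and B blocks it came from before being added to a full-precision accumulator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a blockwise-scaled GEMM (not a CUTLASS API).
// a:       M x K, row-major (stand-in for FP8 operand A)
// b:       K x N, row-major (stand-in for FP8 operand B)
// scale_a: (M/blk) x (K/blk), row-major, one scale per block of A (assumed layout)
// scale_b: (K/blk) x (N/blk), row-major, one scale per block of B (assumed layout)
// Returns D (M x N, row-major) accumulated in float.
std::vector<float> blockwise_scaled_gemm(
    const std::vector<float>& a, const std::vector<float>& b,
    const std::vector<float>& scale_a, const std::vector<float>& scale_b,
    std::size_t M, std::size_t N, std::size_t K, std::size_t blk) {
  assert(blk > 0 && M % blk == 0 && N % blk == 0 && K % blk == 0);
  const std::size_t k_blocks = K / blk;
  const std::size_t n_blocks = N / blk;

  std::vector<float> d(M * N, 0.0f);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = 0.0f;  // full-precision accumulator across K-blocks
      for (std::size_t kb = 0; kb < k_blocks; ++kb) {
        // Unscaled partial dot product over one K-block.
        float partial = 0.0f;
        for (std::size_t k = kb * blk; k < (kb + 1) * blk; ++k) {
          partial += a[m * K + k] * b[k * N + n];
        }
        // Apply the scales of the A block and B block that produced it.
        const float sa = scale_a[(m / blk) * k_blocks + kb];
        const float sb = scale_b[kb * n_blocks + (n / blk)];
        acc += partial * sa * sb;
      }
      d[m * N + n] = acc;
    }
  }
  return d;
}
```

Applying the scales once per K-block, rather than once per element, is what keeps the scaling tensor small relative to the operands; the granularity and layout used by the actual Hopper kernel may differ from the ones assumed here.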