For CUDA fro-norm kernels, move `__syncthreads()` out of the loop. The problem was discovered in dpct-generated SYCL kernels. #590
| Job | Run time |
|---|---|
| | 7m 14s |
| | 10m 27s |
| | 11m 6s |
| | 5m 58s |
| | 3m 44s |
| | 5m 37s |
| | 3m 57s |
| | 5m 21s |
| Total | 53m 24s |