Rewrite the getValidWorkDivForKernel 1D and 2D tests #2331
Conversation
@mehmetyusufoglu can you double check the tests?
Ok, doing. Thanks.
Locally everything worked; I am trying to understand which accelerator is generating the error.
Locally everything worked for me as below. I pushed a copy branch to test on CI with logs; I will test on HAL as well.
```cpp
// Check that using too many threads per block is not valid.
auto const invalidWorkDiv
    = WorkDiv{Vec{8, threadsPerGridTestValue / threadsPerBlock / 8}, Vec{2, threadsPerBlock}, Vec{1, 1}};
CHECK(not alpaka::isValidWorkDivKernel<Acc>(dev, kernelBundle, invalidWorkDiv));
```
My initial observation is that the check assumes 2 * threadsPerBlock will exceed the allowed block size. But threadsPerBlock is not the maximum block size, it is a value found by getValidWorkDiv; why should 2 * threadsPerBlock exceed the block-size limit on all accelerators? Anyway, I am trying to figure out which accelerator this is. Thanks.
This seems to happen with AccCpuOmp2Threads. The work division below is valid because AccCpuOmp2Threads allows 64 threads per block. The test expects an invalid work division, but it is valid, which is in my opinion the normal result. We should not run this check for AccCpuOmp2Threads. [Test at HAL]
- Acc: AccCpuOmp2Threads<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 22369621), blockThreadExtent: (2, 24), threadElemExtent: (1, 1)}
threadsPerBlock:24, threadsPerBlockLimit:64
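To put the numbers from the log side by side (only the values 24 and 64 come from the log; the variable names and the check itself are just an illustration), doubling the suggested block size still fits within the AccCpuOmp2Threads limit, so the block dimension alone cannot make this work division invalid:

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    // Values taken from the log above; names are illustrative, not alpaka API.
    std::uint32_t const threadsPerBlock = 24;      // suggested by getValidWorkDivForKernel
    std::uint32_t const threadsPerBlockLimit = 64; // AccCpuOmp2Threads limit on this machine

    // The "invalid" work division uses blockThreadExtent (2, 24), i.e. 48 threads per block.
    std::uint32_t const blockThreads = 2 * threadsPerBlock;

    std::cout << std::boolalpha << (blockThreads <= threadsPerBlockLimit) << '\n'; // prints "true"
}
```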
My idea was that threadsPerGridTestValue is the maximum number of threads per grid (props.m_blockThreadCountMax * props.m_gridBlockCountMax). Vec{8, threadsPerGridTestValue / threadsPerBlock / 8} × Vec{2, threadsPerBlock} should give a total of 2 * threadsPerGridTestValue / threadsPerBlock * threadsPerBlock / 8 * 8, which I expected to be bigger than threadsPerGridTestValue.
That is, I don't expect to exceed the block size, but the total grid size.
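A small sketch of that arithmetic (the concrete value of threadsPerGridTestValue is made up here; the 8/2 split and the divisions come from the work division quoted above): the product of the grid and block extents works out to roughly twice threadsPerGridTestValue, so the intention was to exceed the total grid capacity, not the block size.

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    // Illustrative stand-in; in the test this would be
    // props.m_blockThreadCountMax * props.m_gridBlockCountMax.
    std::uint64_t const threadsPerGridTestValue = 1ull << 30;
    std::uint64_t const threadsPerBlock = 24;

    // WorkDiv{Vec{8, threadsPerGridTestValue / threadsPerBlock / 8}, Vec{2, threadsPerBlock}, Vec{1, 1}}
    std::uint64_t const gridBlocks   = 8 * (threadsPerGridTestValue / threadsPerBlock / 8);
    std::uint64_t const blockThreads = 2 * threadsPerBlock;

    // Up to integer-division rounding this is 2 * threadsPerGridTestValue,
    // i.e. clearly larger than threadsPerGridTestValue itself.
    std::uint64_t const totalThreads = gridBlocks * blockThreads;
    std::cout << std::boolalpha << (totalThreads > threadsPerGridTestValue) << '\n'; // prints "true"
}
```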
Ah, I see!
Let me update the PR to test what we actually check in Alpaka first.
Before adding the new check we should verify that it actually fails on the various backends, like CUDA.
I added it, but the accDevProps of the Omp2Threads backend is too large.
- Acc: AccCpuOmp2Threads<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 22369621), blockThreadExtent: (2, 24), threadElemExtent: (1, 1)}
threadsPerBlock:24, threadsPerBlockLimit:64
calculation: if(178956968 * 48 > 4294967295 * 64) return false;
check: 4294967168 > 4294967232 is false, so isValidWorkDivKernel returns true, because accDevProps.m_gridBlockCountMax is too large for Omp2Threads.
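For reference, the comparison from the log can be reproduced like this (a sketch; the assumption that both products are evaluated in 32-bit unsigned arithmetic is mine, but it matches the printed values, and with 64-bit arithmetic the left-hand side is still smaller than the right-hand side, so the work division is accepted either way):

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    // Numbers taken from the log above (Idx = unsigned int for this accelerator).
    std::uint32_t const requestedBlocks     = 178956968;   // product of gridBlockExtent
    std::uint32_t const requestedThreads    = 48;          // product of blockThreadExtent
    std::uint32_t const gridBlockCountMax   = 4294967295u; // accDevProps.m_gridBlockCountMax
    std::uint32_t const blockThreadCountMax = 64;          // accDevProps.m_blockThreadCountMax

    std::uint32_t const lhs = requestedBlocks * requestedThreads;      // wraps to 4294967168
    std::uint32_t const rhs = gridBlockCountMax * blockThreadCountMax; // wraps to 4294967232

    std::cout << lhs << " > " << rhs << " is " << std::boolalpha << (lhs > rhs) << '\n';
    // Prints "4294967168 > 4294967232 is false", so the work division is not rejected.
}
```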
OK, also on the CpuSerial backend it seems that the maximum grid size is not validated properly. props.m_gridBlockCountMax is 2147483647 (the maximum value of a signed 32-bit integer), but launching a kernel with 4294967294 (twice that) blocks still works:
[+] QueueGenericThreadsBlocking
[-] QueueGenericThreadsBlocking
createTaskKernel workDiv: {gridBlockExtent: (4294967294), blockThreadExtent: (1), threadElemExtent: (1)}, kernelFnObj: TestKernelWithManyRegisters
[+] operator()
operator() blockSharedMemDynSizeBytes: 0 B
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
...
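A minimal sketch of the grid-size check that seems to be skipped here (the helper name is hypothetical, not alpaka API; only the two numbers come from the comment above): comparing the requested block count against props.m_gridBlockCountMax would reject this launch.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper: a work division is only valid if the requested number
// of blocks does not exceed the device's grid limit.
bool gridBlockCountIsValid(std::uint64_t requestedBlocks, std::uint64_t gridBlockCountMax)
{
    return requestedBlocks <= gridBlockCountMax;
}

int main()
{
    std::uint64_t const gridBlockCountMax = 2147483647; // props.m_gridBlockCountMax on CpuSerial
    std::uint64_t const requestedBlocks   = 4294967294; // blocks actually launched above

    std::cout << std::boolalpha << gridBlockCountIsValid(requestedBlocks, gridBlockCountMax) << '\n';
    // Prints "false": this work division should have been rejected.
}
```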
For the time being I'll update the test to check only the block size, not the grid size.
We can follow up on the grid size check in #2334.
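As an illustration of what a block-size-only negative test could look like (this is a sketch, not the code that ended up in the PR; Vec, WorkDiv, props, dev and kernelBundle follow the names already used in the test excerpt and comments above):

```cpp
// Hypothetical sketch: make only the block exceed the device limit, so the
// check no longer depends on how the grid size is validated.
auto const tooManyThreadsPerBlock = props.m_blockThreadCountMax + 1;
auto const invalidBlockWorkDiv = WorkDiv{Vec{1, 1}, Vec{1, tooManyThreadsPerBlock}, Vec{1, 1}};
CHECK(not alpaka::isValidWorkDivKernel<Acc>(dev, kernelBundle, invalidBlockWorkDiv));
```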
Great info, thanks.
Force-pushed from f52bd9b to 6206326.
- Fix the getValidWorkDivForKernel tests for the CUDA backend and merge them with the other parallel backends.
- Fix the getValidWorkDivForKernel tests for the SYCL CPU backend and move them to the parallel backends.
- Rewrite the getValidWorkDivForKernel tests: test more valid and invalid work divisions, merge the serial and parallel backends, and use traits to run tests specific to the serial backends.