
Rewrite the getValidWorkDivForKernel 1D and 2D tests #2331

Merged

Conversation

fwyzard commented Jul 30, 2024

Fix the getValidWorkDivForKernel tests for the CUDA backend and merge them with the tests for the other parallel backends.

Fix the getValidWorkDivForKernel tests for the SYCL CPU backend and move them to the parallel backends.

Rewrite the getValidWorkDivForKernel tests: test more valid and invalid work divisions, merge the serial and parallel backends, and use traits to run tests specific to the serial backends.
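For context, a minimal sketch of the pattern these tests exercise, assuming the alpaka 1.x kernel-bundle API (Acc, dev, kernelBundle, extent and elemsPerThread stand in for the objects the real test builds; exact signatures may differ between alpaka versions):

// Ask alpaka for a work division that should be valid for this kernel on this device...
auto const workDiv = alpaka::getValidWorkDivForKernel<Acc>(dev, kernelBundle, extent, elemsPerThread);
// ...and check that alpaka's own validator agrees.
CHECK(alpaka::isValidWorkDivKernel<Acc>(dev, kernelBundle, workDiv));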

fwyzard requested a review from psychocoderHPC on July 30, 2024 at 17:55
fwyzard commented Jul 30, 2024

@mehmetyusufoglu can you double-check the tests?

mehmetyusufoglu commented:

> @mehmetyusufoglu can you double-check the tests?

Ok, doing. Thanks.

mehmetyusufoglu commented Jul 30, 2024

> @mehmetyusufoglu can you double-check the tests?

Locally everything worked for me, as shown below; I am trying to understand which accelerator is generating the error. I pushed a copy branch to test on CI with logs, and I will also test on HAL.

/projects/alpaka-dir/alpaka/build/test/unit/workDiv$ ./workDivTest 
Randomness seeded to: 3752370894
- Acc: AccCpuSerial<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 536870911), blockThreadExtent: (2, 1), threadElemExtent: (1, 1)}
- Acc: AccCpuThreads<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 4329604), blockThreadExtent: (2, 124), threadElemExtent: (1, 1)}
- Acc: AccCpuTbbBlocks<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 536870911), blockThreadExtent: (2, 1), threadElemExtent: (1, 1)}
- Acc: AccCpuOmp2Blocks<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 536870911), blockThreadExtent: (2, 1), threadElemExtent: (1, 1)}
- Acc: AccCpuOmp2Threads<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 35791394), blockThreadExtent: (2, 15), threadElemExtent: (1, 1)}
- Acc: AccGpuCudaRt<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 729444), blockThreadExtent: (2, 736), threadElemExtent: (1, 1)}
===============================================================================
All tests passed (425 assertions in 85 test cases)

// Check that using too many threads per block is not valid.
auto const invalidWorkDiv
= WorkDiv{Vec{8, threadsPerGridTestValue / threadsPerBlock / 8}, Vec{2, threadsPerBlock}, Vec{1, 1}};
CHECK(not alpaka::isValidWorkDivKernel<Acc>(dev, kernelBundle, invalidWorkDiv));
mehmetyusufoglu commented:

My initial observation: the check assumes that 2 * threadsPerBlock will exceed the allowed block size. But threadsPerBlock is not the maximum block size; it is a value found by getValidWorkDiv. Why should 2 * threadsPerBlock exceed the block-size limit on every accelerator? Anyway, I am trying to understand which accelerator it is. Thanks.

mehmetyusufoglu commented Jul 30, 2024

It seems to be happening with AccCpuOmp2Threads. The work division below is valid because AccCpuOmp2Threads allows 64 threads per block. The test expects an invalid work division, but it is valid, which in my opinion is the normal result. We should not run this test for AccCpuOmp2Threads. [Tested at HAL]

- Acc: AccCpuOmp2Threads<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 22369621), blockThreadExtent: (2, 24), threadElemExtent: (1, 1)}
threadsPerBlock: 24, threadsPerBlockLimit: 64
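(Concretely: the blockThreadExtent (2, 24) requests 2 × 24 = 48 threads per block, which is below the 64-thread limit of AccCpuOmp2Threads, so the block-size part of the check is satisfied and the work division is valid.)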


fwyzard commented:

My idea was that

  • threadsPerGridTestValue is the maximum number of threads per grid (props.m_blockThreadCountMax * props.m_gridBlockCountMax);
  • a grid of Vec{8, threadsPerGridTestValue / threadsPerBlock / 8} blocks of Vec{2, threadsPerBlock} threads each should request a total of roughly 2 * threadsPerGridTestValue threads (up to integer-division rounding), which I expected to be bigger than threadsPerGridTestValue.
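(With the AccCpuOmp2Threads numbers from above this works out: 8 × 22369621 = 178956968 blocks of 2 × 24 = 48 threads request 178956968 × 48 = 8589934464 threads, exactly twice the 4294967232 computed for threadsPerGridTestValue; why the check nevertheless passes is worked out below.)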

fwyzard commented:

That is, I don't expect to exceed the block size, but the total grid size.

fwyzard commented:

Ah, I see!

Let me first update the PR to test what we actually check in Alpaka.

Before adding the new check, we should verify whether it actually fails on the various backends, like CUDA.

mehmetyusufoglu commented:

I added it, but the accDevProps of the Omp2Threads backend are too large:

  • Acc: AccCpuOmp2Threads<2,unsigned int>, invalidWorkDiv: {gridBlockExtent: (8, 22369621), blockThreadExtent: (2, 24), threadElemExtent: (1, 1)}
    threadsPerBlock: 24, threadsPerBlockLimit: 64
    calculation: if(178956968 * 48 > 4294967295 * 64) return false;
    check: 4294967168 > 4294967232 is false, so the work division is reported as valid, because accDevProps.m_gridBlockCountMax is too large for Omp2Threads.
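Both products in that comparison have already wrapped around in 32-bit unsigned arithmetic (Idx is unsigned int here). A minimal, self-contained illustration of the wraparound (not alpaka code):

#include <cstdint>
#include <cstdio>

int main()
{
    // Exact values: 178956968 * 48 = 8589934464 and 4294967295 * 64 = 274877906880,
    // but both exceed 2^32 - 1 and wrap when computed in 32-bit unsigned arithmetic.
    std::uint32_t const requestedThreads = 178956968u * 48u;   // wraps to 4294967168
    std::uint32_t const threadsPerGridMax = 4294967295u * 64u; // wraps to 4294967232
    // The wrapped comparison is false, so the "too large" work division is accepted.
    std::printf("%u > %u -> %s\n", requestedThreads, threadsPerGridMax,
                requestedThreads > threadsPerGridMax ? "true" : "false");
}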

fwyzard commented:

OK, also on the CpuSerial backend, it seems that the maximum grid size is not validated properly.

props.m_gridBlockCountMax is 2147483647 (the maximum value for a signed 32-bit integer), but launching a kernel with 4294967294 (twice that) blocks still works:

[+] QueueGenericThreadsBlocking
[-] QueueGenericThreadsBlocking
createTaskKernel workDiv: {gridBlockExtent: (4294967294), blockThreadExtent: (1), threadElemExtent: (1)}, kernelFnObj: TestKernelWithManyRegisters
[+] operator()
operator() blockSharedMemDynSizeBytes: 0 B
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
The sum is 1492.00, the argument is 200
...
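A sketch of such a launch, assuming the usual alpaka 1.x API (the queue setup, the TestKernelWithManyRegisters functor and its argument are assumptions modelled on the log above):

using Dim = alpaka::DimInt<1u>;
using Idx = std::uint32_t;
using Acc = alpaka::AccCpuSerial<Dim, Idx>;
using Vec = alpaka::Vec<Dim, Idx>;
using WorkDiv = alpaka::WorkDivMembers<Dim, Idx>;

auto const platform = alpaka::Platform<Acc>{};
auto const dev = alpaka::getDevByIdx(platform, 0);
auto queue = alpaka::Queue<Acc, alpaka::Blocking>{dev};

auto const props = alpaka::getAccDevProps<Acc>(dev);
// props.m_gridBlockCountMax == 2147483647, yet requesting twice as many blocks
// is not rejected: the kernel runs anyway.
auto const workDiv = WorkDiv{Vec{2u * props.m_gridBlockCountMax}, Vec{Idx{1}}, Vec{Idx{1}}};
alpaka::exec<Acc>(queue, workDiv, TestKernelWithManyRegisters{}, 200ul);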

fwyzard commented:

For the time being I'll update the test to check only the block size, not the grid size.

We can follow up on the grid-size check in #2334.

mehmetyusufoglu commented:

Great info, thanks.

fwyzard force-pushed the fix_WorkDivForKernelTest branch from f52bd9b to 6206326 on July 31, 2024 at 07:59
fwyzard requested a review from mehmetyusufoglu on July 31, 2024 at 07:59
psychocoderHPC added this to the 1.2.0 milestone on Aug 1, 2024
psychocoderHPC merged commit ba169cd into alpaka-group:develop on Aug 1, 2024
22 checks passed
fwyzard deleted the fix_WorkDivForKernelTest branch on August 1, 2024 at 12:39