OneAPI support? #18

Open
oscarbg opened this issue Apr 4, 2022 · 8 comments

oscarbg commented Apr 4, 2022

Hi,
If Orochi gained a OneAPI backend, desktop GPU support would be complete from a vendor standpoint.
I hope it gets added eventually, and by opening this issue I should get notified when it is done :-)
EDIT: I don't know whether OneAPI is currently as complete as HIP and CUDA (for example, whether it has an equivalent of the rtc component), so I'm not sure whether adding support for it is possible or easy.

takahiroharada (Collaborator) commented

Orochi is designed so that more backends can be added if needed. Adding Intel support makes sense to complete the project. We don't have any plans to add it at the moment, but we are happy to work with anyone who is interested in adding it.
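
As a rough illustration of what "adding a backend" can mean for a runtime-dispatch layer, here is a minimal sketch. Every name in it (`Backend`, `BackendApi`, `selectBackend`, and the stub functions) is hypothetical and is not Orochi's actual API; the stubs stand in for calls that a real implementation would forward to each vendor's driver library.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical backend identifiers; ONEAPI is the one this issue asks about.
enum class Backend { HIP, CUDA, ONEAPI };

// Hypothetical table of entry points that each backend would fill in.
struct BackendApi {
    const char* name;
    int (*deviceCount)();
};

// Stubs standing in for calls into each vendor's driver library.
inline int hipDeviceCountStub() { return 1; }
inline int cudaDeviceCountStub() { return 1; }
inline int oneapiDeviceCountStub() { return 0; } // nothing implemented in this sketch

// Pick the function table at run time, so a new backend only adds one case.
inline BackendApi selectBackend(Backend b) {
    switch (b) {
        case Backend::HIP:    return {"hip", &hipDeviceCountStub};
        case Backend::CUDA:   return {"cuda", &cudaDeviceCountStub};
        case Backend::ONEAPI: return {"oneapi", &oneapiDeviceCountStub};
    }
    throw std::invalid_argument("unknown backend");
}
```

With a layout like this, callers only ever touch `BackendApi`, so supporting another vendor would mostly mean supplying one more table of function pointers.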

takahiroharada added a commit that referenced this issue Jun 2, 2022
* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for an input array larger than the wavefront size, the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel so that it uses less LDS and can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

* Merging another merge (#18)

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
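
The "Fix LDS scan bug" entry in the commit message above describes a read-after-write hazard: an in-place scan lets one wavefront overwrite values that another wavefront still needs to read. The commit's own code is not shown here, so the following is only a CPU-side sketch of one common way to avoid that hazard, a Hillis-Steele style inclusive scan with two buffers. The function name is hypothetical, and this is not necessarily the approach the commit took (a later entry mentions a version that needs no temp buffer).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive prefix sum in the Hillis-Steele style, simulated on the CPU.
// Each pass reads only from `in` and writes only to `out`, so no element is
// overwritten before every "work item" of that pass has read it.
inline std::vector<int> inclusiveScanDoubleBuffered(std::vector<int> in) {
    std::vector<int> out(in.size());
    for (std::size_t stride = 1; stride < in.size(); stride *= 2) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            // Writing to in[i] here instead would let later iterations read
            // partially updated values, which is the hazard described above.
            out[i] = (i >= stride) ? in[i] + in[i - stride] : in[i];
        }
        std::swap(in, out); // the freshly written buffer becomes the next input
    }
    return in;
}
```

On a GPU the inner loop would be spread across work items, with a workgroup barrier taking the place of the point where the buffers are swapped.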
jammm added a commit that referenced this issue Aug 19, 2022
* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 radix sort (#19)

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix script

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
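
The commit message above mentions a quick fix for older Linux gcc toolchains that lack `std::exclusive_scan` (a C++17 `<numeric>` algorithm). The actual fix is not shown in this thread; a minimal sketch of the kind of hand-written replacement that works on such toolchains could look like the following, where `exclusiveScan` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] holds init plus the sum of in[0..i-1].
// Written as a plain loop so it builds with standard libraries that do not
// provide std::exclusive_scan.
template <typename T>
std::vector<T> exclusiveScan(const std::vector<T>& in, T init = T{}) {
    std::vector<T> out(in.size());
    T running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;  // value accumulated before the current element
        running += in[i];
    }
    return out;
}
```

In a radix sort, a scan of this kind is typically what turns per-bucket counts into starting offsets.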
jammm added a commit that referenced this issue Nov 27, 2022
* Feature/oro 0 amdadvtech merge (#43)

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Allow usage of libhiprtc64.so if exists

* [ORO-0] Fix linux loading of libhiprtc.so

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>
@mudapanda2

I'm going to attempt to tackle this, or a portion of it, for a passion project I've started: taking all the Cycles rendering backends and porting them to Orochi, while using Orochi to implement more. CUDA/OptiX, ROCm/Vulkan RT, and (maybe) resurrecting OpenCL as an option with RT. Right now the most difficult part is deciding the how/why/who/what/where/when, given the big projects and sweeping changes ongoing in this arena. Makes a guy want to turn into a syclpath.

The passion project's goal is to reduce the Cycles renderer code from 4 implementations to 1, then ask Orochi to conduct the symphony so that the 1 works on whatever hardware is presented. So more of a 1-headed, 8-tailed dragon.

@takahiroharada
Collaborator

Actually, I started hooking up oneAPI Level Zero to Orochi a while ago (I haven't had time to go back and finish it). It was working fine, although some parts didn't come out as clean as I wanted because the model is slightly different.

@oscarbg
Author

oscarbg commented Mar 1, 2023

@takahiroharada nice! Could you publish even the “unfinished work” as a branch on the public Orochi GitHub repo, so someone can start from there and iterate on your work?
I think that would be nice, so your work isn’t lost.

@takahiroharada
Collaborator

@oscarbg FYI, it's in our fork right now but I can push the branch to this orochi repo.

https://github.com/amdadvtech/Orochi/tree/feature/ORO-0-oneapi
amdadvtech@3eaabdc

Not all tests are working, but the basic and important ones are:

Note: Google Test filter = *kernelExec*:*init*:*deviceprops*:*malloc*
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from OroTestBase
[ RUN      ] OroTestBase.init
[       OK ] OroTestBase.init (91 ms)
[ RUN      ] OroTestBase.deviceprops
executing on Intel(R) Arc(TM) A770 Graphics ()
512 multiProcessors
[       OK ] OroTestBase.deviceprops (6 ms)
[ RUN      ] OroTestBase.malloc
[       OK ] OroTestBase.malloc (28 ms)
[ RUN      ] OroTestBase.kernelExec
[       OK ] OroTestBase.kernelExec (81 ms)
[ RUN      ] OroTestBase.kernelExecPreCompiled
0: 123
1: 123
2: 123
3: 123
4: 123
5: 123
6: 123
7: 123
8: 123
9: 123
10: 123
11: 123
12: 123
13: 123
14: 123
15: 123
16: 123
17: 123
18: 123
19: 123
20: 123
21: 123
22: 123
23: 123
24: 123
25: 123
26: 123
27: 123
28: 123
29: 123
30: 123
31: 123
32: 123
33: 123
34: 123
35: 123
36: 123
37: 123
38: 123
39: 123
40: 123
41: 123
42: 123
43: 123
44: 123
45: 123
46: 123
47: 123
48: 123
49: 123
50: 123
51: 123
52: 123
53: 123
54: 123
55: 123
56: 123
57: 123
58: 123
59: 123
60: 123
61: 123
62: 123
63: 123
[       OK ] OroTestBase.kernelExecPreCompiled (126 ms)
[ RUN      ] OroTestBase.kernelExecPreCompiled1
executing on Intel(R) Arc(TM) A770 Graphics ()
512 multiProcessors
[       OK ] OroTestBase.kernelExecPreCompiled1 (93 ms)
[----------] 6 tests from OroTestBase (427 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (428 ms total)
[  PASSED  ] 6 tests.
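
For readers wondering why that filter selects exactly six tests: Google Test treats the filter as a `:`-separated list of wildcard patterns and runs a test when its full name (`Suite.Test`) matches any of them. The sketch below is a simplified model of that rule, not Google Test's own code; `globMatch` and `matchesFilter` are hypothetical names, and negative patterns (the part after `-`) are deliberately left out.

```cpp
#include <cstddef>
#include <string>

// Wildcard match: '*' matches any (possibly empty) substring,
// '?' matches exactly one character, everything else matches itself.
bool globMatch(const char* pattern, const char* text) {
    if (*pattern == '\0') return *text == '\0';
    if (*pattern == '*') {
        // Either '*' matches nothing, or it swallows one more character of text.
        return globMatch(pattern + 1, text) ||
               (*text != '\0' && globMatch(pattern, text + 1));
    }
    return *text != '\0' && (*pattern == '?' || *pattern == *text) &&
           globMatch(pattern + 1, text + 1);
}

// True if the full test name matches any ':'-separated positive pattern.
// Simplification: negative patterns are not handled in this sketch.
bool matchesFilter(const std::string& filter, const std::string& fullTestName) {
    std::size_t begin = 0;
    while (begin <= filter.size()) {
        std::size_t end = filter.find(':', begin);
        if (end == std::string::npos) end = filter.size();
        const std::string pattern = filter.substr(begin, end - begin);
        if (globMatch(pattern.c_str(), fullTestName.c_str())) return true;
        begin = end + 1;
    }
    return false;
}
```

Under this model, `*kernelExec*` alone accounts for `kernelExec`, `kernelExecPreCompiled` and `kernelExecPreCompiled1`, which is why four patterns yield six tests in the log above.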

@oscarbg
Author

oscarbg commented Mar 8, 2023

Nice! Will take a look, thanks.

@takahiroharada
Collaborator

The change was copied to a branch in this repo.

https://github.com/GPUOpen-LibrariesAndSDKs/Orochi/tree/feature/ORO-0-oneapi

jammm added a commit that referenced this issue May 22, 2023
* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <[email protected]>
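
The hazard described in the LDS scan fix can be reproduced on the CPU with a small model. This is only a sketch under stated assumptions: lanes of one wavefront run in lockstep (all reads, then all writes), wavefronts run one after another, and the scan is a Hillis-Steele inclusive scan. The function names are hypothetical, `waveSize` is assumed to be greater than zero, and double buffering is shown as one common remedy, not necessarily what the commit does.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// In-place scan: a later wavefront may read elements that an earlier
// wavefront has already updated in the same step.
std::vector<int> scanInPlace(std::vector<int> data, std::size_t waveSize) {
    const std::size_t n = data.size();
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        for (std::size_t waveBegin = 0; waveBegin < n; waveBegin += waveSize) {
            const std::size_t waveEnd = std::min(waveBegin + waveSize, n);
            std::vector<int> regs(waveEnd - waveBegin);
            // Lockstep read phase for this wavefront.
            for (std::size_t i = waveBegin; i < waveEnd; ++i)
                regs[i - waveBegin] = data[i] + (i >= offset ? data[i - offset] : 0);
            // Lockstep write phase for this wavefront.
            for (std::size_t i = waveBegin; i < waveEnd; ++i)
                data[i] = regs[i - waveBegin];
        }
    }
    return data;
}

// Double-buffered scan: every step reads only the previous step's values,
// so the order in which wavefronts execute no longer matters.
std::vector<int> scanDoubleBuffered(std::vector<int> data, std::size_t waveSize) {
    const std::size_t n = data.size();
    std::vector<int> next(n);
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        for (std::size_t waveBegin = 0; waveBegin < n; waveBegin += waveSize) {
            const std::size_t waveEnd = std::min(waveBegin + waveSize, n);
            for (std::size_t i = waveBegin; i < waveEnd; ++i)
                next[i] = data[i] + (i >= offset ? data[i - offset] : 0);
        }
        data.swap(next);
    }
    return data;
}
```

With eight ones as input, the in-place version is correct when `waveSize` is 8 (one wavefront covers the workgroup) and wrong when it is 4, while the double-buffered version is correct in both cases.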

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720
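
The occupancy figures above are consistent with register usage being the limiting factor. As a rough sketch only: ASSUMING a budget of 1024 VGPRs per lane on each SIMD (a value chosen here because it reproduces the reported numbers, not one stated in the commit), and ignoring allocation granularity and any LDS limit, the wave count is an integer division. `kAssumedVgprBudget` and `wavesLimitedByVgprs` are hypothetical names.

```cpp
// ASSUMPTION: 1024 VGPRs per lane per SIMD. This figure is not from the
// commit message; it is picked because it matches the reported occupancy.
constexpr int kAssumedVgprBudget = 1024;

// Waves that fit when only VGPR usage is considered (granularity ignored).
constexpr int wavesLimitedByVgprs(int vgprsPerThread) {
    return kAssumedVgprBudget / vgprsPerThread;
}
```

Under that assumption, 128 VGPRs gives 8 waves and 106 VGPRs gives 9, so the extra LDS used by SortKernel1 does not appear to be what bounds its occupancy.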

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix script

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <[email protected]>
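
For context, a hand-written replacement for the C++17 `std::exclusive_scan` is short. The following is a minimal sketch of such a fallback, not the code from the commit; `exclusiveScan` is a hypothetical name.

```cpp
#include <vector>

// Exclusive prefix sum: output[i] = init + input[0] + ... + input[i - 1].
template <typename T>
std::vector<T> exclusiveScan(const std::vector<T>& input, T init) {
    std::vector<T> output;
    output.reserve(input.size());
    T running = init;
    for (const T& value : input) {
        output.push_back(running);  // sum of everything before this element
        running = running + value;
    }
    return output;
}
```

A plain loop like this builds on toolchains whose standard library lacks the C++17 parallel-algorithm additions, at the cost of not accepting execution policies.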

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid.

Signed-off-by: Chih-Chen Kao <[email protected]>
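
One way to picture the kernel cache fix is a lookup keyed on more than the function name. The sketch below is an illustration under assumptions, not the Orochi implementation: `FunctionHandle` stands in for an `oroFunction`, and `buildKey` is a hypothetical string identifying the source and compile options that produced it.

```cpp
#include <map>
#include <string>
#include <utility>

using FunctionHandle = int;  // hypothetical stand-in for an oroFunction

class FunctionCache {
public:
    void insert(const std::string& buildKey, const std::string& functionName,
                FunctionHandle handle) {
        m_entries[std::make_pair(buildKey, functionName)] = handle;
    }

    // Returns nullptr on a miss, so a same-named function built from a
    // different source or option set is never returned by accident.
    const FunctionHandle* find(const std::string& buildKey,
                               const std::string& functionName) const {
        const auto it = m_entries.find(std::make_pair(buildKey, functionName));
        return it == m_entries.end() ? nullptr : &it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, FunctionHandle> m_entries;
};
```

A real cache would also have to drop entries when the owning module or context is destroyed, which this sketch does not model.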

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Fix linux loading of libhiprtc.so (#49)

* [ORO-0] Update test scripts (#50)

* [ORO-0] Update scripts for linux (#51)

* [ORO-0] Add new scripts (#52)

* [ORO-0] Add new scripts

* [ORO-0] Add execute permissions to scripts

* Fix Unit Test: getErrorString (#54)

Signed-off-by: Chih-Chen Kao <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Support hiprtc0504 (#55)

* [ORO-0] Update hiprtc and orortc error codes (#57)

* [ORO-0] Update test scripts to delete cache before running (#58)

* [ORO-0] Update hiprtc dlls

* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation

* Fix apt python installation (#63)

Update checkout version


Signed-off-by: Chih-Chen Kao <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] OrochiUtils update. (#61)

* [ORO-0] Add WMMA test (#62)

* [ORO-0] Add WMMA test

* [ORO-0] Add a comment for WMMA

* [ORO-0] Cleanup

* [ORO-0] Add a couple more comments

* [ORO-0] Remove hip_runtime include

* [ORO-0] Cleanup

* [ORO-0] Fix comment

* [ORO-0] Add Copyright notice

* [ORO-0] Load binary from the directory where DLL is.

* [ORO-0] Fix for linux.

---------

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>
takahiroharada added a commit that referenced this issue Sep 20, 2023
* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block).
Since not all threads run simultaneously, for input arrays larger than the wavefront size the previous algorithm will not work
because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix script

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

* remove space after -I (#33)

* Feature/oro 0 gpuopen merge 2 (#32)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation produced incorrect results when the wavefront (warp) size was not equal to the workgroup (block) size.
Since not all threads run simultaneously, the previous algorithm fails for input arrays larger than the wavefront size
because it performs the scan in place on the input array: the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <[email protected]>
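
The hazard described above can be reproduced with a small CPU model. This is a sketch under stated assumptions, not the Orochi kernel: it uses a Hillis-Steele inclusive scan, runs wavefronts one after another to mimic wavefronts that are not in lockstep with each other, and compares an in-place update with a double-buffered one. The function names and the `waveSize` parameter are hypothetical, and `waveSize` must be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// In-place variant: lanes of one wavefront read together and then write
// together, but a later wavefront sees values the earlier one already updated.
std::vector<int> scanInPlace(std::vector<int> data, std::size_t waveSize) {
    const std::size_t n = data.size();
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t begin = 0; begin < n; begin += waveSize) {
            const std::size_t end = std::min(begin + waveSize, n);
            std::vector<int> regs;  // per-lane registers of this wavefront
            for (std::size_t i = begin; i < end; ++i)
                regs.push_back(i >= stride ? data[i] + data[i - stride] : data[i]);
            for (std::size_t i = begin; i < end; ++i)
                data[i] = regs[i - begin];
        }
    }
    return data;
}

// Double-buffered variant: every step reads only from src and writes only to
// dst, so the order in which wavefronts run no longer matters.
std::vector<int> scanDoubleBuffered(std::vector<int> src, std::size_t waveSize) {
    const std::size_t n = src.size();
    std::vector<int> dst(n);
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t begin = 0; begin < n; begin += waveSize) {
            const std::size_t end = std::min(begin + waveSize, n);
            for (std::size_t i = begin; i < end; ++i)
                dst[i] = i >= stride ? src[i] + src[i - stride] : src[i];
        }
        std::swap(src, dst);  // models a workgroup barrier followed by a buffer swap
    }
    return src;
}
```

For the input `{1, 1, 1, 1}`, the in-place variant is correct when `waveSize` is 4 (one wavefront covers the whole array) but goes wrong when `waveSize` is 2, while the double-buffered variant returns `{1, 2, 3, 4}` in both cases.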

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix script

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Allow usage of libhiprtc64.so if exists

* [ORO-0] Fix linux loading of libhiprtc.so

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>

* Feature/oro 0 radix sort stream (#34)

* Initial commit

* Streams to the configuration

* Mutex in OrochiUtils

* Feature/oro 0 radix sort mutex baking (#36)

* Locking other methods in OrochiUtils

* Removing mutex from static methods

* Making mutex and map static

* Removing static from OrochiUtils

* Removing static from OrochiUtils

* Support Precompiled Kernels in Orochi (#37)

* Add bitcode support: getFunctionFromPrecompiledBinary

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add bitcode and the script to generate it.


Signed-off-by: Chih-Chen Kao <[email protected]>

* rewrite OROASSERT. Fix include file order.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Use string instead of const char*


Signed-off-by: Chih-Chen Kao <[email protected]>

* Rename the option from bitcode to precompiled


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Add bitcode script for nvidia fatbin

* [ORO-0] CUDA - hipfb->fatbin rename

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>

* Feature/oro 0 resource limits (#38)

* Adding limit functions

* Removing enum

* Removing enum

* Limit enum

* char string Windows API (#39)

* [ORO-0] Update precompiled radix sort kernels to use -ffast-math (#42)

* [ORO-0] Update precompiled radix sort kernels to use -ffast-math

* [ORO-0] Update RadixSort fatbin for NVIDIA and use fast math

* [ORO-0] Function pointer test. (#40)

* [ORO-0] Function pointer test.

* [ORO-0] launch2d.

* [ORO-0] Event, OroStopwatch.

* Implement GpuMemory to handle device memory operations.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WG.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation produced incorrect results when the wavefront (warp) size was not equal to the workgroup (block) size.
Since not all threads run simultaneously, the previous algorithm fails for input arrays larger than the wavefront size
because it performs the scan in place on the input array: the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and can support an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds: 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix script

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Fix linux loading of libhiprtc.so (#49)

* [ORO-0] Update test scripts (#50)

* [ORO-0] Update scripts for linux (#51)

* [ORO-0] Add new scripts (#52)

* [ORO-0] Add new scripts

* [ORO-0] Add execute permissions to scripts

* Fix Unit Test: getErrorString (#54)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Support hiprtc0504 (#55)

* [ORO-0] Update hiprtc and orortc error codes (#57)

* [ORO-0] Update test scripts to delete cache before running (#58)

* [ORO-0] Update hiprtc dlls

* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation

* Fix apt python installation (#63)

Update checkout version


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] OrochiUtils update. (#61)

* [ORO-0] Add WMMA test (#62)

* [ORO-0] Add WMMA test

* [ORO-0] Add a comment for WMMA

* [ORO-0] Cleanup

* [ORO-0] Add a couple more comments

* [ORO-0] Remove hip_runtime include

* [ORO-0] Cleanup

* [ORO-0] Fix comment

* [ORO-0] Add Copyright notice

* [ORO-0] Load binary from the directory where DLL is.

* [ORO-0] Fix for linux.

---------

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>

* [ORO-0] Remove unnecessary template.

* [ORO-0] Clean up. Added python script kernelCompile.py for compilation. (#46)

* [ORO-0] Clean up. Added python script kernelCompile.py for compilation.

* [ORO-0] hipsdk should be next to orochi dir.

* Update ParallelPrimitives/RadixSortKernels.h

Remove commented line

---------

Co-authored-by: Chih-Chen Kao <[email protected]>

* [ORO-0] add automatic arch selection (#47)

* [ORO-0] add automatic arch selection

* [ORO-0] Refactor and error output when it cannot find llc.

---------

Co-authored-by: takahiroharada <[email protected]>

* Feature/oro 0 flexible rtc error handling cherrypick (#48)

* add a handler for RTC load failure case on cuda.

* [ORO-0] add a handler for RTC load failure case on hip.

* [ORO-0] add cuda 12.0 sdk in nvrtc path

* [ORO-0] Remove non bundled bitcode tests. Clean up.

* [ORO-0] Clean up.

* [ORO-0] Add hiprtcGetBitcodeSize back.

* Update Orochi.cpp

* Update Orochi.cpp

* [ORO-0] Fix for multi-GPU/iGPU

* [HIPSDK-0] compute-22.40-osdb/36/

* [ORO-0] compute-23.10-osdb/9/

* [ORO-0] Update dll names

* [ORO-0] implement new test for managed memory, enable managed memory api, fix all warnings and cleanup

* [ORO-0] fix compile issues

* [ORO-0] fix declaration of oroManagedMalloc

* [ORO-0] change streaming kernel

* [ORO-0] enable it on windows too

* [ORO-0] add more asserts

* [ORO-0] update kernel

* [ORO-0] add host copy times

* [ORO-0] add malloc times

* Refactor Count

Signed-off-by: Chih-Chen Kao <[email protected]>

* Refactor Radix Sort class:

- Now the tmp buffer is allocated internally.
- All GPU memory buffers are changed to the GpuMemory class
- `configure` will now calculate the total number of GPU blocks for the count and the scan kernel
- The client does not need to call configure explicitly
- Refactor function parameters
- Remove count reference kernel



Signed-off-by: Chih-Chen Kao <[email protected]>

* Add `const`




Signed-off-by: Chih-Chen Kao <[email protected]>

* This commit does the following:

- Support setting the number of threads per block (a.k.a. block size) dynamically
- Refactor `exclusiveScanCpu`
- Extend `printKernelInfo`.



Signed-off-by: Chih-Chen Kao <[email protected]>

* The 1st working example for the radix sort optimization


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support configuring dynamic "NUM_WARPS_PER_BLOCK" in the sort kernel

Compute the optimal number of inputs for each block to handle.

Refactor the usage of stopwatch

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] add hiprtc future dll names in hiprtc path

* Add linux paths and dll names (#66)

* [ORO-0] Change path and rtc dll names

* [ORO-0] Make scripts executable

* [ORO-0] Add hiprtc path

* [ORO-0] Remove ParallelPrimitives, test/radix sort

* [ORO-0] Edit premake

---------

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
Co-authored-by: Richard Geslot <[email protected]>
Co-authored-by: Atsushi Yoshimura <[email protected]>
Co-authored-by: Atsushi.Yoshimura <[email protected]>
@nat42

nat42 commented Jun 14, 2024

Stuff like this really exemplifies the issue I have with AMD's attitude with GPUOpen: it feels much more about being open source as a bullet point than about making the tools developers would want, ones that are API agnostic and support all major GPU vendors.

AMD wants to compete with Nvidia and unlock some of the market share, but developers, if they care, just want to not be locked in and to have code that works everywhere. That this project doesn't see a reason to make OneAPI (or CPU) backends work as a priority (or bring them up to the 2.0 code) dooms it to being for those devs who care to support AMD specifically. It seems just as self-serving as Nvidia's vendor lock-in, except that Nvidia has the dominant position and AMD is playing catch-up.
