OneAPI support? #18
Comments
Orochi is designed so that more backends can be added if needed. Adding Intel support makes sense to complete the project. We don't have any plans to add it at the moment, but we are happy to work with anyone who is interested in adding it.
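To make the "more backends" idea concrete, here is a minimal sketch of a runtime dispatch layer that accepts an extra backend. Every name in it (`Backend`, `BackendTable`, `Dispatcher`, the stub device-count functions) is hypothetical and chosen for illustration. It is not Orochi's actual code, and the stubs stand in for entry points that a real loader would resolve from a driver library at run time.

```cpp
#include <map>

// Hypothetical names throughout; this is an illustration, not Orochi's API.
enum class Backend { Hip, Cuda, LevelZero };

// Table of entry points that each backend fills in when it is loaded.
struct BackendTable {
    const char* name;
    int (*deviceCount)();
};

// Stubs standing in for functions resolved from a dynamically loaded driver.
static int hipDeviceCount() { return 1; }
static int levelZeroDeviceCount() { return 2; }

class Dispatcher {
public:
    // Adding a backend only means registering one more table.
    void registerBackend(Backend b, BackendTable t) { tables_[b] = t; }

    // Returns false (and keeps the current selection) if the backend
    // was never registered, e.g. because its driver is not installed.
    bool select(Backend b) {
        auto it = tables_.find(b);
        if (it == tables_.end()) return false;
        active_ = &it->second;  // std::map element addresses stay valid
        return true;
    }

    // Forwards to the active backend, or returns -1 if none is selected.
    int deviceCount() const { return active_ ? active_->deviceCount() : -1; }

private:
    std::map<Backend, BackendTable> tables_;
    const BackendTable* active_ = nullptr;
};
```

With this shape, callers only ever talk to `Dispatcher`, so supporting another vendor would mostly be a matter of filling in one more table, assuming that backend's programming model maps onto the same entry points.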
I'm going to attempt to tackle this, or a portion of it, for a passion project I've started: taking all the Cycles rendering backends and porting them to Orochi, while also trying to use Orochi to implement more (CUDA/OptiX, ROCm/Vulkan RT, and maybe resurrecting OpenCL as an option with RT). Right now the most difficult part is deciding the how/why/who/what/where/when, since there are big projects and sweeping changes ongoing in this arena. Makes a guy want to turn into a syclpath. The passion project's goal is to reduce the Cycles renderer code from 4 implementations to 1 and then ask Orochi to conduct the symphony to make that 1 work on whatever hardware is presented. So more of a 1-headed, 8-tailed dragon. |
Actually, I started hooking up oneAPI Level Zero to Orochi a while ago (I haven't had time to go back and finish it). It was working fine, although some parts didn't come out as clean as I'd like because the model is slightly different. |
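For readers picturing what an extra backend means in practice, here is a minimal sketch of runtime backend selection, assuming one dispatch table per backend. Every name in it (`Backend`, `DispatchTable`, `selectBackend`, the stub functions) is hypothetical and is not Orochi's actual API; the stubs stand in for driver entry points that a real implementation would load dynamically.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical backend bitmask for illustration; not Orochi's real enum.
enum Backend : std::uint32_t
{
    BACKEND_HIP        = 1u << 0,
    BACKEND_CUDA       = 1u << 1,
    BACKEND_LEVEL_ZERO = 1u << 2,
};

// One table of entry points per backend (only one entry point here).
struct DispatchTable
{
    Backend     id;
    const char* name;
    int ( *deviceCount )();
};

// Stubs: pretend that only a Level Zero device is present on this machine.
static int hipDeviceCountStub() { return 0; }
static int cudaDeviceCountStub() { return 0; }
static int levelZeroDeviceCountStub() { return 1; }

static const DispatchTable g_tables[] = {
    { BACKEND_HIP, "HIP", hipDeviceCountStub },
    { BACKEND_CUDA, "CUDA", cudaDeviceCountStub },
    { BACKEND_LEVEL_ZERO, "LevelZero", levelZeroDeviceCountStub },
};

// Returns the first requested backend that reports at least one device,
// or nullptr if none of the requested backends is usable.
const DispatchTable* selectBackend( std::uint32_t requestedMask )
{
    for( const DispatchTable& t : g_tables )
    {
        assert( t.deviceCount != nullptr );
        if( ( requestedMask & t.id ) != 0 && t.deviceCount() > 0 ) return &t;
    }
    return nullptr;
}
```

Application code would call `selectBackend` once at startup and route later calls through the returned table, so adding a backend means adding one more table entry rather than another code path in the application.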
@takahiroharada nice! Can you publish even the “unfinished work” as a branch on the public Orochi GitHub repo, so someone can start from there and iterate on your work? |
@oscarbg FYI, it's in our fork right now, but I can push the branch to this Orochi repo. https://github.com/amdadvtech/Orochi/tree/feature/ORO-0-oneapi Not all tests are working, but the basic and important ones are. |
Nice! Will take a look.. thanks.. |
The change was copied to a branch in this repo. https://github.com/GPUOpen-LibrariesAndSDKs/Orochi/tree/feature/ORO-0-oneapi |
* [ORO-0] Fix linux loading of libhiprtc.so (#49)
* [ORO-0] Update test scripts (#50)
* [ORO-0] Update scripts for linux (#51)
* [ORO-0] Add new scripts (#52)
* [ORO-0] Add execute permissions to scripts
* Fix Unit Test: getErrorString (#54)
* [ORO-0] Support hiprtc0504 (#55)
* [ORO-0] Update hiprtc and orortc error codes (#57)
* [ORO-0] Update test scripts to delete cache before running (#58)
* [ORO-0] Update hiprtc dlls
* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation
* Fix apt python installation (#63) Update checkout version
* [ORO-0] OrochiUtils update. (#61)
* [ORO-0] Add WMMA test (#62)
* [ORO-0] Add a comment for WMMA
* [ORO-0] Remove hip_runtime include
* [ORO-0] Add Copyright notice
* [ORO-0] Load binary from the directory where DLL is.
* [ORO-0] Fix for linux.
* [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. * Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <[email protected]> * fix script Signed-off-by: Chih-Chen Kao <[email protected]> * fix Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <[email protected]> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. 
* Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <[email protected]> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> * [ORO-0] bitcode/cubin linking APIs (#40) * [ORO-0] Link apis. 
* [ORO-0] Forgot to add. * [ORO-0] Linking test. * [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize * [ORO-0] Update link unit tests with comments * [ORO-0] Change test for CUBIN instead of PTX * [ORO-0] Fix loadfile to use binary mode, remove printf in kernel * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Load amdhip first, then hiprtc. * [ORO-0] Remove assert from hiprtc library checks * [ORO-0] Add gfx1030 bitcode for navi21 * [MNN-0] Fix premake and add more link testcases * [ORO-0] Update a link_null_name testcase * [ORO-0] Make unit tests more stable on CUDA * [ORO-0] Update bitcode for gfx1030 * [ORO-0] Add bitcodes for navi1,2, vega * [ORO-0] Add hiprtc.dll and comgr dll * [ORO-0] Add gfx906 bitcodes * [ORO-0] Support unit tests on both HIP and CUDA * [ORO-0] Update dlls and bitcodes * [ORO-0] Update bitcodes and generation script * [ORO-0] Minor fixes in bundled bitcode unit tests * [ORO-0] Fix typo in options * [ORO-0] Fix getCUBIN/PTX signatures * [ORO-0] Fix unit tests and generate fatbin for CUDA * [ORO-0] Regenerate fatbin and fix script * [ORO-0] Cleanup * [ORO-0] Update bundled bitcodes to only contain navi21 for now * [ORO-0] Updated bundled bitcode * [ORO-0] add ORO_LAUNCH_PARAMS_* * [ORO-0] Add unit test for orortcLinkAddFile * [ORO-0] Add unittest scripts for TC * [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA * [ORO-0] Add bitcode+bundled bitcode link test * [ORO-0] Cleanup * [ORO-0] Fix typo in script * [ORO-0] Update linux TC script Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Get global memory size for CUDA (#44) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Add getLoweredName testcase * [ORO-0] Update unittest filter * [ORO-0] Update loweredName test * [ORO-0] Add missing test kernel * 
[ORO-0] Fix loweredName test * [ORO-0] Fix linux compilation * [ORO-0] Remove printf from test kernel (#37) * [ORO-0] Allow usage of libhiprtc64.so if exists * [ORO-0] Fix linux loading of libhiprtc.so Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: PixelClear <[email protected]> * Feature/oro 0 radix sort stream (#34) * Initial commit * Streams to the configuration * Mutex in OrochiUtils * Feature/oro 0 radix sort mutex baking (#36) * Locking other methods in OrochiUtils * Removing mutex from static methods * Making mutex and map static * Removing static from OrochiUtils * Removing static from OrochiUtils * Support Precompiled Kernels in Orochi (#37) * Add bitcode support: getFunctionFromPrecompiledBinary Signed-off-by: Chih-Chen Kao <[email protected]> * Add bitcode and the script to generate it. Signed-off-by: Chih-Chen Kao <[email protected]> * rewrite OROASSERT. Fix include file order. 
Signed-off-by: Chih-Chen Kao <[email protected]> * Use string instead of const char* Signed-off-by: Chih-Chen Kao <[email protected]> * Rename the option from bitcode to precompiled Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Add bitcode script for nvidia fatbin * [ORO-0] CUDA - hipfb->fatbin rename Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> * Feature/oro 0 resource limits (#38) * Adding limit functions * Removing enum * Removing enum * Limit enum * char string Windows API (#39) * [ORO-0] Update precompiled radix sort kernels to use -ffast-math (#42) * [ORO-0] Update precompiled radix sort kernels to use -ffast-math * [ORO-0] Update RadixSort fatbin for NVIDIA and use fast math * [ORO-0] Function pointer test. (#40) * [ORO-0] Function pointer test. * [ORO-0] launch2d. * [ORO-0] Event, OroStopwatch. * Implement GpuMemory to handle device memory operations. Signed-off-by: Chih-Chen Kao <[email protected]> * Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. 
* [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Feature/oro 0 amdadvtech merge (#43) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . Signed-off-by: Chih-Chen Kao <[email protected]> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (wrap) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for an input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. 
The results of one wavefront (wrap) will be overwritten by work items (threads) in another wavefront (wrap). Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <[email protected]> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. (#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <[email protected]> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <[email protected]> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. 
(#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <[email protected]> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. 
* Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <[email protected]> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <[email protected]> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. 
* Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <[email protected]> * fix script Signed-off-by: Chih-Chen Kao <[email protected]> * fix Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <[email protected]> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. * Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <[email protected]> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. 
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> * [ORO-0] bitcode/cubin linking APIs (#40) * [ORO-0] Link apis. * [ORO-0] Forgot to add. * [ORO-0] Linking test. 
* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize * [ORO-0] Update link unit tests with comments * [ORO-0] Change test for CUBIN instead of PTX * [ORO-0] Fix loadfile to use binary mode, remove printf in kernel * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Load amdhip first, then hiprtc. * [ORO-0] Remove assert from hiprtc library checks * [ORO-0] Add gfx1030 bitcode for navi21 * [MNN-0] Fix premake and add more link testcases * [ORO-0] Update a link_null_name testcase * [ORO-0] Make unit tests more stable on CUDA * [ORO-0] Update bitcode for gfx1030 * [ORO-0] Add bitcodes for navi1,2, vega * [ORO-0] Add hiprtc.dll and comgr dll * [ORO-0] Add gfx906 bitcodes * [ORO-0] Support unit tests on both HIP and CUDA * [ORO-0] Update dlls and bitcodes * [ORO-0] Update bitcodes and generation script * [ORO-0] Minor fixes in bundled bitcode unit tests * [ORO-0] Fix typo in options * [ORO-0] Fix getCUBIN/PTX signatures * [ORO-0] Fix unit tests and generate fatbin for CUDA * [ORO-0] Regenerate fatbin and fix script * [ORO-0] Cleanup * [ORO-0] Update bundled bitcodes to only contain navi21 for now * [ORO-0] Updated bundled bitcode * [ORO-0] add ORO_LAUNCH_PARAMS_* * [ORO-0] Add unit test for orortcLinkAddFile * [ORO-0] Add unittest scripts for TC * [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA * [ORO-0] Add bitcode+bundled bitcode link test * [ORO-0] Cleanup * [ORO-0] Fix typo in script * [ORO-0] Update linux TC script Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Get global memory size for CUDA (#44) * [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Add getLoweredName testcase * [ORO-0] Update unittest filter * [ORO-0] Update loweredName test * [ORO-0] Add missing test kernel * [ORO-0] Fix loweredName test * [ORO-0] Fix linux compilation * [ORO-0] Remove printf from test kernel 
(#37) * [ORO-0] Fix linux loading of libhiprtc.so (#49) * [ORO-0] Update test scripts (#50) * [ORO-0] Update scripts for linux (#51) * [ORO-0] Add new scripts (#52) * [ORO-0] Add new scripts * [ORO-0] Add execute permissions to scripts * Fix Unit Test: getErrorString (#54) Signed-off-by: Chih-Chen Kao <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Support hiprtc0504 (#55) * [ORO-0] Update hiprtc and orortc error codes (#57) * [ORO-0] Update test scripts to delete cache before running (#58) * [ORO-0] Update hiprtc dlls * [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation * Fix apt python installation (#63) Update checkout version Signed-off-by: Chih-Chen Kao <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] OrochiUtils update. (#61) * [ORO-0] Add WMMA test (#62) * [ORO-0] Add WMMA test * [ORO-0] Add a comment for WMMA * [ORO-0] Cleanup * [ORO-0] Add a couple more comments * [ORO-0] Remove hip_runtime include * [ORO-0] Cleanup * [ORO-0] Fix comment * [ORO-0] Add Copyright notice * [ORO-0] Load binary from the directory where DLL is. * [ORO-0] Fix for linux. --------- Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: PixelClear <[email protected]> * [ORO-0] Remove unnecessary template. * [ORO-0] Clean up. Added python script kernelCompile.py for compilation. (#46) * [ORO-0] Clean up. Added python script kernelCompile.py for compilation. * [ORO-0] hipsdk should be next to orochi dir. 
* Update ParallelPrimitives/RadixSortKernels.h Remove commented line --------- Co-authored-by: Chih-Chen Kao <[email protected]> * [ORO-0] add automatic arch selection (#47) * [ORO-0] add automatic arch selection * [ORO-0] Refactor and error output when it cannot find llc. --------- Co-authored-by: takahiroharada <[email protected]> * Feature/oro 0 flexible rtc error handling cherrypick (#48) * add a handler for RTC load failure case on cuda. * [ORO-0] add a handler for RTC load failure case on hip. * [ORO-0] add cuda 12.0 sdk in nvrtc path * [ORO-0] Remove non bundled bitcode tests. Clean up. * [ORO-0] Clean up. * [ORO-0] Add hiprtcGetBitcodeSize back. * Update Orochi.cpp * Update Orochi.cpp * [ORO-0] Fix for multi-GPU/iGPU * [HIPSDK-0] compute-22.40-osdb/36/ * [ORO-0] compute-23.10-osdb/9/ * [ORO-0] Update dll names * [ORO-0] implement new test for managed memory, enable managed memory api, fix all warnings and cleanup * [ORO-0] fix compile issues * [ORO-0] fix declaration of oroManagedMalloc * [ORO-0] change streaming kernel * [ORO-0] enable it on windows too * [ORO-0] add more asserts * [ORO-0] update kernel * [ORO-0] add host copy times * [ORO-0] add malloc times * Refactor Count Signed-off-by: Chih-Chen Kao <[email protected]> * Refactor Radix Sort class: - Now the tmp buffer is allocated internally. - All GPU memory buffers are changed to the GpuMemory class - `configure` will now calculate the total number of GPU blocks for the count and the scan kernel - The client does not need to call configure explicitly - Refactor function parameters - Remove count reference kernel Signed-off-by: Chih-Chen Kao <[email protected]> * Add `const` Signed-off-by: Chih-Chen Kao <[email protected]> * Thid commit does the followings: - Support setting the the number of thread per block (a.k.a block size) dynamically - Refactor `exclusiveScanCpu` - Extend `printKernelInfo`. 
Signed-off-by: Chih-Chen Kao <[email protected]> * The 1st working example for the radix sort optimization Signed-off-by: Chih-Chen Kao <[email protected]> * Support configuring dynamic "NUM_WARPS_PER_BLOCK" in the sort kernel Compute the optimal number of inputs for each block to handle. Refactor the usage of stopwatch Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] add hiprtc future dll names in hiprtc path * Add linux paths and dll names (#66) * [ORO-0] Change path and rtc dll names * [ORO-0] Make scripts executable * [ORO-0] Add hiprtc path * [ORO-0] Remove ParallelPrimitives, test/radix sort * [ORO-0] Edit premake --------- Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> Co-authored-by: Richard Geslot <[email protected]> Co-authored-by: Atsushi Yoshimura <[email protected]> Co-authored-by: Atsushi.Yoshimura <[email protected]>
Stuff like this really exemplifies the issue I have with AMD's attitude toward GPUOpen: it feels much more about having open source as a bullet point than about making the tools developers would want, tools that are API-agnostic and support all major GPU vendors. AMD wants to compete with Nvidia and unlock some of the market share, but developers, if they care at all, just want to avoid lock-in and have code that works everywhere. That this project doesn't see a reason to make OneAPI (or CPU) backends work as a priority (or bring them up to the 2.0 code) dooms it to being for those devs who care to support AMD specifically. It seems just as self-serving as Nvidia's vendor lock-in, except that Nvidia has the dominant position and AMD is playing catch-up. |
Hi,
if we could get an Orochi OneAPI backend, then desktop GPU support should be complete from a vendor viewpoint.
I hope it gets added eventually, and that I get notified when it's done by having opened the issue :-)
EDIT: I don't know whether OneAPI is currently as complete as HIP and CUDA (for example, whether it has an equivalent of the rtc component), so I can't say whether adding support for it is possible or easy.