Commit
Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44)
* Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Feature/oro 0 amdadvtech merge (#43) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . 
Signed-off-by: Chih-Chen Kao <[email protected]> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for an input array larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp). Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. This version does not require a temp buffer and can support an LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <[email protected]> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <[email protected]> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <[email protected]> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. 
(#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <[email protected]> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <[email protected]> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. 
* Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <[email protected]> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <[email protected]> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <[email protected]> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <[email protected]> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <[email protected]> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <[email protected]> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <[email protected]> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Make single pass kernel non compile time switch. 
* Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. * Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <[email protected]> * fix script Signed-off-by: Chih-Chen Kao <[email protected]> * fix Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <[email protected]> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. 
The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. * Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <[email protected]> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. 
* [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <[email protected]> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: Aaryaman Vasishta <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: PixelClear <[email protected]> * [ORO-0] bitcode/cubin linking APIs (#40) * [ORO-0] Link apis. * [ORO-0] Forgot to add. * [ORO-0] Linking test. * [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize * [ORO-0] Update link unit tests with comments * [ORO-0] Change test for CUBIN instead of PTX * [ORO-0] Fix loadfile to use binary mode, remove printf in kernel * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Load amdhip first, then hiprtc. 
* [ORO-0] Remove assert from hiprtc library checks * [ORO-0] Add gfx1030 bitcode for navi21 * [MNN-0] Fix premake and add more link testcases * [ORO-0] Update a link_null_name testcase * [ORO-0] Make unit tests more stable on CUDA * [ORO-0] Update bitcode for gfx1030 * [ORO-0] Add bitcodes for navi1,2, vega * [ORO-0] Add hiprtc.dll and comgr dll * [ORO-0] Add gfx906 bitcodes * [ORO-0] Support unit tests on both HIP and CUDA * [ORO-0] Update dlls and bitcodes * [ORO-0] Update bitcodes and generation script * [ORO-0] Minor fixes in bundled bitcode unit tests * [ORO-0] Fix typo in options * [ORO-0] Fix getCUBIN/PTX signatures * [ORO-0] Fix unit tests and generate fatbin for CUDA * [ORO-0] Regenerate fatbin and fix script * [ORO-0] Cleanup * [ORO-0] Update bundled bitcodes to only contain navi21 for now * [ORO-0] Updated bundled bitcode * [ORO-0] add ORO_LAUNCH_PARAMS_* * [ORO-0] Add unit test for orortcLinkAddFile * [ORO-0] Add unittest scripts for TC * [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA * [ORO-0] Add bitcode+bundled bitcode link test * [ORO-0] Cleanup * [ORO-0] Fix typo in script * [ORO-0] Update linux TC script Co-authored-by: takahiroharada <[email protected]> * [ORO-0] Get global memory size for CUDA (#44) * [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Add getLoweredName testcase * [ORO-0] Update unittest filter * [ORO-0] Update loweredName test * [ORO-0] Add missing test kernel * [ORO-0] Fix loweredName test * [ORO-0] Fix linux compilation * [ORO-0] Remove printf from test kernel (#37) * [ORO-0] Fix linux loading of libhiprtc.so (#49) * [ORO-0] Update test scripts (#50) * [ORO-0] Update scripts for linux (#51) * [ORO-0] Add new scripts (#52) * [ORO-0] Add new scripts * [ORO-0] Add execute permissions to scripts * Fix Unit Test: getErrorString (#54) Signed-off-by: Chih-Chen Kao <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] 
Support hiprtc0504 (#55) * [ORO-0] Update hiprtc and orortc error codes (#57) * [ORO-0] Update test scripts to delete cache before running (#58) * [ORO-0] Update hiprtc dlls * [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation * Fix apt python installation (#63) Update checkout version Signed-off-by: Chih-Chen Kao <[email protected]> Signed-off-by: Chih-Chen Kao <[email protected]> * [ORO-0] OrochiUtils update. (#61) * [ORO-0] Add WMMA test (#62) * [ORO-0] Add WMMA test * [ORO-0] Add a comment for WMMA * [ORO-0] Cleanup * [ORO-0] Add a couple more comments * [ORO-0] Remove hip_runtime include * [ORO-0] Cleanup * [ORO-0] Fix comment * [ORO-0] Add Copyright notice * [ORO-0] Load binary from the directory where DLL is. * [ORO-0] Fix for linux. --------- Signed-off-by: Chih-Chen Kao <[email protected]> Co-authored-by: Takahiro Harada <[email protected]> Co-authored-by: takahiroharada <[email protected]> Co-authored-by: Chih-Chen Kao <[email protected]> Co-authored-by: NevesLucas <[email protected]> Co-authored-by: Mehmet Oguz Derin <[email protected]> Co-authored-by: Daniel Meister <[email protected]> Co-authored-by: PixelClear <[email protected]>