Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44)
* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Feature/oro 0 amdadvtech merge (#43)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation produced wrong results when the wavefront (warp) size was not equal to the size of a workgroup (block).
Because not all threads run simultaneously, the previous algorithm does not work for an input array larger than the wavefront size:
it performs the scan in place on the input array, so the results of one wavefront (warp) are overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <[email protected]>
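
The hazard described in the LDS scan fix above can be reproduced on the CPU. The sketch below is not the Orochi kernel: it models a Hillis-Steele inclusive scan in which work items run in wavefront-sized batches, each batch reading its inputs before writing its outputs. `simulateScan`, `waveSize`, and `doubleBuffer` are hypothetical names, and writing into a second buffer is a common remedy assumed here for illustration; the commit message does not show the actual fix.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU model of a Hillis-Steele inclusive scan over one workgroup's shared array.
// Work items run in batches of `waveSize` (a stand-in for a wavefront/warp, must be > 0).
// Within a batch, all reads happen before all writes, as in lock-step execution.
// With doubleBuffer == false the scan is in place, so a later batch can read
// values that an earlier batch already updated in the same step.
std::vector<int> simulateScan( std::vector<int> data, std::size_t waveSize, bool doubleBuffer )
{
    const std::size_t n = data.size();
    std::vector<int> temp( n ); // second buffer, used only when doubleBuffer is true
    for( std::size_t offset = 1; offset < n; offset *= 2 )
    {
        std::vector<int>& dst = doubleBuffer ? temp : data;
        for( std::size_t base = 0; base < n; base += waveSize )
        {
            const std::size_t end = std::min( base + waveSize, n );
            std::vector<int> regs( end - base ); // per-work-item registers
            // Read phase: every work item in the batch loads its two operands.
            for( std::size_t i = base; i < end; ++i )
                regs[i - base] = data[i] + ( i >= offset ? data[i - offset] : 0 );
            // Write phase: every work item in the batch stores its result.
            for( std::size_t i = base; i < end; ++i )
                dst[i] = regs[i - base];
        }
        if( doubleBuffer ) data.swap( temp ); // results of this step become the next input
    }
    return data;
}
```

In this model, the in-place variant is only correct when a single batch covers the whole array, that is, when the wavefront size equals the workgroup size; the double-buffered variant is correct for any batch size.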

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and supports an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel so that it uses less LDS and can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <[email protected]>
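
The idea of setting the upper bound once before the loop can be sketched in plain C++. This is an illustration under assumptions, not the kernel code from the commit; `sumRange`, `begin`, and `itemsPerWorker` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sums the items a worker owns: [begin, begin + itemsPerWorker), clipped to the array size.
// The clipped upper bound is computed once, so the loop body carries no per-item
// "is this index still inside the array?" check.
long long sumRange( const std::vector<int>& data, std::size_t begin, std::size_t itemsPerWorker )
{
    const std::size_t end = std::min( begin + itemsPerWorker, data.size() ); // set once
    long long sum = 0;
    for( std::size_t i = begin; i < end; ++i ) // empty range if begin >= end
        sum += data[i];
    return sum;
}
```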

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix script

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <[email protected]>
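
For reference, an exclusive scan is straightforward to write by hand when `std::exclusive_scan` (C++17, `<numeric>`) is unavailable. The following is a minimal sketch, not the code from the commit; `exclusiveScan` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Hand-written exclusive prefix sum: out[i] = init + in[0] + ... + in[i-1].
// For integer addition this gives the same result as std::exclusive_scan,
// without requiring C++17 standard library support.
std::vector<int> exclusiveScan( const std::vector<int>& in, int init = 0 )
{
    std::vector<int> out( in.size() );
    int running = init;
    for( std::size_t i = 0; i < in.size(); ++i )
    {
        out[i] = running; // sum of everything before element i
        running += in[i];
    }
    return out;
}
```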

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <[email protected]>
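
One way to avoid returning a stale handle is to key the cache on more than the function name. The sketch below assumes a key made of the kernel source path plus the function name; `KernelCache` and `FunctionHandle` are hypothetical stand-ins (the latter for `oroFunction`), and the commit message does not say how the actual fix works.

```cpp
#include <map>
#include <string>
#include <utility>

using FunctionHandle = int; // hypothetical stand-in for oroFunction

// Cache keyed by (source path, function name) instead of the name alone,
// so two kernels that share a name but come from different sources never collide.
class KernelCache
{
  public:
    // Returns true and writes the handle to `out` only on an exact key match.
    bool find( const std::string& path, const std::string& name, FunctionHandle& out ) const
    {
        const auto it = m_entries.find( std::make_pair( path, name ) );
        if( it == m_entries.end() ) return false;
        out = it->second;
        return true;
    }

    void insert( const std::string& path, const std::string& name, FunctionHandle f )
    {
        m_entries[std::make_pair( path, name )] = f;
    }

    // Drop all handles, e.g. when the context that owns them is destroyed.
    void clear() { m_entries.clear(); }

  private:
    std::map<std::pair<std::string, std::string>, FunctionHandle> m_entries;
};
```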

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to work around the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

* [ORO-0] bitcode/cubin linking APIs (#40)

* [ORO-0] Link apis.

* [ORO-0] Forgot to add.

* [ORO-0] Linking test.

* [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize

* [ORO-0] Update link unit tests with comments

* [ORO-0] Change test for CUBIN instead of PTX

* [ORO-0] Fix loadfile to use binary mode, remove printf in kernel

* [ORO-0] Adding hiprtc to work around the regression in 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Load amdhip first, then hiprtc.

* [ORO-0] Remove assert from hiprtc library checks

* [ORO-0] Add gfx1030 bitcode for navi21

* [MNN-0] Fix premake and add more link testcases

* [ORO-0] Update a link_null_name testcase

* [ORO-0] Make unit tests more stable on CUDA

* [ORO-0] Update bitcode for gfx1030

* [ORO-0] Add bitcodes for navi1,2, vega

* [ORO-0] Add hiprtc.dll and comgr dll

* [ORO-0] Add gfx906 bitcodes

* [ORO-0] Support unit tests on both HIP and CUDA

* [ORO-0] Update dlls and bitcodes

* [ORO-0] Update bitcodes and generation script

* [ORO-0] Minor fixes in bundled bitcode unit tests

* [ORO-0] Fix typo in options

* [ORO-0] Fix getCUBIN/PTX signatures

* [ORO-0] Fix unit tests and generate fatbin for CUDA

* [ORO-0] Regenerate fatbin and fix script

* [ORO-0] Cleanup

* [ORO-0] Update bundled bitcodes to only contain navi21 for now

* [ORO-0] Updated bundled bitcode

* [ORO-0] add ORO_LAUNCH_PARAMS_*

* [ORO-0] Add unit test for orortcLinkAddFile

* [ORO-0] Add unittest scripts for TC

* [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA

* [ORO-0] Add bitcode+bundled bitcode link test

* [ORO-0] Cleanup

* [ORO-0] Fix typo in script

* [ORO-0] Update linux TC script

Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Get global memory size for CUDA (#44)

* [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46)

* [ORO-0] Update HIP dll's for bitcode linking support

* [ORO-0] Add getLoweredName testcase

* [ORO-0] Update unittest filter

* [ORO-0] Update loweredName test

* [ORO-0] Add missing test kernel

* [ORO-0] Fix loweredName test

* [ORO-0] Fix linux compilation

* [ORO-0] Remove printf from test kernel (#37)

* [ORO-0] Fix linux loading of libhiprtc.so (#49)

* [ORO-0] Update test scripts (#50)

* [ORO-0] Update scripts for linux (#51)

* [ORO-0] Add new scripts (#52)

* [ORO-0] Add new scripts

* [ORO-0] Add execute permissions to scripts

* Fix Unit Test: getErrorString (#54)

Signed-off-by: Chih-Chen Kao <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Support hiprtc0504 (#55)

* [ORO-0] Update hiprtc and orortc error codes (#57)

* [ORO-0] Update test scripts to delete cache before running (#58)

* [ORO-0] Update hiprtc dlls

* [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation

* Fix apt python installation (#63)

Update checkout version


Signed-off-by: Chih-Chen Kao <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] OrochiUtils update. (#61)

* [ORO-0] Add WMMA test (#62)

* [ORO-0] Add WMMA test

* [ORO-0] Add a comment for WMMA

* [ORO-0] Cleanup

* [ORO-0] Add a couple more comments

* [ORO-0] Remove hip_runtime include

* [ORO-0] Cleanup

* [ORO-0] Fix comment

* [ORO-0] Add Copyright notice

* [ORO-0] Load binary from the directory where DLL is.

* [ORO-0] Fix for linux.

---------

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: PixelClear <[email protected]>
8 people authored Jan 27, 2023
1 parent 6314b2b commit 2280e20
Showing 33 changed files with 4,996 additions and 24 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/action.yml
@@ -4,9 +4,9 @@ jobs:
   build-linux:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: configure
-        run: sudo apt-get install python
+        run: sudo apt-get install python3
       - name: chmod
         run: chmod +x ./tools/premake5/linux64/premake5
       - name: premake
11 changes: 11 additions & 0 deletions Orochi/Orochi.h
@@ -161,7 +161,18 @@ typedef struct _orortcProgram* orortcProgram;
enum orortcResult
{
ORORTC_SUCCESS = 0,
ORORTC_ERROR_OUT_OF_MEMORY = 1,
ORORTC_ERROR_PROGRAM_CREATION_FAILURE = 2,
ORORTC_ERROR_INVALID_INPUT = 3,
ORORTC_ERROR_INVALID_PROGRAM = 4,
ORORTC_ERROR_INVALID_OPTION = 5,
ORORTC_ERROR_COMPILATION = 6,
ORORTC_ERROR_BUILTIN_OPERATION_FAILURE = 7,
ORORTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION = 8,
ORORTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 9,
ORORTC_ERROR_NAME_EXPRESSION_NOT_VALID = 10,
ORORTC_ERROR_INTERNAL_ERROR = 11,
ORORTC_ERROR_LINKING = 100
};

typedef enum oroEvent_flags_enum
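
When logging compile or link failures, it can help to print these codes by name. The sketch below is a hypothetical helper, not part of Orochi: `orortcResultName` is an invented name, and the enum is copied locally from the hunk above only so the example compiles on its own.

```cpp
#include <string>

// Local copy of the enum from the diff above, so this sketch is self-contained.
enum orortcResult
{
    ORORTC_SUCCESS = 0,
    ORORTC_ERROR_OUT_OF_MEMORY = 1,
    ORORTC_ERROR_PROGRAM_CREATION_FAILURE = 2,
    ORORTC_ERROR_INVALID_INPUT = 3,
    ORORTC_ERROR_INVALID_PROGRAM = 4,
    ORORTC_ERROR_INVALID_OPTION = 5,
    ORORTC_ERROR_COMPILATION = 6,
    ORORTC_ERROR_BUILTIN_OPERATION_FAILURE = 7,
    ORORTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION = 8,
    ORORTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 9,
    ORORTC_ERROR_NAME_EXPRESSION_NOT_VALID = 10,
    ORORTC_ERROR_INTERNAL_ERROR = 11,
    ORORTC_ERROR_LINKING = 100
};

// Hypothetical helper: maps a result code to its enumerator name for log output.
std::string orortcResultName( orortcResult r )
{
    switch( r )
    {
    case ORORTC_SUCCESS: return "ORORTC_SUCCESS";
    case ORORTC_ERROR_OUT_OF_MEMORY: return "ORORTC_ERROR_OUT_OF_MEMORY";
    case ORORTC_ERROR_PROGRAM_CREATION_FAILURE: return "ORORTC_ERROR_PROGRAM_CREATION_FAILURE";
    case ORORTC_ERROR_INVALID_INPUT: return "ORORTC_ERROR_INVALID_INPUT";
    case ORORTC_ERROR_INVALID_PROGRAM: return "ORORTC_ERROR_INVALID_PROGRAM";
    case ORORTC_ERROR_INVALID_OPTION: return "ORORTC_ERROR_INVALID_OPTION";
    case ORORTC_ERROR_COMPILATION: return "ORORTC_ERROR_COMPILATION";
    case ORORTC_ERROR_BUILTIN_OPERATION_FAILURE: return "ORORTC_ERROR_BUILTIN_OPERATION_FAILURE";
    case ORORTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION: return "ORORTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION";
    case ORORTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION: return "ORORTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION";
    case ORORTC_ERROR_NAME_EXPRESSION_NOT_VALID: return "ORORTC_ERROR_NAME_EXPRESSION_NOT_VALID";
    case ORORTC_ERROR_INTERNAL_ERROR: return "ORORTC_ERROR_INTERNAL_ERROR";
    case ORORTC_ERROR_LINKING: return "ORORTC_ERROR_LINKING";
    }
    return "ORORTC_ERROR_UNKNOWN"; // value outside the enumerators listed above
}
```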
38 changes: 37 additions & 1 deletion ParallelPrimitives/RadixSort.cpp
@@ -13,6 +13,10 @@
 // clang-format on
 #endif
 
+#if defined(__GNUC__)
+#include <dlfcn.h>
+#endif
+
 namespace
 {
 #if defined( ORO_PRECOMPILED )
@@ -21,6 +25,22 @@ constexpr auto useBitCode = true;
 constexpr auto useBitCode = false;
 #endif
 
+#if !defined(__GNUC__)
+const HMODULE GetCurrentModule()
+{
+    HMODULE hModule = NULL;
+    // hModule is NULL if GetModuleHandleEx fails.
+    GetModuleHandleEx( GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT, (LPCTSTR)GetCurrentModule, &hModule );
+    return hModule;
+}
+#else
+void GetCurrentModule1()
+{
+}
+#endif
+
+
+
 void printKernelInfo( oroFunction func )
 {
     int numReg{};
@@ -83,11 +103,27 @@ void RadixSort::compileKernels( oroDevice device, OrochiUtils& oroutils, const s
     const auto currentKernelPath{ ( kernelPath == "" ) ? defaultKernelPath : kernelPath };
     const auto currentIncludeDir{ ( includeDir == "" ) ? defaultIncludeDir : includeDir };
 
+    auto getCurrentDir = []()
+    {
+#if !defined(__GNUC__)
+        HMODULE hm = GetCurrentModule();
+        char buff[MAX_PATH];
+        GetModuleFileName( hm, buff, MAX_PATH );
+#else
+        Dl_info info;
+        dladdr( (const void*)GetCurrentModule1, &info );
+        const char* buff = info.dli_fname;
+#endif
+        std::string::size_type position = std::string( buff ).find_last_of( "\\/" );
+        return std::string( buff ).substr( 0, position ) + "/";
+    };
+
     std::string binaryPath{};
     if constexpr( useBitCode )
     {
         const bool isAmd = oroGetCurAPI( 0 ) == ORO_API_HIP;
-        binaryPath = isAmd ? "../bitcodes/oro_compiled_kernels.hipfb" : "../bitcodes/oro_compiled_kernels.fatbin";
+        binaryPath = getCurrentDir();
+        binaryPath += isAmd ? "oro_compiled_kernels.hipfb" : "oro_compiled_kernels.fatbin";
         if( m_flags == Flag::LOG )
         {
             std::cout << "loading pre-compiled kernels at path : " << binaryPath << '\n';
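
The path-splitting step of `getCurrentDir` can be isolated and tested without any platform calls. The sketch below is an assumption-labeled variant, not the committed code: `directoryOf` is a hypothetical name, and unlike the lambda in the diff it returns `./` when the path has no separator (in that case `find_last_of` returns `npos`, and `substr( 0, npos )` would keep the whole file name).

```cpp
#include <string>

// Returns the directory part of `path`, with a trailing '/'.
// Both '/' and '\\' count as separators, as in the lambda from the diff.
// Assumption for this sketch: a bare file name maps to the current directory.
std::string directoryOf( const std::string& path )
{
    const std::string::size_type position = path.find_last_of( "\\/" );
    if( position == std::string::npos )
        return "./"; // no separator found
    return path.substr( 0, position ) + "/";
}
```

The module path itself would still come from `GetModuleFileName` on Windows or `dladdr` elsewhere, as in the commit.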