-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for AVX10.2, Add AVX10.2 API surface and template tests #111209
Add support for AVX10.2, Add AVX10.2 API surface and template tests #111209
Conversation
Note regarding the
|
1 similar comment
Note regarding the
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
ec7732d
to
216999c
Compare
7f2b5ae
to
83868ab
Compare
src/coreclr/jit/emitxarch.cpp
Outdated
// Reserved for isas Avx10.1 and below | ||
// Needs to be set to 0 for AVX10.2 adn above to indicate YMM embedded rounding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
// Reserved for isas Avx10.1 and below | |
// Needs to be set to 0 for AVX10.2 adn above to indicate YMM embedded rounding | |
// Reserved for isas Avx10.1 and below | |
// Set to 0 on AVX10.2 and above for YMM embedded rounding support |
Or something along those lines. The current wording makes it sound like it must be always set to 0, rather than only conditionally set if we're using embedded rounding for YMM sizes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe on Avx10.1 and below it is required to be set to 1
as well, so this should likely indicate that. Simply reserved doesn't imply a necessary state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
src/coreclr/jit/emitxarch.cpp
Outdated
// Scalar Double Scalar Single Packed Double | ||
return ((b == 0xF2) || (b == 0xF3) || (b == 0x66)); | ||
// Scalar Double Scalar Single Packed Double No prefix | ||
return ((b == 0xF2) || (b == 0xF3) || (b == 0x66) || (b == 0x00)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change isn't necessary given the assert(b != 0
above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
@@ -2314,6 +2346,13 @@ emitter::code_t emitter::emitExtractEvexPrefix(instruction ins, code_t& code) co | |||
break; | |||
} | |||
|
|||
case 0x05: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to handle leadingBytes
being [0x00, 0x04]
and [0x06, 0x07]
?
The higher assert seems to indicate it can be any of those, but you've only added 0x05
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of now we dont have those instructions. But yes, I think it is better to have to handle those as invalid as of now since we dont have those instructions yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
HARDWARE_INTRINSIC(AVX10v2_V512, ConvertToVectorUInt32WithTruncationSaturation, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttps2udqs, INS_vcvttpd2udqs}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbBroadcastCompatible|HW_Flag_EmbMaskingCompatible) | ||
HARDWARE_INTRINSIC(AVX10v2_V512, ConvertToVectorUInt64WithTruncationSaturation, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttps2uqqs, INS_vcvttpd2uqqs}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbBroadcastCompatible|HW_Flag_EmbMaskingCompatible) | ||
HARDWARE_INTRINSIC(AVX10v2_V512, MinMax, 64, 3, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vminmaxps, INS_vminmaxpd}, HW_Category_IMM, HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbBroadcastCompatible|HW_Flag_EmbMaskingCompatible) | ||
HARDWARE_INTRINSIC(AVX10v2_V512, MultipleSumAbsoluteDifferences, 64, 3, {INS_vmpsadbw, INS_invalid, INS_invalid, INS_vmpsadbw, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_EmbMaskingCompatible) // TBD where should we put the instruction typ_byte or typ_ushort? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the TBD
, this has to match the tracked simdBaseType
that the node has.
The node typically defaults to the base type of the SIMD return as that is generally unambiguous and matches the types used for the inputs. However, if there are conflicts there then we switch to the base type of the first or second argument, depending on which is "unique" and allows disambiguation (in the same way it would for overload resolution).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I missed the TBD comment here. So what you are saying is the simdBaseType
of the node needs to be matched here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed these changes in my latest commit
The changes overall LGTM. Just a couple small cleanup asks. It should generally be good for secondary review, CC. @dotnet/jit-contrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me except for one nit and one thing in lookupInstructionSet
that looks like a bug.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx10v2.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Bruce Forstall <[email protected]>
Thanks for all the work here @khushal1996! |
+1 Thanks @khushal1996! |
Thanks @tannergooding @BruceForstall |
This PR covers implementation of approved AVX10.2 APIs along with addition of corresponding template tests and lowering support in JIT.
It enables ymm embedded rounding which comes with AVX10.2
Testing overview
We follow a multi-step testing plan to verify the encoding correctness and the semantic correctness.
Testing results will be presented below.
In codgenxarch.cpp, similar to genAmd64EmitterUnitTestsSse2, we used the JitLateDisasm feature to insert instructions to encode as unit tests for emitter, and LateDisasm will invoke LLVM to disasm the code stream, this gave us the chance to cross validate the disassembly from JIT and LLVM. The output of this step is to verify the emit paths are generating "correct" code that would not trigger #UD or have wrong semantics.
Note that we are using a custom coredistools.dll which uses a recent LLVM that supports AVX10.2 decoding.
In this step, we would run the SuperPMI pipeline to get the asmdiffs,; the inputs are all the MCH files. This step will give us the chance to check if there is any assertion failure or internal error within JIT and since the pipeline will invoke coredistools.dll as well, so we can verify the encoding correctness in a larger scope.
To ensure the new changes will not hit the existing code path in terms of throughput, we ran asmdiffs with base JIT to be the main branch where changes are based on, and diff JIT to be the one with all the changes.
The 2 steps mentioned above are mainly verifying the encoding correctness of the generated binary code. In this step we have used the existing CoreCLR unit test set: JIT and run it in the Intel SDE emulator with
AVX10.2
on and off.Testing results
Run Emitter tests
Result of emitter tests using LLVM disassembler (left - JIT emitted code, right - LLVM disassembler output)
Run superpmi using JITLateDisasm to check for errors
No Decode failures observed in superpmi log.
Running superpmi without JITLateDisasm to check for assert errors
No assertion failures or asm diffs observed
Run JIT subtree with AVX10.2 enabled / disabled
AVX10.2 enabled
AVX10.2 disabled