Perf tuning for gcc + aarch64 #176
@@ -233,9 +233,7 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
     int shift = Bits::FindLSBSetNonZero64(xorval);
     size_t matched_bytes = shift >> 3;
     uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);
-#ifndef __x86_64__
-    a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
-#else
+#ifdef __x86_64__
     // Ideally this would just be
     //
     //   a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
@@ -250,6 +248,14 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
         : "+r"(a2)
         : "r"(a3), "r"(xorval)
         : "cc");
+#elif defined(__aarch64__)
+    asm("cmp %w[xorval], 0\n\t"
+        "csel %x[a2], %[a3], %[a2], eq\n\t"
+        : [a2] "+r" (a2)
+        : [a3] "r" (a3) , [xorval] "r" (xorval)
+        : "cc");
Comment on lines +251 to +256

I'm not sure if this is what clang was generating for the default code on aarch64, but this is the direct translation from the x86 asm. I verified that it is faster.

What does the generated code look like before/after? Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not, otherwise it will never get fixed! Note that using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.

I agree! However, this code was already using inline asm like this right here for x86, and I am trying to follow local style. All I'm doing is adding a direct translation of the x86 asm to aarch64.

The code already has a comment right above, in the x86 case, describing why clang doesn't use cmov (lines 237 to 245 in 5ec5d16). The full results with the change are in the PR description.

Based on that comment, I'm not sure that a bug report would necessarily be helpful here. It appears that while compilers are able to generate csel/cmov here, they currently decide not to because they have reason to suspect that in the general case it would be more harmful than helpful. However, in this specific case there is local knowledge that it is worth using cmov/csel, and with a fairly large end-to-end impact.

Extending from what is said in the comment (with some of my own wild guesses), one of the issues is that the desired code adds an unconditional dependency on a load that the compiler would otherwise move inside the branch and therefore could avoid entirely. In this case the load is very likely to already be in L1, but the compiler may not know that, and if it weren't, that would be very bad for perf. If that is part of the reason, then it seems totally reasonable for the compiler to be conservative about adding data dependencies to loads.

That said, I do plan to file some gcc bugs after this. In particular, the codegen for memmove-by-memcpy-through-temp-array is really bad and has even regressed from the last release.
It is a bit hard to compare because this function is inlined into a larger one, and this change shuffles register allocations around and renumbers labels, but here is the relevant bit (with a few common instructions before and after).

Before:

    rbit x6, x1
    clz x6, x6
    asr w2, w6, 3
    cbnz w1, .L79
    ldr x5, [x9, 8]
.L79:
    add w2, w2, 4
    and w6, w6, 24
    sub x4, x9, x4
    sxtw x2, w2
    add x1, x9, x2
    lsr x6, x5, x6
.L80:
    lsl w5, w4, 8

After:

    rbit x1, x8
    clz x1, x1
    and w7, w1, 24
    lsr x1, x1, 3
    add x1, x1, 4
    ldr x14, [x4, 8]
    add x4, x4, x1
    cmp w8, 0
    csel x2, x14, x2, eq
    lsr x7, x2, x7
.L79:
    lsl w2, w11, 8

Getting back to this, GCC 14 fixes the memmove-by-memcpy-through-temp-array issue (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618): https://www.godbolt.org/z/qMraefr1K. There is a redundant stack reservation for larger sizes, but this won't be an issue in non-trivial functions.
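For what it's worth, if the maintainers do want the asm abstracted as suggested above, it could look roughly like the sketch below. SelectIfZero is a hypothetical helper name, not part of this patch; it just wraps the same two inline-asm sequences this diff uses.

    #include <cstdint>

    // Hypothetical helper (not in the patch): returns if_zero when selector
    // is zero, otherwise returns otherwise, using the same inline asm as the
    // two call sites in this diff.
    inline uint64_t SelectIfZero(uint32_t selector, uint64_t if_zero,
                                 uint64_t otherwise) {
    #if defined(__x86_64__)
      uint64_t result = otherwise;
      asm("testl %k2, %k2\n\t"
          "cmovzq %1, %0\n\t"
          : "+r"(result)
          : "r"(if_zero), "r"(selector)
          : "cc");
      return result;
    #elif defined(__aarch64__)
      uint64_t result = otherwise;
      asm("cmp %w[sel], 0\n\t"
          "csel %x[res], %[alt], %[res], eq\n\t"
          : [res] "+r"(result)
          : [alt] "r"(if_zero), [sel] "r"(selector)
          : "cc");
      return result;
    #else
      return selector == 0 ? if_zero : otherwise;
    #endif
    }

Call sites would then become a2 = SelectIfZero(static_cast<uint32_t>(xorval), a3, a2); and the per-arch asm would live in one place.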
+#else
+    a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
 #endif
     *data = a2 >> (shift & (3 * 8));
     return std::pair<size_t, bool>(matched_bytes, true);
@@ -276,14 +282,20 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
     int shift = Bits::FindLSBSetNonZero64(xorval);
     size_t matched_bytes = shift >> 3;
     uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);
-#ifndef __x86_64__
-    a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
-#else
+#ifdef __x86_64__
     asm("testl %k2, %k2\n\t"
         "cmovzq %1, %0\n\t"
         : "+r"(a2)
         : "r"(a3), "r"(xorval)
         : "cc");
+#elif defined(__aarch64__)
+    asm("cmp %w[xorval], 0\n\t"
+        "csel %x[a2], %[a3], %[a2], eq\n\t"
+        : [a2] "+r" (a2)
+        : [a3] "r" (a3) , [xorval] "r" (xorval)
+        : "cc");
+#else
+    a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
 #endif
     *data = a2 >> (shift & (3 * 8));
     matched += matched_bytes;
@@ -100,6 +100,57 @@ using internal::V128_StoreU;
 using internal::V128_DupChar;
 #endif
 
+// GCC dispatches to libc for memmoves > 16 bytes, so we need to
+// do some work to get good code from that compiler. Clang handles
+// powers-of-2 at least up to 64 well.
Comment on lines +103 to +105

Sigh. This is really annoying and I don't feel good about this code. However, the perf impact is massive. https://www.godbolt.org/z/EjP9zv3e7

It looks like gcc is better for aarch64 in trunk, but not in any released version (as of today), and not for x86_64 either.
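In case the Compiler Explorer link goes stale, the comparison boils down to something like the following standalone sketch (function names are made up for illustration and are not part of the patch):

    #include <cstring>

    // Plain fixed-size memmove: released gcc versions tend to emit a libc call.
    void move64_plain(void* dest, const void* src) {
      std::memmove(dest, src, 64);
    }

    // memcpy through a temporary buffer: gcc can expand the fixed-size memcpys
    // inline, which is the workaround FixedSizeMemMove below relies on for
    // sizes above 16 bytes.
    void move64_via_tmp(void* dest, const void* src) {
      char tmp[64];
      std::memcpy(tmp, src, 64);
      std::memcpy(dest, tmp, 64);
    }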
+#if !defined(__GNUC__) || defined(__clang__)
+template <size_t SIZE>
+SNAPPY_ATTRIBUTE_ALWAYS_INLINE
+inline void FixedSizeMemMove(void* dest, const void* src) {
+  memmove(dest, src, SIZE);
+}
+#else
+
+template <size_t SIZE>
+SNAPPY_ATTRIBUTE_ALWAYS_INLINE
+inline void FixedSizeMemMove(void* dest, const void* src) {
+  if (SIZE <= 16) {
+    // gcc has patterns for memmove up to 16 bytes
+    memmove(dest, src, SIZE);
+  } else {
+    // This generates reasonable code on x86_64, but on aarch64 this produces a
+    // dead store to tmp, plus takes up stack space.
+    char tmp[SIZE];
+    memcpy(tmp, src, SIZE);
+    memcpy(dest, tmp, SIZE);
+  }
+}
+
+#ifdef __aarch64__ // Implies neon support
Suggested change

Just in case there is a weird ARM platform that doesn't have the
+template <>
+SNAPPY_ATTRIBUTE_ALWAYS_INLINE
+inline void FixedSizeMemMove<32>(void* dest, const void* src) {
+  V128 a = V128_LoadU(reinterpret_cast<const V128*>(src));
+  V128 b = V128_LoadU(reinterpret_cast<const V128*>(src) + 1);
+  V128_StoreU(reinterpret_cast<V128*>(dest), a);
+  V128_StoreU(reinterpret_cast<V128*>(dest) + 1, b);
+}
+
+template <>
+SNAPPY_ATTRIBUTE_ALWAYS_INLINE
+inline void FixedSizeMemMove<64>(void* dest, const void* src) {
+  V128 a = V128_LoadU(reinterpret_cast<const V128*>(src));
+  V128 b = V128_LoadU(reinterpret_cast<const V128*>(src) + 1);
+  V128 c = V128_LoadU(reinterpret_cast<const V128*>(src) + 2);
+  V128 d = V128_LoadU(reinterpret_cast<const V128*>(src) + 3);
+  V128_StoreU(reinterpret_cast<V128*>(dest), a);
+  V128_StoreU(reinterpret_cast<V128*>(dest) + 1, b);
+  V128_StoreU(reinterpret_cast<V128*>(dest) + 2, c);
+  V128_StoreU(reinterpret_cast<V128*>(dest) + 3, d);
+}
+#endif
+#endif
+
 // We translate the information encoded in a tag through a lookup table to a
 // format that requires fewer instructions to decode. Effectively we store
 // the length minus the tag part of the offset. The lowest significant byte
@@ -1060,13 +1111,18 @@ void MemCopy64(char* dst, const void* src, size_t size) {
     data = _mm256_lddqu_si256(static_cast<const __m256i *>(src) + 1);
     _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst) + 1, data);
   }
+#elif defined(__aarch64__)
+  // Emperically it is faster to just copy all 64 rather than branching.
+  (void)kShortMemCopy;
+  (void)size;
+  FixedSizeMemMove<64>(dst, src);
Comment on lines +1114 to +1118

This is the only novel part of the patch that isn't just massaging the code to do the same thing slightly differently. If you'd prefer, I can pull it out to a separate PR. I also tried … Note that this makes the …

It seems to me all this is overengineering - we know that src < dst since it's typical for a decompressor to copy earlier bytes to append to the end of a buffer. Writing more bytes than requested would be impossible if dst < src since it would corrupt src. So using memcpy is safe even if there is overlap beyond the requested copy size. The order of the emitted loads and stores doesn't matter.

Are you referring to my change right here to unconditionally copy 64 bytes, the existing code (which was already calling memmove), or my introduction of …? My only change at this location was to stop branching on …

I had considered just using memcpy because in practice it should be fine for this use case; however, there were a few reasons I didn't:

* In addition to having addresses that are not naturally aligned, the potentially bigger problem is that the loads would often be differently aligned from the preceding stores, meaning that a load would need to pull from multiple stores, which CPUs tend to punt on (or place severe restrictions on).

I'm referring to all the optimized code that does the copy, both your addition and the existing code. It's correct to use memcpy here. The order of the loads/stores does not matter, any order will work (i.e. your point 1 does not hold). As it happens, on AArch64 all loads are emitted before all stores, so store-load hazards cannot occur (i.e. point 3 is not an issue): https://godbolt.org/z/hva13E3z8 So it would be fine to prefer memmove on targets that emit it inline and use memcpy otherwise as the fallback (with UBSAN, still prefer memmove). If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC) then there would be no need for any of these hacks.

Just to be clear, you are preaching to the choir. I have reported several bad codegen bugs to both gcc and clang. However, even if they were all fixed in trunk tomorrow, that doesn't help me, since I need to get good perf out of the compiler we use in production. And there can be over a year of lag between something hitting gcc trunk and it being in a released compiler, since perf improvements are often not backported. Then I need to wait for us to upgrade our toolchain (which is why I'm testing against gcc 11.2 🙁). So there is still a need for perf hacks to convince current compilers to generate good code. I'll leave it up to the snappy maintainers whether they want to wait for compilers to improve or merge a change to get good perf out of existing compilers.

It looks like gcc has changed its codegen on trunk vs. the latest release. On released versions it does the interleaved loads and stores for memcpy. I'm not sure that doing the loads up front in memcpy is always better, especially after inlining into real code, since that requires more registers (more of a problem for longer memcpys and when inlined into functions already under register pressure).

I still don't love the idea of intentionally calling memcpy with overlapping ranges. But if that is what the snappy maintainers decide to do, then so be it.
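To make the geometry in this thread concrete, here is a rough standalone sketch (illustrative only; AppendMatch and the fixed 64-byte slop are assumptions for the example, not snappy's actual interfaces):

    #include <cstddef>
    #include <cstring>

    // During LZ-style decompression a match copy appends earlier output to the
    // end of the buffer, so src < dst. On the fast path offset >= len, so the
    // `len` bytes that matter never overlap; any overlap involves only the slop
    // bytes beyond `len`, whose values are overwritten by later output anyway,
    // which is the argument above for why memcpy is considered acceptable here.
    void AppendMatch(char* output, size_t produced, size_t offset, size_t len) {
      char* dst = output + produced;   // append position
      const char* src = dst - offset;  // earlier bytes in the same buffer
      (void)len;                       // only the first `len` bytes are meaningful
      // Fixed-size copy that may read and write past `len`; the caller must
      // guarantee at least 64 bytes of readable/writable slop here.
      char tmp[64];
      std::memcpy(tmp, src, 64);
      std::memcpy(dst, tmp, 64);
    }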
 #else
-  std::memmove(dst, src, kShortMemCopy);
+  FixedSizeMemMove<kShortMemCopy>(dst, src);
   // Profiling shows that nearly all copies are short.
   if (SNAPPY_PREDICT_FALSE(size > kShortMemCopy)) {
-    std::memmove(dst + kShortMemCopy,
-                 static_cast<const uint8_t*>(src) + kShortMemCopy,
-                 64 - kShortMemCopy);
+    FixedSizeMemMove<kShortMemCopy>(
+        dst + kShortMemCopy,
+        static_cast<const uint8_t*>(src) + kShortMemCopy);
   }
 #endif
 }
@@ -1102,14 +1158,9 @@ inline size_t AdvanceToNextTagARMOptimized(const uint8_t** ip_p, size_t* tag) {
   // instruction (csinc) and it removes several register moves.
   const size_t tag_type = *tag & 3;
   const bool is_literal = (tag_type == 0);
-  if (is_literal) {
-    size_t next_literal_tag = (*tag >> 2) + 1;
-    *tag = ip[next_literal_tag];
-    ip += next_literal_tag + 1;
-  } else {
-    *tag = ip[tag_type];
-    ip += tag_type + 1;
-  }
+  const size_t next_tag = is_literal ? (*tag >> 2) + 1 : tag_type;
+  *tag = ip[next_tag];
+  ip += (next_tag) + 1;
Comment on lines +1161 to +1163

This is the change that caused the regression on validate. I checked that with this change it is generating the desired …

It is a mixed bag for Validate, but even in the cases where it regresses there, it is still faster for actual decompression.
 
   return tag_type;
 }
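For anyone who wants to eyeball the codegen claim above, the branchless pattern reduces to something like this standalone sketch (a hypothetical helper, not part of the patch), which aarch64 compilers are expected to lower to a conditional-select style instruction (csel/csinc) rather than a branch:

    #include <cstddef>

    // Mirrors the patched tag-advance logic: one ternary instead of an if/else.
    size_t NextTagAdvance(size_t tag, size_t tag_type) {
      const bool is_literal = (tag_type == 0);
      const size_t next_tag = is_literal ? (tag >> 2) + 1 : tag_type;
      return next_tag + 1;  // how far ip advances
    }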
@@ -2013,7 +2064,7 @@ class SnappyArrayWriter {
       *op_p = IncrementalCopy(op - offset, op, op_end, op_limit_);
       return true;
     }
-    std::memmove(op, op - offset, kSlopBytes);
+    FixedSizeMemMove<kSlopBytes>(op, op - offset);
     *op_p = op_end;
     return true;
   }
@@ -2265,7 +2316,7 @@ class SnappyScatteredWriter {
     }
     // Fast path
     char* const op_end = op + len;
-    std::memmove(op, op - offset, kSlopBytes);
+    FixedSizeMemMove<kSlopBytes>(op, op - offset);
     *op_p = op_end;
     return true;
   }
This block was moved to #else to cleanly add an aarch64 case.