Perf tuning for gcc + aarch64 #176
base: main
Conversation
@@ -233,9 +233,7 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
int shift = Bits::FindLSBSetNonZero64(xorval);
size_t matched_bytes = shift >> 3;
uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);
#ifndef __x86_64__
This block was moved to #else to cleanly add an aarch64 case.
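Schematically, the restructured dispatch looks roughly like this (a sketch only, bodies elided; see the actual diff for the exact code):

#if defined(__x86_64__)
  // existing x86 inline asm (cmov), unchanged
#elif defined(__aarch64__)
  // new aarch64 inline asm (csel) added by this patch
#else
  // the plain C++ fallback that previously sat under "#ifndef __x86_64__"
#endif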
#elif defined(__aarch64__)
asm("cmp %w[xorval], 0\n\t"
    "csel %x[a2], %[a3], %[a2], eq\n\t"
    : [a2] "+r"(a2)
    : [a3] "r"(a3), [xorval] "r"(xorval)
    : "cc");
I'm not sure if this is what clang was generating for the default code on aarch64, but this is the direct translation from the x86 asm. I verified that it is faster.
What does the generated code look like before/after? Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not, otherwise it will never get fixed!
Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.
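For illustration only, a minimal sketch of such a helper; the name ConditionalSelect64 is hypothetical and not part of the patch. The x86 asm mirrors the existing snappy code and the aarch64 asm mirrors the diff above:

#include <cstdint>

// Hypothetical helper: returns b when pred == 0, otherwise a, using a
// conditional move/select so the compiler cannot turn it back into a branch.
static inline uint64_t ConditionalSelect64(uint32_t pred, uint64_t a,
                                           uint64_t b) {
#if defined(__x86_64__)
  asm("testl %k[pred], %k[pred]\n\t"
      "cmovzq %[b], %[a]\n\t"
      : [a] "+r"(a)
      : [b] "r"(b), [pred] "r"(pred)
      : "cc");
  return a;
#elif defined(__aarch64__)
  asm("cmp %w[pred], 0\n\t"
      "csel %x[a], %[b], %[a], eq\n\t"
      : [a] "+r"(a)
      : [b] "r"(b), [pred] "r"(pred)
      : "cc");
  return a;
#else
  return pred == 0 ? b : a;  // Plain C++; the compiler may still branch here.
#endif
}

The call site in FindMatchLength would then be roughly a2 = ConditionalSelect64(static_cast<uint32_t>(xorval), a2, a3);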
Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.
I agree! However, this code was already using inline asm like this right here for x86, and I am trying to follow local style. All I'm doing is adding a direct translation of the x86 asm to aarch64.
Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not, otherwise it will never get fixed!
The code already has a comment right above in the x86 case describing why clang doesn't use cmov:
Lines 237 to 245 in 5ec5d16
// Ideally this would just be
//
// a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
//
// However clang correctly infers that the above statement participates on
// a critical data dependency chain and thus, unfortunately, refuses to
// use a conditional move (it's tuned to cut data dependencies). In this
// case there is a longer parallel chain anyway AND this will be fairly
// unpredictable.
BM_ZFlat/0 105175 ns 105174 ns 6539 bytes_per_second=928.519M/s html (22.24 %)
BM_ZFlat/1 1385523 ns 1385503 ns 504 bytes_per_second=483.263M/s urls (47.84 %)
BM_ZFlat/2 6700 ns 6699 ns 104479 bytes_per_second=17.1121G/s jpg (99.95 %)
BM_ZFlat/3 333 ns 333 ns 2101538 bytes_per_second=572.801M/s jpg_200 (73.00 %)
BM_ZFlat/4 11754 ns 11754 ns 58783 bytes_per_second=8.11358G/s pdf (83.31 %)
BM_ZFlat/5 453779 ns 453780 ns 1540 bytes_per_second=860.825M/s html4 (22.52 %)
BM_ZFlat/6 430039 ns 430031 ns 1623 bytes_per_second=337.286M/s txt1 (57.87 %)
BM_ZFlat/7 374322 ns 374315 ns 1865 bytes_per_second=318.929M/s txt2 (62.02 %)
BM_ZFlat/8 1159839 ns 1159756 ns 601 bytes_per_second=350.923M/s txt3 (55.17 %)
BM_ZFlat/9 1523393 ns 1523306 ns 459 bytes_per_second=301.672M/s txt4 (66.41 %)
BM_ZFlat/10 97944 ns 97940 ns 6771 bytes_per_second=1.12766G/s pb (19.61 %)
BM_ZFlat/11 370635 ns 370624 ns 1881 bytes_per_second=474.284M/s gaviota (37.73 %)
BM_ZFlatAll 6184921 ns 6184674 ns 111 bytes_per_second=451.585M/s 12 kTestDataFiles
The full results with the change are in the PR description, but to summarize, BM_ZFlatAll shows 473.433M/s, a 4.8% improvement. The pb case (protobufs?) was the most improved at 16%.
Based on that comment, I'm not sure that a bug report would necessarily be helpful here. It appears that while compilers are able to generate csel/cmov here, they currently decide not to because they have reason to suspect that, in the general case, it would be more harmful than helpful. In this specific case, however, there is local knowledge that a cmov/csel is worth it, and with a fairly large end-to-end impact. Extending from what the comment says (with some of my own guesses): one issue is that the desired code adds an unconditional dependency on a load that the compiler could otherwise sink into the branch and potentially avoid entirely. Here that load is very likely already in L1, but the compiler may not know that, and if it weren't, the cost would be severe. If that is part of the reason, it seems entirely reasonable for the compiler to be conservative about adding data dependencies on loads.
That said, I do plan to file some gcc bugs after this. In particular, the codegen for memmove-by-memcpy-through-temp-array is really bad and has even regressed from the last release.
What does the generated code look like before/after?
It is a bit hard to compare because this function is inlined into a larger one, and this change shuffles register allocations and renumbers labels, but here is the relevant bit from each version, with a few common instructions before and after (the old code branches with cbnz; the new code uses csel):
rbit x6, x1
clz x6, x6
asr w2, w6, 3
cbnz w1, .L79
ldr x5, [x9, 8]
.L79:
add w2, w2, 4
and w6, w6, 24
sub x4, x9, x4
sxtw x2, w2
add x1, x9, x2
lsr x6, x5, x6
.L80:
lsl w5, w4, 8
rbit x1, x8
clz x1, x1
and w7, w1, 24
lsr x1, x1, 3
add x1, x1, 4
ldr x14, [x4, 8]
add x4, x4, x1
cmp w8, 0
csel x2, x14, x2, eq
lsr x7, x2, x7
.L79:
lsl w2, w11, 8
Getting back to this, GCC14 fixes the memmove-by-memcpy-through-temp-array issue (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618): https://www.godbolt.org/z/qMraefr1K
There is a redundant stack reservation for larger sizes but this won't be an issue in non-trivial functions.
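For reference, the memmove-by-memcpy-through-temp-array pattern under discussion looks roughly like this (a sketch; the FixedSizeMemMove in the patch may differ in details):

#include <cstring>

// Sketch: copy kSize bytes through a stack temporary so that every load
// happens before any store; overlapping dst/src then works without calling
// the out-of-line libc memmove.
template <int kSize>
inline void FixedSizeMemMove(void* dst, const void* src) {
  char tmp[kSize];
  std::memcpy(tmp, src, kSize);  // all loads first
  std::memcpy(dst, tmp, kSize);  // then all stores
}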
// GCC dispatches to libc for memmoves > 16 bytes, so we need to
// do some work to get good code from that compiler. Clang handles
// powers-of-2 at least up to 64 well.
Sigh. This is really annoying and I don't feel good about this code. However, the perf impact is massive.
https://www.godbolt.org/z/EjP9zv3e7
It looks like gcc is better for aarch64 in trunk, but not in any released version (as of today), and not for x86_64 either.
#elif defined(__aarch64__)
  // Empirically it is faster to just copy all 64 rather than branching.
  (void)kShortMemCopy;
  (void)size;
  FixedSizeMemMove<64>(dst, src);
This is the only novel part of the patch that isn't just massaging the code to do the same thing slightly differently. If you'd prefer, I can pull it out to a separate PR. I also tried move32(); if (size > 32) moveAnother32() (rather than dispatching to variable-length memmove in that case), and the unconditional version was still faster.
Note that this makes the FixedSizeMemMove<32> specialization dead code. So it may make sense to remove it if you keep this change. Alternatively, maybe keep it in case you add another fixed-size memmove of 32 bytes.
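For concreteness, the two shapes being compared look roughly like this; these are hypothetical illustrations (assuming the FixedSizeMemMove helper sketched earlier and char* pointers), not code from the patch:

#include <cstddef>

// Variant that was tried and measured slower: copy 32, branch, maybe copy more.
inline void CopyUpTo64Branching(char* dst, const char* src, std::size_t size) {
  FixedSizeMemMove<32>(dst, src);
  if (size > 32) FixedSizeMemMove<32>(dst + 32, src + 32);
}

// Variant the diff uses: unconditionally copy all 64 bytes.
inline void CopyUpTo64Unconditional(char* dst, const char* src, std::size_t) {
  FixedSizeMemMove<64>(dst, src);
}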
It seems to me all this is overengineering: we know that src < dst, since it's typical for a decompressor to copy earlier bytes to append to the end of a buffer. Writing more bytes than requested would be impossible if dst < src, since it would corrupt src.
So using memcpy is safe even if there is overlap beyond the requested copy size. The order of the emitted loads and stores doesn't matter.
It seems to me all this is overengineering
Are you referring to my change right here to unconditionally copy 64 bytes, the existing code (which was already calling memmove), or my introduction of FixedSizeMemMove?
My only change at this location was to stop branching on size > 32 and just unconditionally copy all 64 bytes. The second copy is cheap enough to not be worth the overhead of making it conditional. This is of course cpu-dependent, but it at least seems likely to be the case for any modern CPU core that can have multiple loads/stores in flight concurrently. I only did this change on aarch64 since that is what I was tuning on, but it is probably also worth testing a similar change for the x86_64 version immediately above this.
So using memcpy is safe even if there is overlap beyond the requested copy size.
I had considered just using memcpy because in practice it should be fine for this use case; however, there were a few reasons I didn't:
- The justification for safety relies on memcpy copying bytes from front to back, but that isn't guaranteed. It would not be safe if some implementation decided to copy from back to front.
- According to the spec, it is immediate UB to call memcpy with overlapping buffers, not simply an unspecified result. This means that compilers could nasal demon you. That is probably unlikely here, but more problematic is that it would make it impossible to use UBSAN with snappy.
- I suspect (admittedly without testing) that the codegen for memcpy (which interleaves loads and stores) will have worse performance in this case. If there is overlap, a later load is no longer independent of earlier stores. This can trigger a store-to-load hazard in the CPU and prevent them from executing fully in parallel. To make it worse, this code will often be dealing with unaligned loads/stores*, which are fast in the normal case but have an even higher load-store penalty, often requiring a full flush of the store buffer. This is completely avoided by doing all loads before the stores. (Of course you could still have overlap with a prior call to MemCopy64, but that is much less problematic than having it within a single call.)
* In addition to having addresses that are not naturally aligned, the potentially bigger problem is that the loads would often be differently aligned from the preceding stores, meaning that a load would need to pull from multiple stores, which CPUs tend to punt on (or place severe restrictions on).
I'm referring to all the optimized code that does the copy, both your addition and the existing code. It's correct to use memcpy here. The order of the loads/stores does not matter; any order will work (i.e. your point 1 does not hold). As it happens, on AArch64 all loads are emitted before all stores, so store-load hazards cannot occur (i.e. point 3 is not an issue): https://godbolt.org/z/hva13E3z8
So it would be fine to prefer memmove on targets that emit it inline and use memcpy otherwise as the fallback (with UBSAN, still prefer memmove). If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC), then there would be no need for any of these hacks.
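As a sketch of that suggestion (SNAPPY_MEMMOVE_INLINES_WELL and SNAPPY_UBSAN_ENABLED are hypothetical build-time switches, not existing snappy macros):

#include <cstring>

// Prefer memmove where the compiler expands it inline (and always under
// UBSAN, since the ranges may overlap); otherwise fall back to memcpy.
inline void CopyWithPossibleOverlap(char* dst, const char* src, std::size_t n) {
#if defined(SNAPPY_MEMMOVE_INLINES_WELL) || defined(SNAPPY_UBSAN_ENABLED)
  std::memmove(dst, src, n);
#else
  std::memcpy(dst, src, n);
#endif
}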
If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC) then there would be no need for any of these hacks
Just to be clear, you are preaching to the choir. I have reported several bad codegen bugs to both gcc and clang. However, even if they were all fixed in trunk tomorrow, that doesn't help me, since I need to get good perf out of the compiler we use in production. There can be over a year of lag between something hitting gcc trunk and it appearing in a released compiler, since perf improvements are often not backported. Then I need to wait for us to upgrade our toolchain (which is why I'm testing against gcc 11.2 🙁). So there is still a need for perf hacks to convince current compilers to generate good code. I'll leave it up to the snappy maintainers whether they want to wait for compilers to improve or merge a change to get good perf out of existing compilers.
As it happens on AArch64 all loads are emitted before all stores so store-load hazards cannot occur
It looks like gcc has changed its codegen on trunk vs. the latest release. On released versions it does interleaved loads and stores for memcpy. I'm not sure that doing the loads upfront in memcpy is always better, especially after inlining into real code, since that requires more registers (more of a problem for longer memcpys and when inlined into functions already under register pressure).
The order of the loads/stores does not matter, any order will work (ie. your point 1 does not hold)
Hmm, I think you're right, my bad.
I still don't love the idea of intentionally calling memcpy with overlapping ranges. But if that is what the snappy maintainers decide to do, then so be it.
const size_t next_tag = is_literal ? (*tag >> 2) + 1 : tag_type;
*tag = ip[next_tag];
ip += (next_tag) + 1;
This is the change that caused the regression on validate. I checked that with this change it is generating the desired csinc instruction, which the comment above says is the intended codegen. Here is the perf with just this change reverted:
BM_UFlat/0 36099 ns 36098 ns 19334 bytes_per_second=2.64188G/s html
BM_UFlat/1 457307 ns 457297 ns 1530 bytes_per_second=1.42986G/s urls
BM_UFlat/2 3956 ns 3956 ns 176952 bytes_per_second=28.9819G/s jpg
BM_UFlat/3 160 ns 160 ns 4394167 bytes_per_second=1.16619G/s jpg_200
BM_UFlat/4 5788 ns 5788 ns 120704 bytes_per_second=16.478G/s pdf
BM_UFlat/5 155126 ns 155115 ns 4509 bytes_per_second=2.45927G/s html4
BM_UFlat/6 171845 ns 171839 ns 4073 bytes_per_second=844.067M/s txt1
BM_UFlat/7 154670 ns 154670 ns 4525 bytes_per_second=771.836M/s txt2
BM_UFlat/8 452391 ns 452362 ns 1547 bytes_per_second=899.687M/s txt3
BM_UFlat/9 649864 ns 649864 ns 1077 bytes_per_second=707.13M/s txt4
BM_UFlat/10 31264 ns 31263 ns 22349 bytes_per_second=3.53272G/s pb
BM_UFlat/11 148201 ns 148197 ns 4721 bytes_per_second=1.15833G/s gaviota
BM_UFlatMedley 2295899 ns 2295829 ns 305 bytes_per_second=1.188G/s
BM_UValidate/0 20800 ns 20799 ns 33000 bytes_per_second=4.5851G/s html
BM_UValidate/1 345447 ns 345427 ns 2020 bytes_per_second=1.89293G/s urls
BM_UValidate/2 138 ns 138 ns 5059972 bytes_per_second=828.941G/s jpg
BM_UValidate/3 96.5 ns 96.5 ns 7215460 bytes_per_second=1.93108G/s jpg_200
BM_UValidate/4 1972 ns 1972 ns 354938 bytes_per_second=48.3615G/s pdf
BM_UValidate/5 107941 ns 107936 ns 6443 bytes_per_second=3.53424G/s html4
BM_UValidate/6 126895 ns 126888 ns 5501 bytes_per_second=1.11629G/s txt1
BM_UValidate/7 114857 ns 114856 ns 6066 bytes_per_second=1039.39M/s txt2
BM_UValidate/8 334870 ns 334859 ns 2087 bytes_per_second=1.18691G/s txt3
BM_UValidate/9 486272 ns 486261 ns 1443 bytes_per_second=945.045M/s txt4
BM_UValidate/10 17879 ns 17878 ns 38501 bytes_per_second=6.17749G/s pb
BM_UValidate/11 92356 ns 92352 ns 7539 bytes_per_second=1.85876G/s gaviota
BM_UValidateMedley 1697835 ns 1697754 ns 412 bytes_per_second=1.6065G/s
BM_UIOVecSource/0 103097 ns 103092 ns 6599 bytes_per_second=947.274M/s html (22.24 %)
BM_UIOVecSource/1 1363926 ns 1363884 ns 512 bytes_per_second=490.923M/s urls (47.84 %)
BM_UIOVecSource/2 11296 ns 11296 ns 61854 bytes_per_second=10.1487G/s jpg (99.95 %)
BM_UIOVecSource/3 402 ns 402 ns 1740237 bytes_per_second=475.039M/s jpg_200 (73.00 %)
BM_UIOVecSource/4 15463 ns 15463 ns 44743 bytes_per_second=6.16753G/s pdf (83.31 %)
BM_UIOVecSource/5 442147 ns 442130 ns 1580 bytes_per_second=883.507M/s html4 (22.52 %)
BM_UIOVecSource/6 416784 ns 416773 ns 1672 bytes_per_second=348.015M/s txt1 (57.87 %)
BM_UIOVecSource/7 367446 ns 367440 ns 1897 bytes_per_second=324.897M/s txt2 (62.02 %)
BM_UIOVecSource/8 1121587 ns 1121523 ns 622 bytes_per_second=362.885M/s txt3 (55.17 %)
BM_UIOVecSource/9 1522044 ns 1521967 ns 459 bytes_per_second=301.937M/s txt4 (66.41 %)
BM_UIOVecSource/10 90576 ns 90574 ns 7349 bytes_per_second=1.21938G/s pb (19.61 %)
BM_UIOVecSource/11 354641 ns 354634 ns 1954 bytes_per_second=495.669M/s gaviota (37.73 %)
BM_UIOVecSink/0 102030 ns 102026 ns 6793 bytes_per_second=957.168M/s html
BM_UIOVecSink/1 954178 ns 954179 ns 732 bytes_per_second=701.715M/s urls
BM_UIOVecSink/2 4264 ns 4263 ns 164116 bytes_per_second=26.8888G/s jpg
BM_UIOVecSink/3 321 ns 321 ns 2178205 bytes_per_second=593.638M/s jpg_200
BM_UIOVecSink/4 9772 ns 9772 ns 70791 bytes_per_second=9.75947G/s pdf
BM_UFlatSink/0 36097 ns 36096 ns 19360 bytes_per_second=2.64205G/s html
BM_UFlatSink/1 457191 ns 457142 ns 1530 bytes_per_second=1.43034G/s urls
BM_UFlatSink/2 3920 ns 3919 ns 178767 bytes_per_second=29.2498G/s jpg
BM_UFlatSink/3 165 ns 165 ns 4253941 bytes_per_second=1.13175G/s jpg_200
BM_UFlatSink/4 5820 ns 5820 ns 119850 bytes_per_second=16.3866G/s pdf
BM_UFlatSink/5 154998 ns 154992 ns 4515 bytes_per_second=2.46122G/s html4
BM_UFlatSink/6 171897 ns 171891 ns 4075 bytes_per_second=843.809M/s txt1
BM_UFlatSink/7 154886 ns 154878 ns 4513 bytes_per_second=770.801M/s txt2
BM_UFlatSink/8 452911 ns 452904 ns 1545 bytes_per_second=898.611M/s txt3
BM_UFlatSink/9 649481 ns 649457 ns 1075 bytes_per_second=707.573M/s txt4
BM_UFlatSink/10 31247 ns 31246 ns 22371 bytes_per_second=3.53464G/s pb
BM_UFlatSink/11 148706 ns 148703 ns 4701 bytes_per_second=1.15439G/s gaviota
BM_ZFlat/0 96115 ns 96112 ns 7178 bytes_per_second=1016.07M/s html (22.24 %)
BM_ZFlat/1 1339282 ns 1339210 ns 521 bytes_per_second=499.968M/s urls (47.84 %)
BM_ZFlat/2 6655 ns 6654 ns 104791 bytes_per_second=17.2275G/s jpg (99.95 %)
BM_ZFlat/3 328 ns 328 ns 2136911 bytes_per_second=580.819M/s jpg_200 (73.00 %)
BM_ZFlat/4 11728 ns 11728 ns 58781 bytes_per_second=8.13163G/s pdf (83.31 %)
BM_ZFlat/5 424216 ns 424212 ns 1653 bytes_per_second=920.824M/s html4 (22.52 %)
BM_ZFlat/6 411154 ns 411134 ns 1696 bytes_per_second=352.789M/s txt1 (57.87 %)
BM_ZFlat/7 362120 ns 362110 ns 1925 bytes_per_second=329.679M/s txt2 (62.02 %)
BM_ZFlat/8 1106496 ns 1106418 ns 630 bytes_per_second=367.84M/s txt3 (55.17 %)
BM_ZFlat/9 1501500 ns 1501436 ns 465 bytes_per_second=306.066M/s txt4 (66.41 %)
BM_ZFlat/10 86265 ns 86262 ns 7658 bytes_per_second=1.28032G/s pb (19.61 %)
BM_ZFlat/11 347773 ns 347764 ns 1998 bytes_per_second=505.462M/s gaviota (37.73 %)
BM_ZFlatAll 5898864 ns 5898386 ns 117 bytes_per_second=473.503M/s 12 kTestDataFiles
BM_ZFlatIncreasingTableSize 33035 ns 33035 ns 20928 bytes_per_second=938.586M/s 7 tables
It is a mixed bag for Validate, but even in the cases where it regresses there, it is still faster for actual decompression.
Obviously, all of my claims of "faster" apply only to the core (graviton2/neoverse-n1) and compiler (gcc-11.2) that I tested on. I assume you will be able to run this through a gauntlet of platforms you care about prior to merging to ensure there are no regressions there.
}
}

#ifdef __aarch64__ // Implies neon support
Suggested change:
#ifdef __aarch64__ // Implies neon support
#if SNAPPY_HAVE_VECTOR_BYTE_SHUFFLE
Just in case there is a weird ARM platform that doesn't have the arm_neon header, it may make more sense to ifdef this on the same condition that V128 is defined on, on line 94 a few lines up.
Hi, sorry for the long delay. I promise to take a look and hopefully merge it this week. At Google we don't care much about gcc, and thus this was overlooked, but we understand the need for the community.
The current codebase appears to have been tuned under clang, and there are a handful of patterns that gcc doesn't recognize that are critical to perf. Benchmarks were run on a graviton2/neoverse-n1 core with gcc-11.2.
New Perf:
Old Perf:
Most tests are either the same or better (often MUCH better). The only regressions are some of the validation tests, especially the ones that are already well over 1GB/s. Some of the validations are also faster, including the ValidateMedley test (1.60 -> 1.94GB/s). I tracked that down to a single change, and there is already a comment saying that the intent is to get the new codegen.