Perf tuning for gcc + aarch64 #176
base: main
Conversation
@@ -233,9 +233,7 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
int shift = Bits::FindLSBSetNonZero64(xorval);
size_t matched_bytes = shift >> 3;
uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);
#ifndef __x86_64__
This block was moved to #else to cleanly add an aarch64 case.
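Schematically, the restructured dispatch looks roughly like this (a sketch only, bodies elided; see the actual diff for the exact code):

#if defined(__x86_64__)
  // existing x86 inline asm (cmov), unchanged
#elif defined(__aarch64__)
  // new aarch64 inline asm (csel) added by this patch
#else
  // the plain C++ fallback that previously sat under "#ifndef __x86_64__"
#endif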
#elif defined(__aarch64__)
asm("cmp %w[xorval], 0\n\t"
    "csel %x[a2], %[a3], %[a2], eq\n\t"
    : [a2] "+r"(a2)
    : [a3] "r"(a3), [xorval] "r"(xorval)
    : "cc");
I'm not sure if this is what clang was generating for the default code on aarch64, but this is the direct translation from the x86 asm. I verified that it is faster.
What does the generated code look like before/after? Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not, otherwise it will never get fixed!
Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.
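For illustration only, a minimal sketch of such a helper; the name ConditionalSelect64 is hypothetical and not part of the patch. The x86 asm mirrors the existing snappy code and the aarch64 asm mirrors the diff above:

#include <cstdint>

// Hypothetical helper: returns b when pred == 0, otherwise a, using a
// conditional move/select so the compiler cannot turn it back into a branch.
static inline uint64_t ConditionalSelect64(uint32_t pred, uint64_t a,
                                           uint64_t b) {
#if defined(__x86_64__)
  asm("testl %k[pred], %k[pred]\n\t"
      "cmovzq %[b], %[a]\n\t"
      : [a] "+r"(a)
      : [b] "r"(b), [pred] "r"(pred)
      : "cc");
  return a;
#elif defined(__aarch64__)
  asm("cmp %w[pred], 0\n\t"
      "csel %x[a], %[b], %[a], eq\n\t"
      : [a] "+r"(a)
      : [b] "r"(b), [pred] "r"(pred)
      : "cc");
  return a;
#else
  return pred == 0 ? b : a;  // Plain C++; the compiler may still branch here.
#endif
}

The call site in FindMatchLength would then be roughly a2 = ConditionalSelect64(static_cast<uint32_t>(xorval), a2, a3);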
Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.
I agree! However, this code was already using inline asm like this right here for x86, and I am trying to follow local style. All I'm doing is adding a direct translation of the x86 asm to aarch64.
Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not, otherwise it will never get fixed!
The code already has a comment right above in the x86 case describing why clang doesn't use cmov:
Lines 237 to 245 in 5ec5d16
// Ideally this would just be
//
// a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
//
// However clang correctly infers that the above statement participates on
// a critical data dependency chain and thus, unfortunately, refuses to
// use a conditional move (it's tuned to cut data dependencies). In this
// case there is a longer parallel chain anyway AND this will be fairly
// unpredictable.
BM_ZFlat/0 105175 ns 105174 ns 6539 bytes_per_second=928.519M/s html (22.24 %)
BM_ZFlat/1 1385523 ns 1385503 ns 504 bytes_per_second=483.263M/s urls (47.84 %)
BM_ZFlat/2 6700 ns 6699 ns 104479 bytes_per_second=17.1121G/s jpg (99.95 %)
BM_ZFlat/3 333 ns 333 ns 2101538 bytes_per_second=572.801M/s jpg_200 (73.00 %)
BM_ZFlat/4 11754 ns 11754 ns 58783 bytes_per_second=8.11358G/s pdf (83.31 %)
BM_ZFlat/5 453779 ns 453780 ns 1540 bytes_per_second=860.825M/s html4 (22.52 %)
BM_ZFlat/6 430039 ns 430031 ns 1623 bytes_per_second=337.286M/s txt1 (57.87 %)
BM_ZFlat/7 374322 ns 374315 ns 1865 bytes_per_second=318.929M/s txt2 (62.02 %)
BM_ZFlat/8 1159839 ns 1159756 ns 601 bytes_per_second=350.923M/s txt3 (55.17 %)
BM_ZFlat/9 1523393 ns 1523306 ns 459 bytes_per_second=301.672M/s txt4 (66.41 %)
BM_ZFlat/10 97944 ns 97940 ns 6771 bytes_per_second=1.12766G/s pb (19.61 %)
BM_ZFlat/11 370635 ns 370624 ns 1881 bytes_per_second=474.284M/s gaviota (37.73 %)
BM_ZFlatAll 6184921 ns 6184674 ns 111 bytes_per_second=451.585M/s 12 kTestDataFiles
The full results with the change are in the PR description, but to summarize, BM_ZFlatAll shows 473.433M/s, a 4.8% improvement. The pb case (protobufs?) was the most improved at 16%.
Based on that comment, I'm not sure that a bug report would necessarily be helpful here. It appears that while compilers are able to generate csel/cmov here, they currently decide not to because they have reason to suspect that, in the general case, it would be more harmful than helpful. In this specific case, however, there is local knowledge that a cmov/csel is worth it, and with a fairly large end-to-end impact. Extending from what the comment says (with some of my own guesses): one issue is that the desired code adds an unconditional dependency on a load that the compiler could otherwise sink into the branch and potentially avoid entirely. Here that load is very likely already in L1, but the compiler may not know that, and if it weren't, the cost would be severe. If that is part of the reason, it seems entirely reasonable for the compiler to be conservative about adding data dependencies on loads.
That said, I do plan to file some gcc bugs after this. In particular, the codegen for memmove-by-memcpy-through-temp-array is really bad and has even regressed from the last release.
What does the generated code look like before/after?
It is a bit hard to compare because this function is inlined into a larger one, and this change shuffles register allocations and renumbers labels, but here is the relevant bit from each version, with a few common instructions before and after (the old code branches with cbnz; the new code uses csel):
rbit x6, x1
clz x6, x6
asr w2, w6, 3
cbnz w1, .L79
ldr x5, [x9, 8]
.L79:
add w2, w2, 4
and w6, w6, 24
sub x4, x9, x4
sxtw x2, w2
add x1, x9, x2
lsr x6, x5, x6
.L80:
lsl w5, w4, 8
rbit x1, x8
clz x1, x1
and w7, w1, 24
lsr x1, x1, 3
add x1, x1, 4
ldr x14, [x4, 8]
add x4, x4, x1
cmp w8, 0
csel x2, x14, x2, eq
lsr x7, x2, x7
.L79:
lsl w2, w11, 8
Getting back to this, GCC14 fixes the memmove-by-memcpy-through-temp-array issue (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618): https://www.godbolt.org/z/qMraefr1K
There is a redundant stack reservation for larger sizes but this won't be an issue in non-trivial functions.
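For reference, the memmove-by-memcpy-through-temp-array pattern under discussion looks roughly like this (a sketch; the FixedSizeMemMove in the patch may differ in details):

#include <cstring>

// Sketch: copy kSize bytes through a stack temporary so that every load
// happens before any store; overlapping dst/src then works without calling
// the out-of-line libc memmove.
template <int kSize>
inline void FixedSizeMemMove(void* dst, const void* src) {
  char tmp[kSize];
  std::memcpy(tmp, src, kSize);  // all loads first
  std::memcpy(dst, tmp, kSize);  // then all stores
}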
// GCC dispatches to libc for memmoves > 16 bytes, so we need to
// do some work to get good code from that compiler. Clang handles
// powers-of-2 at least up to 64 well.
Sigh. This is really annoying and I don't feel good about this code. However, the perf impact is massive.
https://www.godbolt.org/z/EjP9zv3e7
It looks like gcc is better for aarch64 in trunk, but not in any released version (as of today), and not for x86_64 either.
#elif defined(__aarch64__)
  // Empirically it is faster to just copy all 64 rather than branching.
  (void)kShortMemCopy;
  (void)size;
  FixedSizeMemMove<64>(dst, src);
This is the only novel part of the patch that isn't just massaging the code to do the same thing slightly differently. If you'd prefer, I can pull it out to a separate PR. I also tried move32(); if (size > 32) moveAnother32() (rather than dispatching to variable-length memmove in that case), and the unconditional version was still faster.
Note that this makes the FixedSizeMemMove<32> specialization dead code. So it may make sense to remove it if you keep this change. Alternatively, maybe keep it in case you add another fixed-size memmove of 32 bytes.
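For concreteness, the two shapes being compared look roughly like this; these are hypothetical illustrations (assuming the FixedSizeMemMove helper sketched earlier and char* pointers), not code from the patch:

#include <cstddef>

// Variant that was tried and measured slower: copy 32, branch, maybe copy more.
inline void CopyUpTo64Branching(char* dst, const char* src, std::size_t size) {
  FixedSizeMemMove<32>(dst, src);
  if (size > 32) FixedSizeMemMove<32>(dst + 32, src + 32);
}

// Variant the diff uses: unconditionally copy all 64 bytes.
inline void CopyUpTo64Unconditional(char* dst, const char* src, std::size_t) {
  FixedSizeMemMove<64>(dst, src);
}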
It seems to me all this is overengineering: we know that src < dst, since it's typical for a decompressor to copy earlier bytes to append to the end of a buffer. Writing more bytes than requested would be impossible if dst < src, since it would corrupt src.
So using memcpy is safe even if there is overlap beyond the requested copy size. The order of the emitted loads and stores doesn't matter.
It seems to me all this is overengineering
Are you referring to my change right here to unconditionally copy 64 bytes, the existing code (which was already calling memmove), or my introduction of FixedSizeMemMove?
My only change at this location was to stop branching on size > 32 and just unconditionally copy all 64 bytes. The second copy is cheap enough to not be worth the overhead of making it conditional. This is of course cpu-dependent, but it at least seems likely to be the case for any modern CPU core that can have multiple loads/stores in flight concurrently. I only did this change on aarch64 since that is what I was tuning on, but it is probably also worth testing a similar change for the x86_64 version immediately above this.
So using memcpy is safe even if there is overlap beyond the requested copy size.
I had considered just using memcpy because in practice it should be fine for this use case; however, there were a few reasons I didn't:
- The justification for safety relies on memcpy copying bytes from front to back, but that isn't guaranteed. It would not be safe if some implementation decided to copy from back to front.
- According to the spec, it is immediate UB to call memcpy with overlapping buffers, not simply an unspecified result. This means that compilers could nasal demon you. That is probably unlikely here, but more problematic is that it would make it impossible to use UBSAN with snappy.
- I suspect (admittedly without testing) that the codegen for memcpy (which interleaves loads and stores) will have worse performance in this case. If there is overlap, a later load is no longer independent of earlier stores. This can trigger a store-to-load hazard in the CPU and prevent them from executing fully in parallel. To make it worse, this code will often be dealing with unaligned loads/stores*, which are fast in the normal case but have an even higher load-store penalty, often requiring a full flush of the store buffer. This is completely avoided by doing all loads before the stores. (Of course you could still have overlap with a prior call to MemCopy64, but that is much less problematic than having it within a single call.)
* In addition to having addresses that are not naturally aligned, the potentially bigger problem is that the loads would often be differently aligned from the preceding stores, meaning that a load would need to pull from multiple stores, which CPUs tend to punt on (or place severe restrictions on).
I'm referring to all the optimized code that does the copy, both your addition and the existing code. It's correct to use memcpy here. The order of the loads/stores does not matter; any order will work (i.e. your point 1 does not hold). As it happens, on AArch64 all loads are emitted before all stores, so store-load hazards cannot occur (i.e. point 3 is not an issue): https://godbolt.org/z/hva13E3z8
So it would be fine to prefer memmove on targets that emit it inline and use memcpy otherwise as the fallback (with UBSAN, still prefer memmove). If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC), then there would be no need for any of these hacks.
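As a sketch of that suggestion (SNAPPY_MEMMOVE_INLINES_WELL and SNAPPY_UBSAN_ENABLED are hypothetical build-time switches, not existing snappy macros):

#include <cstring>

// Prefer memmove where the compiler expands it inline (and always under
// UBSAN, since the ranges may overlap); otherwise fall back to memcpy.
inline void CopyWithPossibleOverlap(char* dst, const char* src, std::size_t n) {
#if defined(SNAPPY_MEMMOVE_INLINES_WELL) || defined(SNAPPY_UBSAN_ENABLED)
  std::memmove(dst, src, n);
#else
  std::memcpy(dst, src, n);
#endif
}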
If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC) then there would be no need for any of these hacks
Just to be clear, you are preaching to the choir. I have reported several bad codegen bugs to both gcc and clang. However, even if they were all fixed in trunk tomorrow, that doesn't help me, since I need to get good perf out of the compiler we use in production. There can be over a year of lag between something hitting gcc trunk and it appearing in a released compiler, since perf improvements are often not backported. Then I need to wait for us to upgrade our toolchain (which is why I'm testing against gcc 11.2 🙁). So there is still a need for perf hacks to convince current compilers to generate good code. I'll leave it up to the snappy maintainers whether they want to wait for compilers to improve or merge a change to get good perf out of existing compilers.
As it happens on AArch64 all loads are emitted before all stores so store-load hazards cannot occur
It looks like gcc has changed its codegen on trunk vs. the latest release. On released versions it does interleaved loads and stores for memcpy. I'm not sure that doing the loads upfront in memcpy is always better, especially after inlining into real code, since that requires more registers (more of a problem for longer memcpys and when inlined into functions already under register pressure).
The order of the loads/stores does not matter, any order will work (ie. your point 1 does not hold)
Hmm, I think you're right, my bad.
I still don't love the idea of intentionally calling memcpy with overlapping ranges. But if that is what the snappy maintainers decide to do, then so be it.
const size_t next_tag = is_literal ? (*tag >> 2) + 1 : tag_type;
*tag = ip[next_tag];
ip += (next_tag) + 1;
This is the change that caused the regression on validate. I checked that with this change it is generating the desired csinc instruction, which the comment above says is the intended codegen. Here is the perf with just this change reverted:
BM_UFlat/0 36099 ns 36098 ns 19334 bytes_per_second=2.64188G/s html
BM_UFlat/1 457307 ns 457297 ns 1530 bytes_per_second=1.42986G/s urls
BM_UFlat/2 3956 ns 3956 ns 176952 bytes_per_second=28.9819G/s jpg
BM_UFlat/3 160 ns 160 ns 4394167 bytes_per_second=1.16619G/s jpg_200
BM_UFlat/4 5788 ns 5788 ns 120704 bytes_per_second=16.478G/s pdf
BM_UFlat/5 155126 ns 155115 ns 4509 bytes_per_second=2.45927G/s html4
BM_UFlat/6 171845 ns 171839 ns 4073 bytes_per_second=844.067M/s txt1
BM_UFlat/7 154670 ns 154670 ns 4525 bytes_per_second=771.836M/s txt2
BM_UFlat/8 452391 ns 452362 ns 1547 bytes_per_second=899.687M/s txt3
BM_UFlat/9 649864 ns 649864 ns 1077 bytes_per_second=707.13M/s txt4
BM_UFlat/10 31264 ns 31263 ns 22349 bytes_per_second=3.53272G/s pb
BM_UFlat/11 148201 ns 148197 ns 4721 bytes_per_second=1.15833G/s gaviota
BM_UFlatMedley 2295899 ns 2295829 ns 305 bytes_per_second=1.188G/s
BM_UValidate/0 20800 ns 20799 ns 33000 bytes_per_second=4.5851G/s html
BM_UValidate/1 345447 ns 345427 ns 2020 bytes_per_second=1.89293G/s urls
BM_UValidate/2 138 ns 138 ns 5059972 bytes_per_second=828.941G/s jpg
BM_UValidate/3 96.5 ns 96.5 ns 7215460 bytes_per_second=1.93108G/s jpg_200
BM_UValidate/4 1972 ns 1972 ns 354938 bytes_per_second=48.3615G/s pdf
BM_UValidate/5 107941 ns 107936 ns 6443 bytes_per_second=3.53424G/s html4
BM_UValidate/6 126895 ns 126888 ns 5501 bytes_per_second=1.11629G/s txt1
BM_UValidate/7 114857 ns 114856 ns 6066 bytes_per_second=1039.39M/s txt2
BM_UValidate/8 334870 ns 334859 ns 2087 bytes_per_second=1.18691G/s txt3
BM_UValidate/9 486272 ns 486261 ns 1443 bytes_per_second=945.045M/s txt4
BM_UValidate/10 17879 ns 17878 ns 38501 bytes_per_second=6.17749G/s pb
BM_UValidate/11 92356 ns 92352 ns 7539 bytes_per_second=1.85876G/s gaviota
BM_UValidateMedley 1697835 ns 1697754 ns 412 bytes_per_second=1.6065G/s
BM_UIOVecSource/0 103097 ns 103092 ns 6599 bytes_per_second=947.274M/s html (22.24 %)
BM_UIOVecSource/1 1363926 ns 1363884 ns 512 bytes_per_second=490.923M/s urls (47.84 %)
BM_UIOVecSource/2 11296 ns 11296 ns 61854 bytes_per_second=10.1487G/s jpg (99.95 %)
BM_UIOVecSource/3 402 ns 402 ns 1740237 bytes_per_second=475.039M/s jpg_200 (73.00 %)
BM_UIOVecSource/4 15463 ns 15463 ns 44743 bytes_per_second=6.16753G/s pdf (83.31 %)
BM_UIOVecSource/5 442147 ns 442130 ns 1580 bytes_per_second=883.507M/s html4 (22.52 %)
BM_UIOVecSource/6 416784 ns 416773 ns 1672 bytes_per_second=348.015M/s txt1 (57.87 %)
BM_UIOVecSource/7 367446 ns 367440 ns 1897 bytes_per_second=324.897M/s txt2 (62.02 %)
BM_UIOVecSource/8 1121587 ns 1121523 ns 622 bytes_per_second=362.885M/s txt3 (55.17 %)
BM_UIOVecSource/9 1522044 ns 1521967 ns 459 bytes_per_second=301.937M/s txt4 (66.41 %)
BM_UIOVecSource/10 90576 ns 90574 ns 7349 bytes_per_second=1.21938G/s pb (19.61 %)
BM_UIOVecSource/11 354641 ns 354634 ns 1954 bytes_per_second=495.669M/s gaviota (37.73 %)
BM_UIOVecSink/0 102030 ns 102026 ns 6793 bytes_per_second=957.168M/s html
BM_UIOVecSink/1 954178 ns 954179 ns 732 bytes_per_second=701.715M/s urls
BM_UIOVecSink/2 4264 ns 4263 ns 164116 bytes_per_second=26.8888G/s jpg
BM_UIOVecSink/3 321 ns 321 ns 2178205 bytes_per_second=593.638M/s jpg_200
BM_UIOVecSink/4 9772 ns 9772 ns 70791 bytes_per_second=9.75947G/s pdf
BM_UFlatSink/0 36097 ns 36096 ns 19360 bytes_per_second=2.64205G/s html
BM_UFlatSink/1 457191 ns 457142 ns 1530 bytes_per_second=1.43034G/s urls
BM_UFlatSink/2 3920 ns 3919 ns 178767 bytes_per_second=29.2498G/s jpg
BM_UFlatSink/3 165 ns 165 ns 4253941 bytes_per_second=1.13175G/s jpg_200
BM_UFlatSink/4 5820 ns 5820 ns 119850 bytes_per_second=16.3866G/s pdf
BM_UFlatSink/5 154998 ns 154992 ns 4515 bytes_per_second=2.46122G/s html4
BM_UFlatSink/6 171897 ns 171891 ns 4075 bytes_per_second=843.809M/s txt1
BM_UFlatSink/7 154886 ns 154878 ns 4513 bytes_per_second=770.801M/s txt2
BM_UFlatSink/8 452911 ns 452904 ns 1545 bytes_per_second=898.611M/s txt3
BM_UFlatSink/9 649481 ns 649457 ns 1075 bytes_per_second=707.573M/s txt4
BM_UFlatSink/10 31247 ns 31246 ns 22371 bytes_per_second=3.53464G/s pb
BM_UFlatSink/11 148706 ns 148703 ns 4701 bytes_per_second=1.15439G/s gaviota
BM_ZFlat/0 96115 ns 96112 ns 7178 bytes_per_second=1016.07M/s html (22.24 %)
BM_ZFlat/1 1339282 ns 1339210 ns 521 bytes_per_second=499.968M/s urls (47.84 %)
BM_ZFlat/2 6655 ns 6654 ns 104791 bytes_per_second=17.2275G/s jpg (99.95 %)
BM_ZFlat/3 328 ns 328 ns 2136911 bytes_per_second=580.819M/s jpg_200 (73.00 %)
BM_ZFlat/4 11728 ns 11728 ns 58781 bytes_per_second=8.13163G/s pdf (83.31 %)
BM_ZFlat/5 424216 ns 424212 ns 1653 bytes_per_second=920.824M/s html4 (22.52 %)
BM_ZFlat/6 411154 ns 411134 ns 1696 bytes_per_second=352.789M/s txt1 (57.87 %)
BM_ZFlat/7 362120 ns 362110 ns 1925 bytes_per_second=329.679M/s txt2 (62.02 %)
BM_ZFlat/8 1106496 ns 1106418 ns 630 bytes_per_second=367.84M/s txt3 (55.17 %)
BM_ZFlat/9 1501500 ns 1501436 ns 465 bytes_per_second=306.066M/s txt4 (66.41 %)
BM_ZFlat/10 86265 ns 86262 ns 7658 bytes_per_second=1.28032G/s pb (19.61 %)
BM_ZFlat/11 347773 ns 347764 ns 1998 bytes_per_second=505.462M/s gaviota (37.73 %)
BM_ZFlatAll 5898864 ns 5898386 ns 117 bytes_per_second=473.503M/s 12 kTestDataFiles
BM_ZFlatIncreasingTableSize 33035 ns 33035 ns 20928 bytes_per_second=938.586M/s 7 tables
It is a mixed bag for Validate, but even in the cases where it regresses there, it is still faster for actual decompression.
Obviously, all of my claims of "faster" apply only to the core (graviton2/neoverse-n1) and compiler (gcc-11.2) that I tested on. I assume you will be able to run this through a gauntlet of platforms you care about prior to merging to ensure there are no regressions there.
}
}

#ifdef __aarch64__ // Implies neon support
Suggested change:
#ifdef __aarch64__ // Implies neon support
#if SNAPPY_HAVE_VECTOR_BYTE_SHUFFLE
Just in case there is a weird ARM platform that doesn't have the arm_neon header, it may make more sense to ifdef this on the same condition that V128 is defined on, on line 94 a few lines up.
Hi, sorry for the long delay. I promise to take a look and hopefully merge it this week. At Google we don't care much about gcc, and thus this was overlooked, but we understand the need for the community.
The current codebase appears to have been tuned under clang, and there are a handful of patterns that gcc doesn't recognize that are critical to perf. Benchmarks were run on a graviton2/neoverse-n1 core with gcc-11.2.
New Perf:
Old Perf:
Most tests are either the same or better (often MUCH better). The only regressions are some of the validation tests, especially the ones that are already well over 1GB/s. Some of the validations are also faster, including the ValidateMedley test (1.60 -> 1.94GB/s). I tracked that down to a single change, and there is already a comment saying that the intent is to get the new codegen.