Perf tuning for gcc + aarch64 #176

Open · wants to merge 1 commit into base: main

Conversation

RedBeard0531:

The current codebase appears to have been tuned under clang, and there are a handful of patterns that gcc doesn't recognize that are critical to perf. Benchmarks were run on a graviton2/neoverse-n1 core with gcc-11.2.

New Perf:

2024-01-18T15:16:45+00:00
Running build/snappy_benchmark
Run on (16 X 243.75 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x16)
  L1 Instruction 64 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x1)
Load Average: 0.14, 0.45, 0.44
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_UFlat/0                       30579 ns        30578 ns        22886 bytes_per_second=3.11878G/s html
BM_UFlat/1                      332963 ns       332964 ns         2102 bytes_per_second=1.96378G/s urls
BM_UFlat/2                        3957 ns         3957 ns       176929 bytes_per_second=28.9678G/s jpg
BM_UFlat/3                         158 ns          158 ns      4420971 bytes_per_second=1.17675G/s jpg_200
BM_UFlat/4                        5864 ns         5864 ns       119168 bytes_per_second=16.2639G/s pdf
BM_UFlat/5                      123479 ns       123474 ns         5669 bytes_per_second=3.08947G/s html4
BM_UFlat/6                      131128 ns       131126 ns         5337 bytes_per_second=1106.14M/s txt1
BM_UFlat/7                      115026 ns       115024 ns         6085 bytes_per_second=1037.87M/s txt2
BM_UFlat/8                      344359 ns       344347 ns         2033 bytes_per_second=1.1542G/s txt3
BM_UFlat/9                      475850 ns       475842 ns         1471 bytes_per_second=965.738M/s txt4
BM_UFlat/10                      26289 ns        26289 ns        26632 bytes_per_second=4.20121G/s pb
BM_UFlat/11                     140067 ns       140067 ns         4995 bytes_per_second=1.22556G/s gaviota
BM_UFlatMedley                 1738983 ns      1738950 ns          403 bytes_per_second=1.56844G/s
BM_UValidate/0                   24227 ns        24227 ns        28890 bytes_per_second=3.93646G/s html
BM_UValidate/1                  268613 ns       268608 ns         2606 bytes_per_second=2.43429G/s urls
BM_UValidate/2                     145 ns          145 ns      4721927 bytes_per_second=790.638G/s jpg
BM_UValidate/3                    98.2 ns         98.2 ns      7089287 bytes_per_second=1.89654G/s jpg_200
BM_UValidate/4                    2411 ns         2411 ns       290391 bytes_per_second=39.5581G/s pdf
BM_UValidate/5                   98246 ns        98244 ns         7125 bytes_per_second=3.88288G/s html4
BM_UValidate/6                  107937 ns       107935 ns         6485 bytes_per_second=1.31231G/s txt1
BM_UValidate/7                   94753 ns        94751 ns         7386 bytes_per_second=1.23041G/s txt2
BM_UValidate/8                  284073 ns       284061 ns         2464 bytes_per_second=1.39916G/s txt3
BM_UValidate/9                  393234 ns       393226 ns         1780 bytes_per_second=1.14125G/s txt4
BM_UValidate/10                  21127 ns        21127 ns        33140 bytes_per_second=5.2276G/s pb
BM_UValidate/11                 104744 ns       104743 ns         6661 bytes_per_second=1.63889G/s gaviota
BM_UValidateMedley             1404832 ns      1404792 ns          498 bytes_per_second=1.94153G/s
BM_UIOVecSource/0               100416 ns       100410 ns         6875 bytes_per_second=972.58M/s html (22.24 %)
BM_UIOVecSource/1              1356688 ns      1356630 ns          514 bytes_per_second=493.548M/s urls (47.84 %)
BM_UIOVecSource/2                11285 ns        11284 ns        62025 bytes_per_second=10.1591G/s jpg (99.95 %)
BM_UIOVecSource/3                  403 ns          403 ns      1737215 bytes_per_second=473.248M/s jpg_200 (73.00 %)
BM_UIOVecSource/4                15480 ns        15479 ns        44600 bytes_per_second=6.16095G/s pdf (83.31 %)
BM_UIOVecSource/5               443525 ns       443526 ns         1579 bytes_per_second=880.726M/s html4 (22.52 %)
BM_UIOVecSource/6               416887 ns       416888 ns         1674 bytes_per_second=347.919M/s txt1 (57.87 %)
BM_UIOVecSource/7               367795 ns       367788 ns         1897 bytes_per_second=324.589M/s txt2 (62.02 %)
BM_UIOVecSource/8              1122323 ns      1122296 ns          621 bytes_per_second=362.635M/s txt3 (55.17 %)
BM_UIOVecSource/9              1521115 ns      1521100 ns          459 bytes_per_second=302.109M/s txt4 (66.41 %)
BM_UIOVecSource/10               90887 ns        90884 ns         7284 bytes_per_second=1.21522G/s pb (19.61 %)
BM_UIOVecSource/11              355515 ns       355494 ns         1955 bytes_per_second=494.47M/s gaviota (37.73 %)
BM_UIOVecSink/0                 102119 ns       102117 ns         6789 bytes_per_second=956.317M/s html
BM_UIOVecSink/1                 954436 ns       954395 ns          732 bytes_per_second=701.557M/s urls
BM_UIOVecSink/2                   4269 ns         4268 ns       163990 bytes_per_second=26.8576G/s jpg
BM_UIOVecSink/3                    322 ns          322 ns      2176041 bytes_per_second=592.941M/s jpg_200
BM_UIOVecSink/4                   9806 ns         9806 ns        70545 bytes_per_second=9.72553G/s pdf
BM_UFlatSink/0                   30576 ns        30575 ns        22895 bytes_per_second=3.11908G/s html
BM_UFlatSink/1                  333363 ns       333353 ns         2100 bytes_per_second=1.96149G/s urls
BM_UFlatSink/2                    3912 ns         3912 ns       179331 bytes_per_second=29.3047G/s jpg
BM_UFlatSink/3                     164 ns          164 ns      4279427 bytes_per_second=1.13813G/s jpg_200
BM_UFlatSink/4                    5887 ns         5887 ns       118969 bytes_per_second=16.1991G/s pdf
BM_UFlatSink/5                  123255 ns       123252 ns         5678 bytes_per_second=3.09504G/s html4
BM_UFlatSink/6                  131130 ns       131125 ns         5337 bytes_per_second=1106.15M/s txt1
BM_UFlatSink/7                  115091 ns       115091 ns         6082 bytes_per_second=1037.27M/s txt2
BM_UFlatSink/8                  344551 ns       344539 ns         2032 bytes_per_second=1.15356G/s txt3
BM_UFlatSink/9                  476420 ns       476406 ns         1470 bytes_per_second=964.595M/s txt4
BM_UFlatSink/10                  26246 ns        26245 ns        26672 bytes_per_second=4.20819G/s pb
BM_UFlatSink/11                 140879 ns       140876 ns         4969 bytes_per_second=1.21853G/s gaviota
BM_ZFlat/0                       96401 ns        96399 ns         7125 bytes_per_second=1013.05M/s html (22.24 %)
BM_ZFlat/1                     1340286 ns      1340246 ns          521 bytes_per_second=499.582M/s urls (47.84 %)
BM_ZFlat/2                        6642 ns         6642 ns       104924 bytes_per_second=17.2593G/s jpg (99.95 %)
BM_ZFlat/3                         328 ns          328 ns      2132317 bytes_per_second=581.917M/s jpg_200 (73.00 %)
BM_ZFlat/4                       11717 ns        11717 ns        59042 bytes_per_second=8.13935G/s pdf (83.31 %)
BM_ZFlat/5                      426496 ns       426486 ns         1640 bytes_per_second=915.914M/s html4 (22.52 %)
BM_ZFlat/6                      410613 ns       410608 ns         1699 bytes_per_second=353.241M/s txt1 (57.87 %)
BM_ZFlat/7                      362311 ns       362306 ns         1925 bytes_per_second=329.501M/s txt2 (62.02 %)
BM_ZFlat/8                     1107257 ns      1107210 ns          630 bytes_per_second=367.577M/s txt3 (55.17 %)
BM_ZFlat/9                     1501667 ns      1501631 ns          465 bytes_per_second=306.026M/s txt4 (66.41 %)
BM_ZFlat/10                      84528 ns        84526 ns         7952 bytes_per_second=1.30663G/s pb (19.61 %)
BM_ZFlat/11                     344650 ns       344638 ns         2015 bytes_per_second=510.046M/s gaviota (37.73 %)
BM_ZFlatAll                    5899360 ns      5899251 ns          117 bytes_per_second=473.433M/s 12 kTestDataFiles
BM_ZFlatIncreasingTableSize      33096 ns        33095 ns        20889 bytes_per_second=936.885M/s 7 tables

Old Perf:

BM_UFlat/0                       69285 ns        69281 ns        10087 bytes_per_second=1.37653G/s html
BM_UFlat/1                      766096 ns       766034 ns          912 bytes_per_second=874.063M/s urls
BM_UFlat/2                        4035 ns         4034 ns       173510 bytes_per_second=28.4167G/s jpg
BM_UFlat/3                         187 ns          187 ns      3735588 bytes_per_second=1018.67M/s jpg_200
BM_UFlat/4                       10363 ns        10362 ns        67529 bytes_per_second=9.20333G/s pdf
BM_UFlat/5                      281146 ns       281130 ns         2490 bytes_per_second=1.35692G/s html4
BM_UFlat/6                      268610 ns       268601 ns         2606 bytes_per_second=539.996M/s txt1
BM_UFlat/7                      239866 ns       239857 ns         2918 bytes_per_second=497.713M/s txt2
BM_UFlat/8                      708861 ns       708790 ns          988 bytes_per_second=574.196M/s txt3
BM_UFlat/9                      997750 ns       997631 ns          702 bytes_per_second=460.63M/s txt4
BM_UFlat/10                      64658 ns        64652 ns        10825 bytes_per_second=1.70827G/s pb
BM_UFlat/11                     246468 ns       246451 ns         2841 bytes_per_second=713.251M/s gaviota
BM_UFlatMedley                 3675635 ns      3675264 ns          190 bytes_per_second=759.919M/s
BM_UValidate/0                   21020 ns        21018 ns        32769 bytes_per_second=4.53746G/s html
BM_UValidate/1                  343949 ns       343920 ns         2032 bytes_per_second=1.90122G/s urls
BM_UValidate/2                     138 ns          138 ns      5060369 bytes_per_second=828.669G/s jpg
BM_UValidate/3                    96.5 ns         96.5 ns      7259557 bytes_per_second=1.93099G/s jpg_200
BM_UValidate/4                    1971 ns         1971 ns       355117 bytes_per_second=48.3824G/s pdf
BM_UValidate/5                  107708 ns       107704 ns         6453 bytes_per_second=3.54185G/s html4
BM_UValidate/6                  126669 ns       126662 ns         5503 bytes_per_second=1.11829G/s txt1
BM_UValidate/7                  115561 ns       115555 ns         6048 bytes_per_second=1033.1M/s txt2
BM_UValidate/8                  335215 ns       335177 ns         2084 bytes_per_second=1.18578G/s txt3
BM_UValidate/9                  485964 ns       485937 ns         1440 bytes_per_second=945.675M/s txt4
BM_UValidate/10                  17975 ns        17974 ns        38397 bytes_per_second=6.14461G/s pb
BM_UValidate/11                  92493 ns        92484 ns         7493 bytes_per_second=1.85612G/s gaviota
BM_UValidateMedley             1698160 ns      1697991 ns          412 bytes_per_second=1.60628G/s
BM_UIOVecSource/0               110784 ns       110774 ns         6189 bytes_per_second=881.582M/s html (22.24 %)
BM_UIOVecSource/1              1404417 ns      1404231 ns          497 bytes_per_second=476.818M/s urls (47.84 %)
BM_UIOVecSource/2                11290 ns        11290 ns        61909 bytes_per_second=10.1543G/s jpg (99.95 %)
BM_UIOVecSource/3                  401 ns          401 ns      1740294 bytes_per_second=475.167M/s jpg_200 (73.00 %)
BM_UIOVecSource/4                15506 ns        15505 ns        44572 bytes_per_second=6.15058G/s pdf (83.31 %)
BM_UIOVecSource/5               468629 ns       468574 ns         1489 bytes_per_second=833.645M/s html4 (22.52 %)
BM_UIOVecSource/6               435987 ns       435960 ns         1602 bytes_per_second=332.699M/s txt1 (57.87 %)
BM_UIOVecSource/7               379021 ns       378988 ns         1842 bytes_per_second=314.997M/s txt2 (62.02 %)
BM_UIOVecSource/8              1176435 ns      1176304 ns          593 bytes_per_second=345.986M/s txt3 (55.17 %)
BM_UIOVecSource/9              1542281 ns      1542143 ns          453 bytes_per_second=297.987M/s txt4 (66.41 %)
BM_UIOVecSource/10              100702 ns       100698 ns         6639 bytes_per_second=1123.1M/s pb (19.61 %)
BM_UIOVecSource/11              378524 ns       378494 ns         1843 bytes_per_second=464.423M/s gaviota (37.73 %)
BM_UIOVecSink/0                 104544 ns       104537 ns         6634 bytes_per_second=934.175M/s html
BM_UIOVecSink/1                 953243 ns       953153 ns          732 bytes_per_second=702.471M/s urls
BM_UIOVecSink/2                   4278 ns         4278 ns       163680 bytes_per_second=26.7987G/s jpg
BM_UIOVecSink/3                    321 ns          321 ns      2178453 bytes_per_second=593.574M/s jpg_200
BM_UIOVecSink/4                   9939 ns         9939 ns        69671 bytes_per_second=9.59571G/s pdf
BM_UFlatSink/0                   69330 ns        69325 ns        10097 bytes_per_second=1.37566G/s html
BM_UFlatSink/1                  766792 ns       766679 ns          911 bytes_per_second=873.328M/s urls
BM_UFlatSink/2                    3989 ns         3989 ns       175402 bytes_per_second=28.7414G/s jpg
BM_UFlatSink/3                     193 ns          193 ns      3632512 bytes_per_second=990.14M/s jpg_200
BM_UFlatSink/4                   10414 ns        10413 ns        67175 bytes_per_second=9.15846G/s pdf
BM_UFlatSink/5                  281174 ns       281148 ns         2490 bytes_per_second=1.35683G/s html4
BM_UFlatSink/6                  268597 ns       268580 ns         2606 bytes_per_second=540.038M/s txt1
BM_UFlatSink/7                  239854 ns       239832 ns         2919 bytes_per_second=497.765M/s txt2
BM_UFlatSink/8                  709052 ns       709044 ns          987 bytes_per_second=573.99M/s txt3
BM_UFlatSink/9                  998252 ns       998199 ns          701 bytes_per_second=460.368M/s txt4
BM_UFlatSink/10                  64567 ns        64561 ns        10834 bytes_per_second=1.71068G/s pb
BM_UFlatSink/11                 246419 ns       246401 ns         2840 bytes_per_second=713.396M/s gaviota
BM_ZFlat/0                      105792 ns       105784 ns         6501 bytes_per_second=923.169M/s html (22.24 %)
BM_ZFlat/1                     1385136 ns      1385007 ns          503 bytes_per_second=483.436M/s urls (47.84 %)
BM_ZFlat/2                        6665 ns         6665 ns       104897 bytes_per_second=17.2005G/s jpg (99.95 %)
BM_ZFlat/3                         333 ns          333 ns      2100000 bytes_per_second=572.702M/s jpg_200 (73.00 %)
BM_ZFlat/4                       11734 ns        11733 ns        58718 bytes_per_second=8.12806G/s pdf (83.31 %)
BM_ZFlat/5                      454034 ns       454014 ns         1539 bytes_per_second=860.381M/s html4 (22.52 %)
BM_ZFlat/6                      429958 ns       429920 ns         1622 bytes_per_second=337.373M/s txt1 (57.87 %)
BM_ZFlat/7                      374359 ns       374345 ns         1865 bytes_per_second=318.904M/s txt2 (62.02 %)
BM_ZFlat/8                     1161529 ns      1161412 ns          601 bytes_per_second=350.422M/s txt3 (55.17 %)
BM_ZFlat/9                     1522178 ns      1522169 ns          459 bytes_per_second=301.897M/s txt4 (66.41 %)
BM_ZFlat/10                      98833 ns        98829 ns         6839 bytes_per_second=1.11752G/s pb (19.61 %)
BM_ZFlat/11                     371033 ns       371013 ns         1882 bytes_per_second=473.787M/s gaviota (37.73 %)
BM_ZFlatAll                    6109015 ns      6108341 ns          113 bytes_per_second=457.228M/s 12 kTestDataFiles
BM_ZFlatIncreasingTableSize      34125 ns        34123 ns        20220 bytes_per_second=908.652M/s 7 tables

Most tests are either the same or better (often MUCH better). The only regressions are some of the validation tests, especially the ones that are already well over 1GB/s. Some of the validations are also faster, including the ValidateMedley test (1.60 -> 1.94GB/s). I tracked that down to a single change, and there is already a comment saying that the intent is to get the new codegen.

@@ -233,9 +233,7 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
int shift = Bits::FindLSBSetNonZero64(xorval);
size_t matched_bytes = shift >> 3;
uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);
#ifndef __x86_64__
RedBeard0531 (Author):

This block was moved to #else to cleanly add an aarch64 case.

Comment on lines +251 to +256
#elif defined(__aarch64__)
asm("cmp %w[xorval], 0\n\t"
"csel %x[a2], %[a3], %[a2], eq\n\t"
: [a2] "+r" (a2)
: [a3] "r" (a3) , [xorval] "r" (xorval)
: "cc");
RedBeard0531 (Author):

I'm not sure whether this is what clang was generating for the default code on aarch64, but it is a direct translation of the x86 asm. I verified that it is faster.

Reviewer:

What does the generated code look like before/after? Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not; otherwise it will never get fixed!

Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.

RedBeard0531 (Author):

> Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.

I agree! However, this code was already using inline asm like this right here for x86, and I am trying to follow local style. All I'm doing is adding a direct translation of the x86 asm to aarch64.
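
For reference, the existing x86 inline asm being mirrored looks roughly like this (reconstructed from snappy-internal.h; exact constraints may differ):

asm("testl %k2, %k2\n\t"
    "cmovzq %1, %0\n\t"
    : "+r"(a2)
    : "r"(a3), "r"(xorval));

The aarch64 version above is the same idea: a compare against zero followed by a conditional register select.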

> Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not; otherwise it will never get fixed!

The code already has a comment right above in the x86 case describing why clang doesn't use cmov:

snappy/snappy-internal.h, lines 237 to 245 at 5ec5d16:

// Ideally this would just be
//
// a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
//
// However clang correctly infers that the above statement participates on
// a critical data dependency chain and thus, unfortunately, refuses to
// use a conditional move (it's tuned to cut data dependencies). In this
// case there is a longer parallel chain anyway AND this will be fairly
// unpredictable.

On a hunch that the same reasoning also applied to gcc and aarch64, I tried the direct port and observed improved perf. Here is the compression perf without this change (this code isn't called on the decompression path):

BM_ZFlat/0                      105175 ns       105174 ns         6539 bytes_per_second=928.519M/s html (22.24 %)
BM_ZFlat/1                     1385523 ns      1385503 ns          504 bytes_per_second=483.263M/s urls (47.84 %)
BM_ZFlat/2                        6700 ns         6699 ns       104479 bytes_per_second=17.1121G/s jpg (99.95 %)
BM_ZFlat/3                         333 ns          333 ns      2101538 bytes_per_second=572.801M/s jpg_200 (73.00 %)
BM_ZFlat/4                       11754 ns        11754 ns        58783 bytes_per_second=8.11358G/s pdf (83.31 %)
BM_ZFlat/5                      453779 ns       453780 ns         1540 bytes_per_second=860.825M/s html4 (22.52 %)
BM_ZFlat/6                      430039 ns       430031 ns         1623 bytes_per_second=337.286M/s txt1 (57.87 %)
BM_ZFlat/7                      374322 ns       374315 ns         1865 bytes_per_second=318.929M/s txt2 (62.02 %)
BM_ZFlat/8                     1159839 ns      1159756 ns          601 bytes_per_second=350.923M/s txt3 (55.17 %)
BM_ZFlat/9                     1523393 ns      1523306 ns          459 bytes_per_second=301.672M/s txt4 (66.41 %)
BM_ZFlat/10                      97944 ns        97940 ns         6771 bytes_per_second=1.12766G/s pb (19.61 %)
BM_ZFlat/11                     370635 ns       370624 ns         1881 bytes_per_second=474.284M/s gaviota (37.73 %)
BM_ZFlatAll                    6184921 ns      6184674 ns          111 bytes_per_second=451.585M/s 12 kTestDataFiles

The full results with the change are in the PR description, but to summarize, BM_ZFlatAll shows 473.433M/s, a 4.8% improvement. The pb case (protobufs?) was the most improved at 16%.

Based on that comment, I'm not sure a bug report would be helpful here. Compilers are able to generate csel/cmov for this pattern; they decide not to because they have reason to suspect that, in the general case, it would do more harm than good. In this specific case, however, there is local knowledge that cmov/csel is worth it, with a fairly large end-to-end impact. Extending what the comment says (with some of my own guesses): the desired code adds an unconditional dependency on a load that the compiler would otherwise sink into the branch, where it could be skipped entirely. Here the line is very likely already in L1, but the compiler can't know that, and a miss on that load would be very bad for perf. If that is part of the reasoning, it seems entirely reasonable for the compiler to be conservative about adding data dependencies on loads.
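
A minimal sketch of the tradeoff (simplified, not the real FindMatchLength; UNALIGNED_LOAD64 is snappy's helper from snappy-stubs-internal.h):

#include <cstdint>

// With a branch, the compiler can sink the load into the taken path:
uint64_t with_branch(const char* s2, uint64_t xorval, uint64_t a2) {
  if (static_cast<uint32_t>(xorval) == 0) {
    a2 = UNALIGNED_LOAD64(s2 + 4);  // skipped entirely when not taken
  }
  return a2;
}

// A csel/cmov needs both inputs ready, so the load becomes unconditional
// and is pulled onto the critical dependency chain:
uint64_t with_csel(const char* s2, uint64_t xorval, uint64_t a2) {
  uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);  // always executes
  return static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
}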

That said, I do plan to file some gcc bugs after this. In particular, the codegen for memmove-by-memcpy-through-temp-array is really bad and has even regressed since the last release.

> What does the generated code look like before/after?

It is a bit hard to compare because this function is inlined into a larger one, and the change shifts register allocation and renumbers labels, but here is the relevant bit (with a few common instructions before and after):

        rbit    x6, x1
        clz     x6, x6
        asr     w2, w6, 3
        cbnz    w1, .L79
        ldr     x5, [x9, 8]
.L79:
        add     w2, w2, 4
        and     w6, w6, 24
        sub     x4, x9, x4
        sxtw    x2, w2
        add     x1, x9, x2
        lsr     x6, x5, x6
.L80:
        lsl     w5, w4, 8
        rbit    x1, x8
        clz     x1, x1
        and     w7, w1, 24
        lsr     x1, x1, 3
        add     x1, x1, 4
        ldr     x14, [x4, 8]
        add     x4, x4, x1
        cmp     w8, 0
        csel    x2, x14, x2, eq
        lsr     x7, x2, x7
.L79:
        lsl     w2, w11, 8

Reviewer:

Getting back to this, GCC 14 fixes the memmove-by-memcpy-through-temp-array issue (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618): https://www.godbolt.org/z/qMraefr1K

There is a redundant stack reservation for larger sizes but this won't be an issue in non-trivial functions.

Comment on lines +103 to +105
// GCC dispatches to libc for memmoves > 16 bytes, so we need to
// do some work to get good code from that compiler. Clang handles
// powers-of-2 at least up to 64 well.
RedBeard0531 (Author):

Sigh. This is really annoying and I don't feel good about this code. However, the perf impact is massive.

https://www.godbolt.org/z/EjP9zv3e7

It looks like gcc is better for aarch64 in trunk, but not in any released version (as of today), and not for x86_64 either.
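
For context, the trick being discussed is to copy through a local temporary so the compiler sees two non-overlapping fixed-size memcpys that it can fully inline. A minimal sketch of what FixedSizeMemMove<N> presumably boils down to (the name comes from this PR's diff; the body is my assumption):

#include <cstddef>
#include <cstring>

template <std::size_t N>
inline void FixedSizeMemMove(void* dst, const void* src) {
  char tmp[N];
  std::memcpy(tmp, src, N);  // all N bytes are loaded first...
  std::memcpy(dst, tmp, N);  // ...then all N bytes are stored
}

When this inlines well, tmp never materializes; the bytes stay in registers, so overlapping dst/src are handled without a libc memmove call.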

Comment on lines +1114 to +1118
#elif defined(__aarch64__)
// Empirically it is faster to just copy all 64 rather than branching.
(void)kShortMemCopy;
(void)size;
FixedSizeMemMove<64>(dst, src);
RedBeard0531 (Author):

This is the only novel part of the patch that isn't just massaging the code to do the same thing slightly differently. If you'd prefer, I can pull it out into a separate PR. I also tried move32(); if (size > 32) moveAnother32(); (rather than dispatching to the variable-length memmove in that case) and the unconditional version was still faster; see the sketch below.

Note that this makes the FixedSizeMemMove<32> specialization dead code. So it may make sense to remove it if you keep this change. Alternatively, maybe keep it in case you add another fixed-size memmove of 32 bytes.
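
Roughly, the variant I tried and rejected versus what the patch does (a schematic reconstruction, not the literal diff; dst and src are char* here):

// Tried: make the second half of the copy conditional.
FixedSizeMemMove<32>(dst, src);
if (size > 32) FixedSizeMemMove<32>(dst + 32, src + 32);

// Landed: always copy 64 bytes. The extra 32-byte copy is cheaper than the
// branch on a core that keeps multiple loads/stores in flight.
FixedSizeMemMove<64>(dst, src);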

Reviewer:

It seems to me all this is overengineering: we know that src < dst, since it's typical for a decompressor to copy earlier bytes to append them to the end of a buffer. Writing more bytes than requested would be impossible if dst < src, since it would corrupt src.

So using memcpy is safe even if there is overlap beyond the requested copy size. The order of the emitted loads and stores doesn't matter.

RedBeard0531 (Author):

> It seems to me all this is overengineering

Are you referring to my change right here to unconditionally copy 64 bytes, the existing code (which was already calling memmove), or my introduction of FixedSizeMemMove?

My only change at this location was to stop branching on size > 32 and just unconditionally copy all 64 bytes. The second copy is cheap enough to not be worth the overhead of making it conditional. This is of course cpu-dependent, but at least seems likely to be the case for any modern CPU core that can have multiple loads/stores in flight concurrently. I only did this change on aarch64 since that is what I was tuning on, but it is probably also worth testing a similar change for the x86_64 version immediately above this.

> So using memcpy is safe even if there is overlap beyond the requested copy size.

I had considered just using memcpy, because in practice it should be fine for this use case; however, there were a few reasons I didn't:

  1. The justification for safety relies on memcpy copying bytes from front to back, but that isn't guaranteed. It would not be safe if some implementation decided to copy from back to front.
  2. According to the spec, it is immediate UB to call memcpy with overlapping buffers, not simply an unspecified result. This means that compilers could nasal demon you. That is probably unlikely here, but more problematic is that it would make it impossible to use UBSAN with snappy.
  3. I suspect (admittedly without testing) that the codegen for memcpy (which interleaves loads and stores) will have worse performance in this case. If there is overlap, a later load is no longer independent of earlier stores. This can trigger a store-to-load hazard in the CPU and prevent them from executing fully in parallel. To make it worse, this code will often be dealing with unaligned loads/stores*, which are fast in the normal case but have an even higher load-store penalty, often requiring a full flush of the store buffer. This is completely avoided by doing all loads before the stores; see the schematic after the footnote below. (Of course, you could still have overlap with a prior call to MemCopy64, but that is much less problematic than having it within a single call.)

* In addition to having addresses that are not naturally aligned, the potentially bigger problem is that the loads would often be aligned differently from the preceding stores, meaning that a load would need to pull from multiple stores, which CPUs tend to punt on (or place severe restrictions on).
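
To make point 3 concrete, here is how the two schedules differ for a 32-byte copy where dst = src + 8 (schematic aarch64; register assignments hypothetical):

        // Interleaved, as released gcc emits for inlined memcpy
        // (x0 = src, x1 = dst = src + 8):
        ldr     q0, [x0]
        str     q0, [x1]          // writes src+8 .. src+23
        ldr     q1, [x0, #16]     // reads src+16 .. src+31: overlaps the store
        str     q1, [x1, #16]     //   above, and with different alignment

        // Loads-before-stores, as the temp-array pattern produces:
        ldr     q0, [x0]
        ldr     q1, [x0, #16]     // no load reads bytes stored by this copy
        str     q0, [x1]
        str     q1, [x1, #16]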

Reviewer:

I'm referring to all the optimized code that does the copy, both your addition and the existing code. It's correct to use memcpy here. The order of the loads/stores does not matter; any order will work (i.e., your point 1 does not hold). As it happens, on AArch64 all loads are emitted before all stores, so store-load hazards cannot occur (i.e., point 3 is not an issue): https://godbolt.org/z/hva13E3z8

So it would be fine to prefer memmove on targets that emit it inline and use memcpy as the fallback otherwise (with UBSAN, still prefer memmove). If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC), there would be no need for any of these hacks.

RedBeard0531 (Author):

> If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC), there would be no need for any of these hacks.

Just to be clear, you are preaching to the choir. I have reported several bad-codegen bugs to both gcc and clang. However, even if they were all fixed in trunk tomorrow, that doesn't help me, since I need to get good perf out of the compiler we use in production. There can be over a year of lag between something hitting gcc trunk and it appearing in a released compiler, since perf improvements are often not backported. Then I need to wait for us to upgrade our toolchain (which is why I'm testing against gcc 11.2 🙁). So there is still a need for perf hacks to convince current compilers to generate good code. I'll leave it up to the snappy maintainers whether they want to wait for compilers to improve or merge a change that gets good perf out of existing compilers.

> As it happens, on AArch64 all loads are emitted before all stores, so store-load hazards cannot occur

It looks like gcc has changed its codegen on trunk vs. the latest release. On released versions it does the interleaved loads and stores for memcpy. I'm not sure that doing the loads upfront in memcpy is always better, especially after inlining into real code, since that requires more registers (more of a problem for longer memcpys and when inlining into functions already under register pressure).

> The order of the loads/stores does not matter; any order will work (i.e., your point 1 does not hold)

Hmm, I think you're right, my bad.

I still don't love the idea of intentionally calling memcpy with overlapping ranges. But if that is what the snappy maintainers decide to do, then so be it.

Comment on lines +1161 to +1163
const size_t next_tag = is_literal ? (*tag >> 2) + 1 : tag_type;
*tag = ip[next_tag];
ip += (next_tag) + 1;
RedBeard0531 (Author):

This is the change that caused the regression on validate. I checked that, with this change, it generates the desired csinc instruction, which the comment above says is the intended codegen. Here is the perf with just this change reverted:

BM_UFlat/0                       36099 ns        36098 ns        19334 bytes_per_second=2.64188G/s html
BM_UFlat/1                      457307 ns       457297 ns         1530 bytes_per_second=1.42986G/s urls
BM_UFlat/2                        3956 ns         3956 ns       176952 bytes_per_second=28.9819G/s jpg
BM_UFlat/3                         160 ns          160 ns      4394167 bytes_per_second=1.16619G/s jpg_200
BM_UFlat/4                        5788 ns         5788 ns       120704 bytes_per_second=16.478G/s pdf
BM_UFlat/5                      155126 ns       155115 ns         4509 bytes_per_second=2.45927G/s html4
BM_UFlat/6                      171845 ns       171839 ns         4073 bytes_per_second=844.067M/s txt1
BM_UFlat/7                      154670 ns       154670 ns         4525 bytes_per_second=771.836M/s txt2
BM_UFlat/8                      452391 ns       452362 ns         1547 bytes_per_second=899.687M/s txt3
BM_UFlat/9                      649864 ns       649864 ns         1077 bytes_per_second=707.13M/s txt4
BM_UFlat/10                      31264 ns        31263 ns        22349 bytes_per_second=3.53272G/s pb
BM_UFlat/11                     148201 ns       148197 ns         4721 bytes_per_second=1.15833G/s gaviota
BM_UFlatMedley                 2295899 ns      2295829 ns          305 bytes_per_second=1.188G/s
BM_UValidate/0                   20800 ns        20799 ns        33000 bytes_per_second=4.5851G/s html
BM_UValidate/1                  345447 ns       345427 ns         2020 bytes_per_second=1.89293G/s urls
BM_UValidate/2                     138 ns          138 ns      5059972 bytes_per_second=828.941G/s jpg
BM_UValidate/3                    96.5 ns         96.5 ns      7215460 bytes_per_second=1.93108G/s jpg_200
BM_UValidate/4                    1972 ns         1972 ns       354938 bytes_per_second=48.3615G/s pdf
BM_UValidate/5                  107941 ns       107936 ns         6443 bytes_per_second=3.53424G/s html4
BM_UValidate/6                  126895 ns       126888 ns         5501 bytes_per_second=1.11629G/s txt1
BM_UValidate/7                  114857 ns       114856 ns         6066 bytes_per_second=1039.39M/s txt2
BM_UValidate/8                  334870 ns       334859 ns         2087 bytes_per_second=1.18691G/s txt3
BM_UValidate/9                  486272 ns       486261 ns         1443 bytes_per_second=945.045M/s txt4
BM_UValidate/10                  17879 ns        17878 ns        38501 bytes_per_second=6.17749G/s pb
BM_UValidate/11                  92356 ns        92352 ns         7539 bytes_per_second=1.85876G/s gaviota
BM_UValidateMedley             1697835 ns      1697754 ns          412 bytes_per_second=1.6065G/s
BM_UIOVecSource/0               103097 ns       103092 ns         6599 bytes_per_second=947.274M/s html (22.24 %)
BM_UIOVecSource/1              1363926 ns      1363884 ns          512 bytes_per_second=490.923M/s urls (47.84 %)
BM_UIOVecSource/2                11296 ns        11296 ns        61854 bytes_per_second=10.1487G/s jpg (99.95 %)
BM_UIOVecSource/3                  402 ns          402 ns      1740237 bytes_per_second=475.039M/s jpg_200 (73.00 %)
BM_UIOVecSource/4                15463 ns        15463 ns        44743 bytes_per_second=6.16753G/s pdf (83.31 %)
BM_UIOVecSource/5               442147 ns       442130 ns         1580 bytes_per_second=883.507M/s html4 (22.52 %)
BM_UIOVecSource/6               416784 ns       416773 ns         1672 bytes_per_second=348.015M/s txt1 (57.87 %)
BM_UIOVecSource/7               367446 ns       367440 ns         1897 bytes_per_second=324.897M/s txt2 (62.02 %)
BM_UIOVecSource/8              1121587 ns      1121523 ns          622 bytes_per_second=362.885M/s txt3 (55.17 %)
BM_UIOVecSource/9              1522044 ns      1521967 ns          459 bytes_per_second=301.937M/s txt4 (66.41 %)
BM_UIOVecSource/10               90576 ns        90574 ns         7349 bytes_per_second=1.21938G/s pb (19.61 %)
BM_UIOVecSource/11              354641 ns       354634 ns         1954 bytes_per_second=495.669M/s gaviota (37.73 %)
BM_UIOVecSink/0                 102030 ns       102026 ns         6793 bytes_per_second=957.168M/s html
BM_UIOVecSink/1                 954178 ns       954179 ns          732 bytes_per_second=701.715M/s urls
BM_UIOVecSink/2                   4264 ns         4263 ns       164116 bytes_per_second=26.8888G/s jpg
BM_UIOVecSink/3                    321 ns          321 ns      2178205 bytes_per_second=593.638M/s jpg_200
BM_UIOVecSink/4                   9772 ns         9772 ns        70791 bytes_per_second=9.75947G/s pdf
BM_UFlatSink/0                   36097 ns        36096 ns        19360 bytes_per_second=2.64205G/s html
BM_UFlatSink/1                  457191 ns       457142 ns         1530 bytes_per_second=1.43034G/s urls
BM_UFlatSink/2                    3920 ns         3919 ns       178767 bytes_per_second=29.2498G/s jpg
BM_UFlatSink/3                     165 ns          165 ns      4253941 bytes_per_second=1.13175G/s jpg_200
BM_UFlatSink/4                    5820 ns         5820 ns       119850 bytes_per_second=16.3866G/s pdf
BM_UFlatSink/5                  154998 ns       154992 ns         4515 bytes_per_second=2.46122G/s html4
BM_UFlatSink/6                  171897 ns       171891 ns         4075 bytes_per_second=843.809M/s txt1
BM_UFlatSink/7                  154886 ns       154878 ns         4513 bytes_per_second=770.801M/s txt2
BM_UFlatSink/8                  452911 ns       452904 ns         1545 bytes_per_second=898.611M/s txt3
BM_UFlatSink/9                  649481 ns       649457 ns         1075 bytes_per_second=707.573M/s txt4
BM_UFlatSink/10                  31247 ns        31246 ns        22371 bytes_per_second=3.53464G/s pb
BM_UFlatSink/11                 148706 ns       148703 ns         4701 bytes_per_second=1.15439G/s gaviota
BM_ZFlat/0                       96115 ns        96112 ns         7178 bytes_per_second=1016.07M/s html (22.24 %)
BM_ZFlat/1                     1339282 ns      1339210 ns          521 bytes_per_second=499.968M/s urls (47.84 %)
BM_ZFlat/2                        6655 ns         6654 ns       104791 bytes_per_second=17.2275G/s jpg (99.95 %)
BM_ZFlat/3                         328 ns          328 ns      2136911 bytes_per_second=580.819M/s jpg_200 (73.00 %)
BM_ZFlat/4                       11728 ns        11728 ns        58781 bytes_per_second=8.13163G/s pdf (83.31 %)
BM_ZFlat/5                      424216 ns       424212 ns         1653 bytes_per_second=920.824M/s html4 (22.52 %)
BM_ZFlat/6                      411154 ns       411134 ns         1696 bytes_per_second=352.789M/s txt1 (57.87 %)
BM_ZFlat/7                      362120 ns       362110 ns         1925 bytes_per_second=329.679M/s txt2 (62.02 %)
BM_ZFlat/8                     1106496 ns      1106418 ns          630 bytes_per_second=367.84M/s txt3 (55.17 %)
BM_ZFlat/9                     1501500 ns      1501436 ns          465 bytes_per_second=306.066M/s txt4 (66.41 %)
BM_ZFlat/10                      86265 ns        86262 ns         7658 bytes_per_second=1.28032G/s pb (19.61 %)
BM_ZFlat/11                     347773 ns       347764 ns         1998 bytes_per_second=505.462M/s gaviota (37.73 %)
BM_ZFlatAll                    5898864 ns      5898386 ns          117 bytes_per_second=473.503M/s 12 kTestDataFiles
BM_ZFlatIncreasingTableSize      33035 ns        33035 ns        20928 bytes_per_second=938.586M/s 7 tables

It is a mixed bag for Validate, but even in the cases where it regresses there, it is still faster for actual decompression.
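
For reference, csinc is AArch64's conditional select-and-increment: csinc Xd, Xn, Xm, cond yields Xn when cond holds and Xm + 1 otherwise, so the ternary above folds into one branch-free instruction. A sketch of the expected mapping (register names hypothetical, and assuming tag_type == 0 means literal):

        lsr     x3, x2, #2        // x2 = *tag, so x3 = *tag >> 2
        cmp     w1, #0            // w1 = tag_type
        csinc   x4, x1, x3, ne    // ne: next_tag = tag_type
                                  // eq: next_tag = (*tag >> 2) + 1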

RedBeard0531 marked this pull request as ready for review January 18, 2024 15:41
RedBeard0531 (Author):

Obviously, all of my claims of "faster" apply only to the core (graviton2/neoverse-n1) and compiler (gcc-11.2) I tested on. I assume you will be able to run this through a gauntlet of platforms you care about prior to merging, to ensure there are no regressions there.

}
}

#ifdef __aarch64__ // Implies neon support

Suggested change:
- #ifdef __aarch64__ // Implies neon support
+ #if SNAPPY_HAVE_VECTOR_BYTE_SHUFFLE

Just in case there is a weird ARM platform that doesn't have the arm_neon header, it may make more sense to ifdef this on the same condition that V128 is defined on, on line 94 a few lines up.

danilak-G (Collaborator):

Hi, sorry for the long delay. I promise to take a look and hopefully merge it this week. At Google we don't care much about gcc, and thus this was overlooked, but we understand the need for the community.
