Perf tuning for gcc + aarch64 #176

Status: Open. Wants to merge 1 commit into main.

24 changes: 18 additions & 6 deletions snappy-internal.h
@@ -233,9 +233,7 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
int shift = Bits::FindLSBSetNonZero64(xorval);
size_t matched_bytes = shift >> 3;
uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);
#ifndef __x86_64__
Author comment:
This block was moved to #else to cleanly add an aarch64 case.

a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
#else
#ifdef __x86_64__
// Ideally this would just be
//
// a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
@@ -250,6 +248,14 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
: "+r"(a2)
: "r"(a3), "r"(xorval)
: "cc");
#elif defined(__aarch64__)
asm("cmp %w[xorval], 0\n\t"
"csel %x[a2], %[a3], %[a2], eq\n\t"
: [a2] "+r" (a2)
: [a3] "r" (a3) , [xorval] "r" (xorval)
: "cc");
Comment on lines +251 to +256
Author comment:
I'm not sure if this is what clang was generating for the default code on aarch64, but this is the direct translation from the x86 asm. I verified that it is faster.

Reviewer comment:
What does the generated code look like before/after? Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not, otherwise it will never get fixed!

Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.
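
For illustration, a wrapper of the kind suggested might look like this (a sketch with hypothetical names, not code from this PR); it confines the asm to one helper and keeps the call sites portable:

#include <cstdint>

// Returns if_zero when the low 32 bits of xorval are zero, otherwise
// if_nonzero, steering the compiler toward a conditional select.
static inline uint64_t SelectIfLow32Zero(uint64_t xorval, uint64_t if_zero,
                                         uint64_t if_nonzero) {
#if defined(__x86_64__)
  asm("testl %k2, %k2\n\t"
      "cmovzq %1, %0\n\t"
      : "+r"(if_nonzero)
      : "r"(if_zero), "r"(xorval)
      : "cc");
  return if_nonzero;
#elif defined(__aarch64__)
  asm("cmp %w[x], 0\n\t"
      "csel %x[r], %[z], %[r], eq\n\t"
      : [r] "+r"(if_nonzero)
      : [z] "r"(if_zero), [x] "r"(xorval)
      : "cc");
  return if_nonzero;
#else
  return static_cast<uint32_t>(xorval) == 0 ? if_zero : if_nonzero;
#endif
}

The call site in FindMatchLength would then reduce to a2 = SelectIfLow32Zero(xorval, a3, a2); on every architecture.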

Author comment:
> Note using random asm like this is generally a bad idea. If absolutely needed, it would be best to abstract it in a generic macro or inline function.

I agree! However, this code was already using inline asm like this right here for x86, and I am trying to follow local style. All I'm doing is adding a direct translation of the x86 asm to aarch64.

> Both GCC and LLVM should use CSEL for this, so I don't understand why this would help. Please file bug reports if not, otherwise it will never get fixed!

The code already has a comment right above in the x86 case describing why clang doesn't use cmov:

snappy/snappy-internal.h, lines 237 to 245 at 5ec5d16:

// Ideally this would just be
//
// a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
//
// However clang correctly infers that the above statement participates on
// a critical data dependency chain and thus, unfortunately, refuses to
// use a conditional move (it's tuned to cut data dependencies). In this
// case there is a longer parallel chain anyway AND this will be fairly
// unpredictable.
On a hunch that the same reasoning also applied to gcc and aarch64, I tried doing the direct port and observed improved perf. Here is the compression perf without this change (since this isn't called in the decompression path):

BM_ZFlat/0                      105175 ns       105174 ns         6539 bytes_per_second=928.519M/s html (22.24 %)
BM_ZFlat/1                     1385523 ns      1385503 ns          504 bytes_per_second=483.263M/s urls (47.84 %)
BM_ZFlat/2                        6700 ns         6699 ns       104479 bytes_per_second=17.1121G/s jpg (99.95 %)
BM_ZFlat/3                         333 ns          333 ns      2101538 bytes_per_second=572.801M/s jpg_200 (73.00 %)
BM_ZFlat/4                       11754 ns        11754 ns        58783 bytes_per_second=8.11358G/s pdf (83.31 %)
BM_ZFlat/5                      453779 ns       453780 ns         1540 bytes_per_second=860.825M/s html4 (22.52 %)
BM_ZFlat/6                      430039 ns       430031 ns         1623 bytes_per_second=337.286M/s txt1 (57.87 %)
BM_ZFlat/7                      374322 ns       374315 ns         1865 bytes_per_second=318.929M/s txt2 (62.02 %)
BM_ZFlat/8                     1159839 ns      1159756 ns          601 bytes_per_second=350.923M/s txt3 (55.17 %)
BM_ZFlat/9                     1523393 ns      1523306 ns          459 bytes_per_second=301.672M/s txt4 (66.41 %)
BM_ZFlat/10                      97944 ns        97940 ns         6771 bytes_per_second=1.12766G/s pb (19.61 %)
BM_ZFlat/11                     370635 ns       370624 ns         1881 bytes_per_second=474.284M/s gaviota (37.73 %)
BM_ZFlatAll                    6184921 ns      6184674 ns          111 bytes_per_second=451.585M/s 12 kTestDataFiles

The full results with the change are in the PR description, but to summarize, BM_ZFlatAll shows 473.433M/s, a 4.8% improvement. The pb case (protobufs?) was the most improved at 16%.

Based on that comment, I'm not sure that a bug report would necessarily be helpful here. It appears that while compilers are able to generate csel/cmov here, they currently decide not to because they have reason to suspect that in the general case it would be more harmful than helpful. However, in this specific case there is local knowledge that using cmov/csel is worth it, with a fairly large end-to-end impact. Extending from what is said in the comment (with some of my own guesses), one of the issues is that the desired code adds an unconditional dependency on a load that the compiler could otherwise sink inside the branch and therefore avoid entirely. In this case the data is very likely to already be in L1, but the compiler may not know that, and if it weren't, that would be very bad for perf. If that is part of the reason, then it seems entirely reasonable for the compiler to be conservative about adding data dependencies to loads.
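
To make that concrete, here is a minimal illustration (my own sketch, not snappy code) of the trade-off described above: the branchy form only executes the load on the taken path, while the branchless form must execute it unconditionally so the select has both operands.

#include <cstdint>

// Branchy: the load of *p can be skipped entirely when cond is false, so it
// never joins the dependency chain on that path.
uint64_t pick_branchy(bool cond, const uint64_t* p, uint64_t fallback) {
  if (cond) return *p;
  return fallback;
}

// Branchless: the load happens on every call so that a csel/cmov can choose
// between the two values; the result now always waits on the load.
uint64_t pick_branchless(bool cond, const uint64_t* p, uint64_t fallback) {
  uint64_t loaded = *p;
  return cond ? loaded : fallback;
}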

That said, I do plan to file some gcc bugs after this. In particular, the codegen for memmove-by-memcpy-through-temp-array is really bad and has even regressed from the last release.

> What does the generated code look like before/after?

It is a bit hard to compare because this function is inlined into a larger one, and the change shifts register allocation around and renumbers labels, but here is the relevant bit (with a few common instructions before and after):

        rbit    x6, x1
        clz     x6, x6
        asr     w2, w6, 3
        cbnz    w1, .L79
        ldr     x5, [x9, 8]
.L79:
        add     w2, w2, 4
        and     w6, w6, 24
        sub     x4, x9, x4
        sxtw    x2, w2
        add     x1, x9, x2
        lsr     x6, x5, x6
.L80:
        lsl     w5, w4, 8
        rbit    x1, x8
        clz     x1, x1
        and     w7, w1, 24
        lsr     x1, x1, 3
        add     x1, x1, 4
        ldr     x14, [x4, 8]
        add     x4, x4, x1
        cmp w8, 0
        csel x2, x14, x2, eq
        lsr     x7, x2, x7
.L79:
        lsl     w2, w11, 8

Reviewer comment:
Getting back to this, GCC14 fixes the memmove-by-memcpy-through-temp-array issue (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618): https://www.godbolt.org/z/qMraefr1K

There is a redundant stack reservation for larger sizes but this won't be an issue in non-trivial functions.
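
For reference, a minimal standalone reproducer of the pattern in question might look like this (my sketch, assuming a 32-byte copy; the exact code behind the Godbolt link is an assumption):

#include <cstring>

// Fixed-size memmove expressed as two memcpys through a temporary, i.e. the
// "memmove-by-memcpy-through-temp-array" shape discussed above. Per the
// comments in this thread, older GCC emits a dead store to tmp plus stack
// traffic on aarch64; GCC 14 cleans this up.
void MoveThrough32(void* dst, const void* src) {
  char tmp[32];
  std::memcpy(tmp, src, 32);
  std::memcpy(dst, tmp, 32);
}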

#else
a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
#endif
*data = a2 >> (shift & (3 * 8));
return std::pair<size_t, bool>(matched_bytes, true);
@@ -276,14 +282,20 @@ static inline std::pair<size_t, bool> FindMatchLength(const char* s1,
int shift = Bits::FindLSBSetNonZero64(xorval);
size_t matched_bytes = shift >> 3;
uint64_t a3 = UNALIGNED_LOAD64(s2 + 4);
#ifndef __x86_64__
a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
#else
#ifdef __x86_64__
asm("testl %k2, %k2\n\t"
"cmovzq %1, %0\n\t"
: "+r"(a2)
: "r"(a3), "r"(xorval)
: "cc");
#elif defined(__aarch64__)
asm("cmp %w[xorval], 0\n\t"
"csel %x[a2], %[a3], %[a2], eq\n\t"
: [a2] "+r" (a2)
: [a3] "r" (a3) , [xorval] "r" (xorval)
: "cc");
#else
a2 = static_cast<uint32_t>(xorval) == 0 ? a3 : a2;
#endif
*data = a2 >> (shift & (3 * 8));
matched += matched_bytes;
79 changes: 65 additions & 14 deletions snappy.cc
@@ -100,6 +100,57 @@ using internal::V128_StoreU;
using internal::V128_DupChar;
#endif

// GCC dispatches to libc for memmoves > 16 bytes, so we need to
// do some work to get good code from that compiler. Clang handles
// powers-of-2 at least up to 64 well.
Comment on lines +103 to +105
Author comment:
Sigh. This is really annoying and I don't feel good about this code. However, the perf impact is massive.

https://www.godbolt.org/z/EjP9zv3e7

It looks like gcc is better for aarch64 in trunk, but not in any released version (as of today), and not for x86_64 either.
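
A minimal shape of what that Godbolt comparison exercises might be the following (my sketch, not part of the diff below; the exact test code in the link is an assumption):

#include <cstring>

// A plain fixed-size 32-byte memmove. Per the comment above, released GCC
// versions lower this to a libc call instead of expanding it inline, while
// Clang expands power-of-two sizes at least up to 64 inline.
void Move32(void* dst, const void* src) {
  std::memmove(dst, src, 32);
}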

#if !defined(__GNUC__) || defined(__clang__)
template <size_t SIZE>
SNAPPY_ATTRIBUTE_ALWAYS_INLINE
inline void FixedSizeMemMove(void* dest, const void* src) {
memmove(dest, src, SIZE);
}
#else

template <size_t SIZE>
SNAPPY_ATTRIBUTE_ALWAYS_INLINE
inline void FixedSizeMemMove(void* dest, const void* src) {
if (SIZE <= 16) {
// gcc has patterns for memmove up to 16 bytes
memmove(dest, src, SIZE);
} else {
// This generates reasonable code on x86_64, but on aarch64 this produces a
// dead store to tmp, plus takes up stack space.
char tmp[SIZE];
memcpy(tmp, src, SIZE);
memcpy(dest, tmp, SIZE);
}
}

#ifdef __aarch64__ // Implies neon support

Reviewer comment:

Suggested change
#ifdef __aarch64__ // Implies neon support
#if SNAPPY_HAVE_VECTOR_BYTE_SHUFFLE

Just in case there is a weird ARM platform that doesn't have the arm_neon header, it may make more sense to ifdef this on the same condition that V128 is defined on, on line 94 a few lines up.

template <>
SNAPPY_ATTRIBUTE_ALWAYS_INLINE
inline void FixedSizeMemMove<32>(void* dest, const void* src) {
V128 a = V128_LoadU(reinterpret_cast<const V128*>(src));
V128 b = V128_LoadU(reinterpret_cast<const V128*>(src) + 1);
V128_StoreU(reinterpret_cast<V128*>(dest), a);
V128_StoreU(reinterpret_cast<V128*>(dest) + 1, b);
}

template <>
SNAPPY_ATTRIBUTE_ALWAYS_INLINE
inline void FixedSizeMemMove<64>(void* dest, const void* src) {
V128 a = V128_LoadU(reinterpret_cast<const V128*>(src));
V128 b = V128_LoadU(reinterpret_cast<const V128*>(src) + 1);
V128 c = V128_LoadU(reinterpret_cast<const V128*>(src) + 2);
V128 d = V128_LoadU(reinterpret_cast<const V128*>(src) + 3);
V128_StoreU(reinterpret_cast<V128*>(dest), a);
V128_StoreU(reinterpret_cast<V128*>(dest) + 1, b);
V128_StoreU(reinterpret_cast<V128*>(dest) + 2, c);
V128_StoreU(reinterpret_cast<V128*>(dest) + 3, d);
}
#endif
#endif

// We translate the information encoded in a tag through a lookup table to a
// format that requires fewer instructions to decode. Effectively we store
// the length minus the tag part of the offset. The lowest significant byte
@@ -1060,13 +1111,18 @@ void MemCopy64(char* dst, const void* src, size_t size) {
data = _mm256_lddqu_si256(static_cast<const __m256i *>(src) + 1);
_mm256_storeu_si256(reinterpret_cast<__m256i *>(dst) + 1, data);
}
#elif defined(__aarch64__)
// Empirically it is faster to just copy all 64 rather than branching.
(void)kShortMemCopy;
(void)size;
FixedSizeMemMove<64>(dst, src);
Comment on lines +1114 to +1118
Author comment:
This is the only novel part of the patch that isn't just massaging the code to do the same thing slightly differently. If you'd prefer, I can pull it out to a separate PR. I also tried move32(); if (size>32) moveAnother32() (rather than dispatching to variable-length memmove in that case) and the unconditional version was still faster.

Note that this makes the FixedSizeMemMove<32> specialization dead code. So it may make sense to remove it if you keep this change. Alternatively, maybe keep it in case you add another fixed-size memmove of 32 bytes.
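
For reference, the branchy alternative described above might look roughly like this (a hypothetical sketch reusing the FixedSizeMemMove helper and SNAPPY_PREDICT_FALSE macro from this patch; it measured slower than the unconditional 64-byte copy on the aarch64 core tested):

inline void MemCopy64Branchy(char* dst, const void* src, size_t size) {
  // Copy the first 32 bytes unconditionally; profiling shows most copies are short.
  FixedSizeMemMove<32>(dst, src);
  // Only touch the second half when the copy is actually longer than 32 bytes.
  if (SNAPPY_PREDICT_FALSE(size > 32)) {
    FixedSizeMemMove<32>(dst + 32, static_cast<const uint8_t*>(src) + 32);
  }
}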

Reviewer comment:
It seems to me all this is overengineering - we know that src < dst since it's typical for a decompressor to copy earlier bytes to append to the end of a buffer. Writing more bytes than requested would be impossible if dst < src since it will corrupt src.

So using memcpy is safe even if there is overlap beyond the requested copy size. The order of the emitted loads and stores doesn't matter.

Author comment:

> It seems to me all this is overengineering

Are you referring to my change right here to unconditionally copy 64 bytes, the existing code (which was already calling memmove), or my introduction of FixedSizeMemMove?

My only change at this location was to stop branching on size > 32 and just unconditionally copy all 64 bytes. The second copy is cheap enough to not be worth the overhead of making it conditional. This is of course cpu-dependent, but at least seems likely to be the case for any modern CPU core that can have multiple loads/stores in flight concurrently. I only did this change on aarch64 since that is what I was tuning on, but it is probably also worth testing a similar change for the x86_64 version immediately above this.

> So using memcpy is safe even if there is overlap beyond the requested copy size.

I had considered just using memcpy because in practice it should be fine for this use case; however, there were a few reasons I didn't:

  1. The justification for safety relies on memcpy copying bytes from front to back, but that isn't guaranteed. It would not be safe if some implementation decided to copy from back to front.
  2. According to the spec, it is immediate UB to call memcpy with overlapping buffers, not simply an unspecified result. This means that compilers could nasal demon you. That is probably unlikely here, but more problematic is that it would make it impossible to use UBSAN with snappy.
  3. I suspect (admittedly without testing) that the codegen for memcpy (which interleaves loads and stores) will have worse performance in this case. If there is overlap, a later load is no longer independent of earlier stores. This can trigger a store-to-load hazard in the CPU and prevent them from executing fully in parallel. To make it worse, this code will often be dealing with unaligned loads/stores* which are fast in the normal case, but have an even higher load-store penalty, often requiring a full flush of the store buffer. This is completely avoided by doing all loads before the stores. (Of course you could still have overlap with a prior call to MemCopy64, but that is much less problematic than having it within a single call)

* In addition to having addresses that are not naturally aligned, the potentially bigger problem is that the loads would often be differently aligned from the preceding stores, meaning that the load would need to pull from multiple stores, which CPUs tend to punt on (or place severe restrictions on).
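
To illustrate point 3 (my own sketch, not snappy code): with a potentially overlapping src/dst pair, issuing all loads before any store keeps every load in the copy independent of its own stores, while an interleaved sequence lets a later load alias an earlier store and hit the forwarding penalty described above.

#include <cstdint>
#include <cstring>

// All loads first, then all stores: no load in this call can depend on a
// store from this call, even if [src, src+32) overlaps [dst, dst+32).
void Copy32LoadsFirst(char* dst, const char* src) {
  uint64_t a, b, c, d;
  std::memcpy(&a, src, 8);
  std::memcpy(&b, src + 8, 8);
  std::memcpy(&c, src + 16, 8);
  std::memcpy(&d, src + 24, 8);
  std::memcpy(dst, &a, 8);
  std::memcpy(dst + 8, &b, 8);
  std::memcpy(dst + 16, &c, 8);
  std::memcpy(dst + 24, &d, 8);
}

// Interleaved: when the ranges overlap, a later 8-byte load can read bytes
// written by an earlier 8-byte store in the same call, and with mismatched
// alignment the CPU may stall on store-to-load forwarding.
void Copy32Interleaved(char* dst, const char* src) {
  uint64_t t;
  for (int i = 0; i < 32; i += 8) {
    std::memcpy(&t, src + i, 8);
    std::memcpy(dst + i, &t, 8);
  }
}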

Reviewer comment:
I'm referring to all the optimized code that does the copy, both your addition and the existing code. It's correct to use memcpy here. The order of the loads/stores does not matter, any order will work (i.e. your point 1 does not hold). As it happens, on AArch64 all loads are emitted before all stores so store-load hazards cannot occur (i.e. point 3 is not an issue): https://godbolt.org/z/hva13E3z8

So it would be fine to prefer memmove on targets that emit it inline and use memcpy otherwise as the fallback (with UBSAN still prefer memmove). If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC) then there would be no need for any of these hacks.
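
A sketch of that fallback scheme (hypothetical, not code proposed in this PR) could reuse the compiler check the patch already uses for FixedSizeMemMove:

#include <cstddef>
#include <cstring>

template <size_t SIZE>
inline void FixedSizeMemMoveOrCpy(void* dst, const void* src) {
#if !defined(__GNUC__) || defined(__clang__)
  // Clang expands a small fixed-size memmove inline, so keep the semantically
  // precise call; this is also the form to keep when building with UBSan.
  std::memmove(dst, src, SIZE);
#else
  // GCC expands fixed-size memcpy inline. Per the overlap argument above, the
  // decompressor only copies earlier bytes forward, so this is safe in
  // practice even when the ranges overlap beyond the requested size.
  std::memcpy(dst, src, SIZE);
#endif
}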

Author comment:

> If people complained about inefficient codegen (x86-64 memcpy and memmove look bad in GCC) then there would be no need for any of these hacks

Just to be clear, you are preaching to the choir. I have reported several bad codegen bugs to both gcc and clang. However, even if they were all fixed in trunk tomorrow, that doesn't help me since I need to get good perf out of the compiler we use in production. And there can be over a year of lag between something hitting gcc trunk and it being in a released compiler, since perf improvements are often not backported. Then I need to wait for us to upgrade our toolchain (which is why I'm testing against gcc 11.2 🙁). So there is still a need for perf hacks to convince current compilers to generate good code. I'll leave it up to the snappy maintainers whether they want to wait for compilers to improve or merge a change to get good perf out of existing compilers.

> As it happens on AArch64 all loads are emitted before all stores so store-load hazards cannot occur

It looks like gcc has changed its codegen on trunk vs the latest release. On released versions it does the interleaved loads and stores for memcpy. I'm not sure that doing the loads up front in memcpy is always better, especially after inlining into real code, since that requires more registers (more of a problem for longer memcpys and when inlined into functions already under register pressure).

> The order of the loads/stores does not matter, any order will work (i.e. your point 1 does not hold)

Hmm, I think you're right, my bad.

I still don't love the idea of intentionally calling memcpy with overlapping ranges. But if that is what the snappy maintainers decide to do, then so be it.

#else
std::memmove(dst, src, kShortMemCopy);
FixedSizeMemMove<kShortMemCopy>(dst, src);
// Profiling shows that nearly all copies are short.
if (SNAPPY_PREDICT_FALSE(size > kShortMemCopy)) {
std::memmove(dst + kShortMemCopy,
static_cast<const uint8_t*>(src) + kShortMemCopy,
64 - kShortMemCopy);
FixedSizeMemMove<kShortMemCopy>(
dst + kShortMemCopy,
static_cast<const uint8_t*>(src) + kShortMemCopy);
}
#endif
}
@@ -1102,14 +1158,9 @@ inline size_t AdvanceToNextTagARMOptimized(const uint8_t** ip_p, size_t* tag) {
// instruction (csinc) and it removes several register moves.
const size_t tag_type = *tag & 3;
const bool is_literal = (tag_type == 0);
if (is_literal) {
size_t next_literal_tag = (*tag >> 2) + 1;
*tag = ip[next_literal_tag];
ip += next_literal_tag + 1;
} else {
*tag = ip[tag_type];
ip += tag_type + 1;
}
const size_t next_tag = is_literal ? (*tag >> 2) + 1 : tag_type;
*tag = ip[next_tag];
ip += (next_tag) + 1;
Comment on lines +1161 to +1163
Author comment:
This is the change that caused the regression on validate. I checked that with this change it is generating the desired csinc instruction, which the comment above says is the intended codegen. Here is the perf with just this change reverted:

BM_UFlat/0                       36099 ns        36098 ns        19334 bytes_per_second=2.64188G/s html
BM_UFlat/1                      457307 ns       457297 ns         1530 bytes_per_second=1.42986G/s urls
BM_UFlat/2                        3956 ns         3956 ns       176952 bytes_per_second=28.9819G/s jpg
BM_UFlat/3                         160 ns          160 ns      4394167 bytes_per_second=1.16619G/s jpg_200
BM_UFlat/4                        5788 ns         5788 ns       120704 bytes_per_second=16.478G/s pdf
BM_UFlat/5                      155126 ns       155115 ns         4509 bytes_per_second=2.45927G/s html4
BM_UFlat/6                      171845 ns       171839 ns         4073 bytes_per_second=844.067M/s txt1
BM_UFlat/7                      154670 ns       154670 ns         4525 bytes_per_second=771.836M/s txt2
BM_UFlat/8                      452391 ns       452362 ns         1547 bytes_per_second=899.687M/s txt3
BM_UFlat/9                      649864 ns       649864 ns         1077 bytes_per_second=707.13M/s txt4
BM_UFlat/10                      31264 ns        31263 ns        22349 bytes_per_second=3.53272G/s pb
BM_UFlat/11                     148201 ns       148197 ns         4721 bytes_per_second=1.15833G/s gaviota
BM_UFlatMedley                 2295899 ns      2295829 ns          305 bytes_per_second=1.188G/s
BM_UValidate/0                   20800 ns        20799 ns        33000 bytes_per_second=4.5851G/s html
BM_UValidate/1                  345447 ns       345427 ns         2020 bytes_per_second=1.89293G/s urls
BM_UValidate/2                     138 ns          138 ns      5059972 bytes_per_second=828.941G/s jpg
BM_UValidate/3                    96.5 ns         96.5 ns      7215460 bytes_per_second=1.93108G/s jpg_200
BM_UValidate/4                    1972 ns         1972 ns       354938 bytes_per_second=48.3615G/s pdf
BM_UValidate/5                  107941 ns       107936 ns         6443 bytes_per_second=3.53424G/s html4
BM_UValidate/6                  126895 ns       126888 ns         5501 bytes_per_second=1.11629G/s txt1
BM_UValidate/7                  114857 ns       114856 ns         6066 bytes_per_second=1039.39M/s txt2
BM_UValidate/8                  334870 ns       334859 ns         2087 bytes_per_second=1.18691G/s txt3
BM_UValidate/9                  486272 ns       486261 ns         1443 bytes_per_second=945.045M/s txt4
BM_UValidate/10                  17879 ns        17878 ns        38501 bytes_per_second=6.17749G/s pb
BM_UValidate/11                  92356 ns        92352 ns         7539 bytes_per_second=1.85876G/s gaviota
BM_UValidateMedley             1697835 ns      1697754 ns          412 bytes_per_second=1.6065G/s
BM_UIOVecSource/0               103097 ns       103092 ns         6599 bytes_per_second=947.274M/s html (22.24 %)
BM_UIOVecSource/1              1363926 ns      1363884 ns          512 bytes_per_second=490.923M/s urls (47.84 %)
BM_UIOVecSource/2                11296 ns        11296 ns        61854 bytes_per_second=10.1487G/s jpg (99.95 %)
BM_UIOVecSource/3                  402 ns          402 ns      1740237 bytes_per_second=475.039M/s jpg_200 (73.00 %)
BM_UIOVecSource/4                15463 ns        15463 ns        44743 bytes_per_second=6.16753G/s pdf (83.31 %)
BM_UIOVecSource/5               442147 ns       442130 ns         1580 bytes_per_second=883.507M/s html4 (22.52 %)
BM_UIOVecSource/6               416784 ns       416773 ns         1672 bytes_per_second=348.015M/s txt1 (57.87 %)
BM_UIOVecSource/7               367446 ns       367440 ns         1897 bytes_per_second=324.897M/s txt2 (62.02 %)
BM_UIOVecSource/8              1121587 ns      1121523 ns          622 bytes_per_second=362.885M/s txt3 (55.17 %)
BM_UIOVecSource/9              1522044 ns      1521967 ns          459 bytes_per_second=301.937M/s txt4 (66.41 %)
BM_UIOVecSource/10               90576 ns        90574 ns         7349 bytes_per_second=1.21938G/s pb (19.61 %)
BM_UIOVecSource/11              354641 ns       354634 ns         1954 bytes_per_second=495.669M/s gaviota (37.73 %)
BM_UIOVecSink/0                 102030 ns       102026 ns         6793 bytes_per_second=957.168M/s html
BM_UIOVecSink/1                 954178 ns       954179 ns          732 bytes_per_second=701.715M/s urls
BM_UIOVecSink/2                   4264 ns         4263 ns       164116 bytes_per_second=26.8888G/s jpg
BM_UIOVecSink/3                    321 ns          321 ns      2178205 bytes_per_second=593.638M/s jpg_200
BM_UIOVecSink/4                   9772 ns         9772 ns        70791 bytes_per_second=9.75947G/s pdf
BM_UFlatSink/0                   36097 ns        36096 ns        19360 bytes_per_second=2.64205G/s html
BM_UFlatSink/1                  457191 ns       457142 ns         1530 bytes_per_second=1.43034G/s urls
BM_UFlatSink/2                    3920 ns         3919 ns       178767 bytes_per_second=29.2498G/s jpg
BM_UFlatSink/3                     165 ns          165 ns      4253941 bytes_per_second=1.13175G/s jpg_200
BM_UFlatSink/4                    5820 ns         5820 ns       119850 bytes_per_second=16.3866G/s pdf
BM_UFlatSink/5                  154998 ns       154992 ns         4515 bytes_per_second=2.46122G/s html4
BM_UFlatSink/6                  171897 ns       171891 ns         4075 bytes_per_second=843.809M/s txt1
BM_UFlatSink/7                  154886 ns       154878 ns         4513 bytes_per_second=770.801M/s txt2
BM_UFlatSink/8                  452911 ns       452904 ns         1545 bytes_per_second=898.611M/s txt3
BM_UFlatSink/9                  649481 ns       649457 ns         1075 bytes_per_second=707.573M/s txt4
BM_UFlatSink/10                  31247 ns        31246 ns        22371 bytes_per_second=3.53464G/s pb
BM_UFlatSink/11                 148706 ns       148703 ns         4701 bytes_per_second=1.15439G/s gaviota
BM_ZFlat/0                       96115 ns        96112 ns         7178 bytes_per_second=1016.07M/s html (22.24 %)
BM_ZFlat/1                     1339282 ns      1339210 ns          521 bytes_per_second=499.968M/s urls (47.84 %)
BM_ZFlat/2                        6655 ns         6654 ns       104791 bytes_per_second=17.2275G/s jpg (99.95 %)
BM_ZFlat/3                         328 ns          328 ns      2136911 bytes_per_second=580.819M/s jpg_200 (73.00 %)
BM_ZFlat/4                       11728 ns        11728 ns        58781 bytes_per_second=8.13163G/s pdf (83.31 %)
BM_ZFlat/5                      424216 ns       424212 ns         1653 bytes_per_second=920.824M/s html4 (22.52 %)
BM_ZFlat/6                      411154 ns       411134 ns         1696 bytes_per_second=352.789M/s txt1 (57.87 %)
BM_ZFlat/7                      362120 ns       362110 ns         1925 bytes_per_second=329.679M/s txt2 (62.02 %)
BM_ZFlat/8                     1106496 ns      1106418 ns          630 bytes_per_second=367.84M/s txt3 (55.17 %)
BM_ZFlat/9                     1501500 ns      1501436 ns          465 bytes_per_second=306.066M/s txt4 (66.41 %)
BM_ZFlat/10                      86265 ns        86262 ns         7658 bytes_per_second=1.28032G/s pb (19.61 %)
BM_ZFlat/11                     347773 ns       347764 ns         1998 bytes_per_second=505.462M/s gaviota (37.73 %)
BM_ZFlatAll                    5898864 ns      5898386 ns          117 bytes_per_second=473.503M/s 12 kTestDataFiles
BM_ZFlatIncreasingTableSize      33035 ns        33035 ns        20928 bytes_per_second=938.586M/s 7 tables

It is a mixed bag for Validate, but even in the cases where it regresses there, it is still faster for actual decompression.

return tag_type;
}

@@ -2013,7 +2064,7 @@ class SnappyArrayWriter {
*op_p = IncrementalCopy(op - offset, op, op_end, op_limit_);
return true;
}
std::memmove(op, op - offset, kSlopBytes);
FixedSizeMemMove<kSlopBytes>(op, op - offset);
*op_p = op_end;
return true;
}
@@ -2265,7 +2316,7 @@ class SnappyScatteredWriter {
}
// Fast path
char* const op_end = op + len;
std::memmove(op, op - offset, kSlopBytes);
FixedSizeMemMove<kSlopBytes>(op, op - offset);
*op_p = op_end;
return true;
}