Introduce a new feature to use dynamic dispatch to select between ADX and portable implementations at runtime #174
Also make the `__blst_cpuid` and `__blst_platform_cap` symbols accessible to other C files.
… and portable implementations at runtime

If blst is compiled with the new `-D__BLST_DYNAMIC__` flag, then blst will use dynamic dispatch to run either optimized ADX code or portable code. This is an alternative to `-D__BLST_PORTABLE__` with better performance characteristics.

The implementation uses:
- on GCC/clang with the ELF standard: `ifunc`
- on GCC/clang without ELF: function pointers and `constructor`
- on MSVC: function pointers and `.CRT$XCU`
This feature triggers compilation with `-D__ADX__ -D__BLST_DYNAMIC__`.
But the portable build comes with a run-time switch to ADX code path now[/already]...
Yes, I'm aware. That's implemented using jumps ( I would argue that using indirect calls also simplifies the code, if compared to using conditional jumps, however my PR is adding a new feature, not changing the existing portable feature, so this is a moot point. Last but not least, if the technique used in this PR was applied to the top-level functions (
This arguably qualifies as an extraordinary claim. Because, all other things being equal, there is no mechanism for an indirect jump to deliver better performance, let alone that much. It's either a misinterpretation of results or an indication of an anomaly that you hit accidentally. The latter means that the suggested approach also runs a risk of a comparable penalty. Which, again, is essentially inconceivable. So how do I replicate the results? Provide concrete instructions.
There is an established mechanism to achieve it as is. Compile a pair of shared libraries and choose between the two at application startup time.
My personal [opinion] is that replicating established mechanisms, a.k.a. shared libraries, falls far beyond the focus[/scope] of this project.
Just in case, this implies that you also share your results... [And the expectation is that switching from ADX to portable would result in a significant performance drop.]
In my testing vs BLST, having test+jumps has no measurable overhead even vs hardcoded function calls (i.e. before the runtime selection), nor vs JIT with indirect calls from MCL (https://github.com/herumi/mcl/). Tests ranged from low-level field multiplications to high-level BLS signature verification, including extension fields and elliptic curve arithmetic in between. I've detailed the current hardware architecture, in particular regarding branch prediction, in #10 (comment). Note that branches in this case are 100% predictable.
Oh, so the results were misinterpreted! I mean originally it read 6x to 12x. Well, small percentages are more imaginable. As already mentioned, any particular approach is prone to hitting some anomalies[/quirks] in hardware and variations don't actually reflect inherent advantages of any particular approach, rather the circumstantial nature of performance of complex execution flows. Well, arguably 14% is hard to tolerate, but it still doesn't speak in favour of indirect branches' superiority. So still tell how to reproduce it :-)
Branch prediction logic is akin to a cache, in the sense that predictions can be evicted and as a result branches rendered non-predicted. As already implied in #10, "100%" applies in a controlled environment like benchmarks, but in real life it's more complicated. Heck, you can run into anomalies even in controlled environments, see https://github.com/supranational/blst/blob/master/src/asm/ct_inverse_mod_256-x86_64.pl#L746 for an even local example...
Just in case, it's not like branch predictability is the only factor that can affect the performance. In other words, it's not a given that the variations in question are the direct[/sole] effect of interference with branch prediction logic, so let's not hang on to it... [The remark was rather a reaction to "100%" itself. I mean real life is never "100%." ;-)]
To clarify. In respect to performance. The thing is that there are other criteria that affect the choice. For example, in my book direct jumps are preferable because of security implications from indirect ones...
No worries, don't deal in absolutes ;)
https://easyperf.net/blog/2018/01/18/Code_alignment_issues and also https://lkml.org/lkml/2015/5/21/443 and for anomalies in benchmarks in general and not just code alignment:
Come on! As if I don't know the stuff. My remark was actually more nuanced than that. Well, of course the nuance is not obvious, but it shouldn't grant the assumption that it was a trivial matter of alignment of the subroutine's entry point;-) The hint was the explicit reference to Coffee Lake. Because it's a problem specifically on Coffee Lake. And 40% is totally disproportional in the context, a loop with >16 iterations with strong data dependencies between iterations. Oh well...
This kind of aligns with what I'm talking about. One data point
Summary
If blst is compiled with the new `-D__BLST_DYNAMIC__` flag, then blst will use dynamic dispatch to run either optimized ADX code or portable code. Internal functions like `mul_mont_384` which have an ADX-optimized counterpart (`mulx_mont_384`) will be bound when blst is loaded by the linker.
This is an alternative to `-D__BLST_PORTABLE__` with better performance characteristics.
How this is implemented
This code uses different implementations depending on the platform it's being compiled for:
- On GCC/clang with the ELF standard: `ifunc` is used. This is the same approach used by glibc for functions like `memcpy`, which may have different implementations depending on the CPU capabilities. The way `ifunc` works is similar to how symbols from a dynamic library are loaded.
- On GCC/clang without ELF: function pointers and `constructor`. This is not as efficient as `ifunc` because it generates `call` instructions with pointers as targets.
- On MSVC: function pointers and `.CRT$XCU`. Same remarks as above.
Future improvements
This code dynamically binds low-level internal functions like `mul_mont_384`, but it may be far more efficient to use the same strategy on global functions like `blst_fp_inverse`: that would result in fewer relocations and fewer calls through pointers.
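For illustration, the function-pointer variant described under "How this is implemented" could look roughly like the sketch below. It is not the code from this PR and only targets GCC/clang. All `demo_*` names are hypothetical, both stand-in routines compute the same plain product, and the CPUID test (leaf 7, EBX bit 19 for ADX and bit 8 for BMI2) is an assumed substitute for blst's own `__blst_platform_cap` logic.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
# include <cpuid.h>   /* GCC/clang helper for the CPUID instruction */
#endif

/* Hypothetical stand-ins for a portable routine and its ADX counterpart.
 * In a real build these would be two different assembly implementations. */
static uint64_t demo_mul_portable(uint64_t a, uint64_t b) { return a * b; }
static uint64_t demo_mul_adx(uint64_t a, uint64_t b)      { return a * b; }

/* Returns 1 if the CPU reports both ADX and BMI2 (needed for mulx). */
static int demo_cpu_has_adx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    const unsigned int need = (1u << 19) | (1u << 8); /* ADX | BMI2 */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (ebx & need) == need;
#endif
    return 0;
}

typedef uint64_t (*demo_mul_fn)(uint64_t, uint64_t);

/* Defaults to the portable path so a call before binding is still valid. */
static demo_mul_fn demo_mul_impl = demo_mul_portable;

/* Runs once when the library or executable is loaded, before main(). */
__attribute__((constructor))
static void demo_bind(void)
{
    demo_mul_impl = demo_cpu_has_adx() ? demo_mul_adx : demo_mul_portable;
}

/* Every call goes through the pointer, i.e. an indirect call. */
uint64_t demo_mul(uint64_t a, uint64_t b)
{
    return demo_mul_impl(a, b);
}
```

Callers simply use `demo_mul(a, b)`; the selection happens once at load time. With `ifunc` on ELF targets the resolver plays the role of `demo_bind`, and the dynamic linker binds the chosen symbol directly instead of going through a pointer held in program data.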