Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support 64-bit limbs on no-asm platforms #109

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

uweigand
Copy link

@uweigand uweigand commented Mar 4, 2022

Currently, platforms without assembler support always use 32-bit limbs,
but the Rust bindings always assume 64-bit limbs. This breaks on
big-endian platforms like our IBM Z (s390x).

This patch enables 64-bit limbs on 64-bit platforms even if there is
no hand-written assembler, by using a 128-bit integer type in the
C implementation (this is an extension that is widely supported on
64-bit platforms with GCC or LLVM).

This fixes the broken Rust bindings on IBM Z, and also improves
performance by a factor or 3 or more, because compiler-generated
code handling __int128 already uses the 64x64->128 multiply
instruction our ISA provides.

To improve performance of compiler-generated code a bit more, this
also switches to the -O3 optimization level, which helps with
unrolling of the Montgomery multiply core loop.

@mratsim
Copy link
Contributor

mratsim commented Mar 5, 2022

For all platforms that don't support native 64x64->128 multiply (does MIPS or Riscv5 for example?) is the GCC/Clang fallback constant-time or does it use ternary or comparisons for carries?

@uweigand
Copy link
Author

uweigand commented Mar 7, 2022

For all platforms that don't support native 64x64->128 multiply (does MIPS or Riscv5 for example?) is the GCC/Clang fallback constant-time or does it use ternary or comparisons for carries?

The 64-bit versions of MIPS and RISC-V support a native 64x64->128 multiply (either as a single instruction or as a pair of instructions).

I can find only one 64-bit target supported by LLVM that doesn't have a native implementation (sparc64), and this uses a call to the __multi3 library routine. I guess it depends on where this library is taken from, but at least the LLVM (compiler-rt) default implementation of it doesn't appear to use conditional code paths:
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/multi3.c

If this is a concern, we could make use of 64-bit limbs conditional on the architecture, with s390x (and possibly others where it makes sense) opting in?

@uweigand
Copy link
Author

uweigand commented May 4, 2022

Ping? Any update on this?

@emlowe
Copy link

emlowe commented Sep 1, 2023

I know this PR is over a year old now, but I recently was porting Chia to a VisionFive2 riscv64 SBC (StarFive JH7110 with RISC-V quad-core CPU with 2 MB L2 cache and a monitor core, supporting RV64GC ISA, working up to 1.5 GHz ) and I decided to try this patch to see if there were any speed improvements.

Unfortunately, our wrapper library and test code crashes and dumps core when using this patch. (https://github.com/Chia-Network/bls-signatures)

@uweigand
Copy link
Author

uweigand commented Sep 8, 2023

Unfortunately, our wrapper library and test code crashes and dumps core when using this patch. (https://github.com/Chia-Network/bls-signatures)

I was able to reproduce this on s390x. The problem is this __builtin_assume in src/no_asm.h:

static limb_t quot_rem_n(limb_t *div_rem, const limb_t *divisor,
                                          limb_t quotient, size_t n)
{
    __builtin_assume(n != 0 && n%2 == 0);

which was added by commit 7cc9267 after I originally created and tested this PR.

This assumes that quot_rem_n is always called with an n that is a multiple of two. However, there are only two call sites:

inline limb_t quot_rem_128(limb_t *div_rem, const limb_t *divisor,
                                            limb_t quotient)
{   return quot_rem_n(div_rem, divisor, quotient, NLIMBS(128));   }

inline limb_t quot_rem_64(limb_t *div_rem, const limb_t *divisor,
                                           limb_t quotient)
{   return quot_rem_n(div_rem, divisor, quotient, NLIMBS(64));   }

so n is always equal to NLIMBS(128) or NLIMBS(64). Without this patch, the limb size in this file is always 32, so n is either 4 or 2. But with this patch, the limb size becomes 64, so n is 2 or 1, breaking the assumption.

If I simply remove the && n%2 == 0, the test suite (./build/src/runtest) passes again for me, and the benchmark suite (./build/src/runbench) shows significant improvements, up to a factor of 3 or so.

I think removing this part of the assumption (or even the whole assumption) should be fine on other platforms too, as this function will likely be inlined into its two call sites where n becomes constant anyway. But I haven't verified this.

@emlowe
Copy link

emlowe commented Dec 13, 2023

Took me some time to get back to this.

but YES, applying your suggestion and all tests pass in the VisionFive2 SBC and it is considerably faster at roughly 2.5 times faster for signing and verification - pretty much everything is 2.5 times faster

I would LOVE for this to get merged in - until we get someone to write some actual riscv assembler this is likely as good as it can get.

Currently, platforms without assembler support always use 32-bit limbs,
but the Rust bindings always assume 64-bit limbs.  This breaks on
big-endian platforms like our IBM Z (s390x).

This patch enables 64-bit limbs on 64-bit platforms even if there is
no hand-written assembler, by using a 128-bit integer type in the
C implementation (this is an extension that is widely supported on
64-bit platforms with GCC or LLVM).

Note that this means that the argument "n" to quot_rem_n is no
longer guaranteed to always be a multiple of 2, so the corresponding
assertion needs to be removed as well.

This fixes the broken Rust bindings on IBM Z, and also improves
performance by a factor or 3 or more, because compiler-generated
code handling __int128 already uses the 64x64->128 multiply
instruction our ISA provides.

To improve performance of compiler-generated code a bit more, this
also switches to the -O3 optimization level, which helps with
unrolling of the Montgomery multiply core loop.
@uweigand
Copy link
Author

but YES, applying your suggestion and all tests pass in the VisionFive2 SBC and it is considerably faster at roughly 2.5 times faster for signing and verification - pretty much everything is 2.5 times faster

Thanks for verifying! I've now updated this to fix the incorrect assertion.

@hoffmang9
Copy link

Any chance this can get in with the next major release?

@lemenkov
Copy link

I can confirm this patch indeed fixes the issue on s390x (at least). Right now I was able to pass the c-kzg tests on s390x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants