Optimize instruction fetch and decoding #226
I rebased my optimizations PR on top of Perna's PR, tests passed, but to my surprise the |
This optimizes our RISC-V instruction decoder by using big jump tables, through token threading, to the point that decoding takes just 2 host instructions for most RISC-V instructions, even compressed ones. Overall, the FENCE instruction takes 12 host instructions with both GCC on AMD64 and Clang on ARM64.
Here is the GCC x86_64 trace as proof:
And the Clang arm64:
In emulator v0.18.x, the trace for the same RISC-V FENCE instruction took about 40 x86_64 instructions.
Overall, the speedup ranges from 1.2x to 2x across many benchmarks relative to emulator v0.18.1. Here are the results for many benchmarks run with stress-ng:
Benchmarks
You can see a 1.94x speedup for integer operations. Notably, I am able to reach GHz speed for some simple integer arithmetic benchmarks, with the interpreter being only 10~20x slower than host native.
The table of benchmarks was created by running hyperfine and stress-ng, for example:
PREVIOUS PR ITERATION COMMENTS
This is a micro-optimization at the x86_64 assembly level of the instruction fetch+decode hot path. In summary, this PR should save about 22 x86_64 instructions in every interpreter hot loop iteration. The optimization is not specific to x86_64; all architectures should benefit from it.
Baseline
First I generated a hot trace of consecutive FENCE.I instruction calls. I chose this instruction because it is the simplest one: it basically does nothing, so it is the ideal instruction for measuring instruction fetch overhead. This was the trace for one iteration:
This trace keeps looping in x86_64. We can see that in optimal conditions it takes exactly 40 x86_64 instructions to execute one FENCE.I in this trace, where:
I usually say that the cartesi machine is about 30~40 times slower than native, and the 40:1 ratio in this trace is very close to that. If we can get this trace to execute with fewer x86_64 instructions, the cartesi machine interpreter also gets faster for all instructions (not only this one).
If we look closely at the fetch, there are these two branches:
My idea was to come up with a single branch that could test both conditions in the fetch loop, simplifying to just one branch, so I could save some instructions.
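The two branches themselves are not reproduced above, so the following is only a sketch of the general idea under an assumption: suppose one branch checks that the pc is on the currently cached fetch page and the other checks that a 4-byte fetch does not run past the end of that page. The names `kPageSize`, `fetch_hits_two_checks`, and `fetch_hits_one_check` are hypothetical and not taken from the emulator. Under that assumption, a single XOR and compare can answer both questions at once:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical page size used only for this illustration.
constexpr uint64_t kPageSize = 4096;

// Two separate checks: same page, then "a 4-byte fetch fits in the page".
bool fetch_hits_two_checks(uint64_t pc, uint64_t cached_page) {
    assert((cached_page & (kPageSize - 1)) == 0); // cached_page is page aligned
    if ((pc & ~(kPageSize - 1)) != cached_page) {
        return false; // pc is on a different page
    }
    if ((pc & (kPageSize - 1)) > kPageSize - 4) {
        return false; // a 4-byte fetch would cross the page boundary
    }
    return true;
}

// One check: if pc is on the cached page, pc ^ cached_page is the page offset;
// otherwise some bit at or above bit 12 is set and the value is >= kPageSize.
bool fetch_hits_one_check(uint64_t pc, uint64_t cached_page) {
    assert((cached_page & (kPageSize - 1)) == 0); // cached_page is page aligned
    return (pc ^ cached_page) <= kPageSize - 4;
}
```

Both functions return the same result for every pc, but the second one gives the compiler a single comparison to branch on in the hot path.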
This is the benchmark for baseline.
Round 1 - Optimize fetch
After some thinking I came up with the changes presented in the PR to optimize instruction fetch, which generate the following new trace:
We can see that in optimal conditions it takes exactly 34 x86_64 instructions to execute one FENCE.I in this trace, where:
So in summary, 6 instructions were optimized out of the very hot path. These are the new numbers for the benchmarks:
We can see improvements in all benchmarks, where:
- 1.269 / 1.333 = 95% time to execute, minus 5%.
- 611.176 / 574.50 = ~6% improvement in instruction execution speed for the RV64I instruction set
Round 2 - Optimize decoding for uncompressed instructions
Decoding uses 24 of the 34 instructions in the trace, which is about 70%! It dominates the hot loop trace. Imagine if we could cut it in half; maybe we could with jump tables.
EDIT: I decided to try restructuring the decoding code so that GCC can compile it into jump tables. After some thinking and research I added a new commit to this PR, and this is the new trace:
We can see that it takes exactly 27 x86_64 instructions to execute one FENCE.I in this trace! Where:
However, this adds one memory indirection to look up the jump table, but this is fine, because the table is most likely cached in the L1 CPU cache.
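As a rough sketch of what "decoding code the compiler can turn into a table" can look like, the example below packs funct3 and opcode bits [6:2] of an uncompressed instruction into one small dense index and switches on it. The names `handler`, `decode_index`, and `decode` are hypothetical, and the index layout is an assumption made for illustration, not necessarily the one used in this PR. Only four cases are listed here; a compiler is more likely to emit a real table when the switch covers many dense cases.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical handler ids, for illustration only.
enum class handler : uint8_t { illegal, addi, ld, fence, fence_i };

// Pack funct3 (bits 14:12) and opcode bits 6:2 into a dense 8-bit index.
constexpr uint32_t decode_index(uint32_t insn) {
    return (((insn >> 12) & 0x7u) << 5) | ((insn >> 2) & 0x1fu);
}

// Decode an uncompressed instruction with a single dense switch.
handler decode(uint32_t insn) {
    assert((insn & 3u) == 3u); // assumes the caller already ruled out compressed
    switch (decode_index(insn)) {
        case (0u << 5) | 0x03u: return handler::fence;   // MISC-MEM, funct3=0
        case (1u << 5) | 0x03u: return handler::fence_i; // MISC-MEM, funct3=1
        case (0u << 5) | 0x04u: return handler::addi;    // OP-IMM, funct3=0
        case (3u << 5) | 0x00u: return handler::ld;      // LOAD, funct3=3
        default: return handler::illegal;
    }
}
```

For example, FENCE.I is encoded as 0x0000100f, which has funct3 = 1 and opcode bits [6:2] = 3, so it maps to index 35 and is resolved by one table lookup instead of a chain of comparisons.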
These are the new benchmark numbers:
Whoa, that is:
- FENCE.I instruction is +86% faster
- RV64I instructions are +35% faster
Also, some instructions are over 1GHz!
Round 3 - Single jump with computed gotos
Wasting 4 instructions every iteration just to check whether an instruction is compressed is not ideal. We could try to merge the compressed instruction switch and the uncompressed instruction switch into a single switch.
I first tried a very large switch (2048 entries), but GCC refused to compile it into one large jump table. So I built my own jump table by hand: a large array with 2048 entries, generated by a Lua script and dispatched through GCC's computed goto. This is the new trace:
We can see that it takes exactly 18 x86_64 instructions to execute one FENCE.I in this trace!!! Where:
So we went from 40 instructions in the baseline to 18 instructions. This should improve performance for all instructions, because every instruction goes through fetch and decoding.
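To make the computed-goto dispatch concrete, here is a minimal toy interpreter that uses the same technique: a static array of label addresses indexed by a token, with a `goto *` at the end of every handler. It is only a sketch. The opcodes, the 4-entry table, and the name `run_toy` are hypothetical; the real table in this PR has 2048 entries and its index layout is not shown here. The code relies on the GNU "labels as values" extension, which GCC and Clang both accept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical toy opcodes, not RISC-V encodings.
enum toy_op : uint8_t { OP_INC = 0, OP_DEC = 1, OP_DOUBLE = 2, OP_HALT = 3 };

// Runs a toy program that must end with OP_HALT and returns the accumulator.
int64_t run_toy(const uint8_t *code) {
    assert(code != nullptr);
    // One entry per possible 2-bit token, so the lookup needs no bounds check.
    static void *const table[4] = {&&do_inc, &&do_dec, &&do_double, &&do_halt};
    int64_t acc = 0;
    size_t pc = 0;
// Fetch the next token, index the table, and jump: this is the whole decode step.
#define DISPATCH() goto *table[code[pc++] & 3]
    DISPATCH();
do_inc:
    acc += 1;
    DISPATCH();
do_dec:
    acc -= 1;
    DISPATCH();
do_double:
    acc *= 2;
    DISPATCH();
do_halt:
#undef DISPATCH
    return acc;
}
```

Running it on the program INC, INC, DOUBLE, DEC, HALT returns 3. Because every handler ends with its own indirect jump, there is no central switch to return to, and compressed and uncompressed instructions can share one table as long as the index distinguishes them.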
Let's see the benchmarks:
Whoa, that is:
- FENCE.I instruction is +146% faster
- RV64I instructions are +55% faster
Also, many instructions are over 1GHz speed:
arm64 trace
I also made a trace for this PR on arm64, this is it:
In short:
It looks like arm64 needs fewer host instructions than x86_64 for the same trace.