Please clarify ALU doc #281

Open
stacksmith opened this issue Sep 30, 2024 · 26 comments

Comments

@stacksmith

The ALU writeup in the doc folder is very interesting but does not seem to match reality... In ALU mode, there seem to be only 3 inputs (I1, I2 and I3), unlike the documented A, B, C and D...

The lack of the 4th input and the severely limited ability to configure the LUT makes me wonder if the Gowin FPGAs really have a hardware carry chain, or is it faked in the LUT?

Which way does the carry chain run, topologically (and what is the best placement of registers to propagate the carry?)

Any information or pointers in the right direction are appreciated.

@yrabbit
Collaborator

yrabbit commented Sep 30, 2024

I can only answer for the wires: they run as drawn in the first picture, from left to right along each row of the chip, from the left edge to the right, and inside each cell roughly from LUT0 to LUT5.

The CIN and COUT wires are not connected to the routing fabric, so you can't connect to them directly, only through an ALU.

@stacksmith
Author

Thanks... I'm trying to figure out how to get inside this.

To a suspicious outside observer it certainly looks as if one of the LUT inputs is used as CIN, and there is some partial output from the LUT (normally hidden) used as COUT... This could actually be useful, if there was a way to fully configure the damn LUT. I'd love to use the carry for incrementing, and the lut as a mux, for instance. Xilinx is pretty good that way.

But it does look like some kind of trickery, and I cannot yet see what is gained by this obfuscation. Perhaps the ability to pretend to have carry for marketing purposes?

@yrabbit
Collaborator

yrabbit commented Sep 30, 2024

Well, if you feel the craving for adventures, you can replace the lines

apicula/apycula/gowin_pack.py

Lines 2256 to 2261 in 4f87247

if mode in alu_bel.modes:
    bits = alu_bel.modes[mode]
else:
    bits = alu_bel.modes[str(int(mode, 2))]
for r, c in bits:
    tile[r][c] = 1

with a place_lut call, naturally taking care that you have an INIT parameter holding the LUT contents.
This makes gowin_pack switch the CFU to ALU mode but use your LUT contents.

@pepijndevos
Member

pepijndevos commented Sep 30, 2024

Yeah you totally can program the LUT in any way you like when using it in ALU mode, just not using the vendor primitive. I demonstrate this in the last paragraph of the docs.

The ALU documentation is based on reverse engineering what is actually going on and not on any official documentation, so it is possible we've missed something, but what is described in the docs is very much how we observe it to work. It also closely matches how the Lattice ECP5 ALU works, which is known to have a very similar internal architecture.

I think it is actually possible to use the fourth LUT input in ALU mode, but you have to consider that the lower 4 bits are shared with the second LUT, which is conveniently circumvented by not using the inputs that would use those bits.

I would be open to supporting unofficial ALU modes that have practical applications.

@stacksmith
Author

stacksmith commented Sep 30, 2024

yrabbit: Thanks, that is worth considering. Is mode 2 literally ALU mode 2 or does it mean something different?

pepijndevos: Thank you, that's what I want to hear! How would you go about configuring the LUT and activating ALU? Is there a way without patching gowin_pack, or is yrabbit's patch the only way to go?

With all due respect, you do not demonstrate configuring the ALU in the last paragraph of the doc, only state that you have done so. I would love to see a demonstration!

And thank you for your great work!

@yrabbit
Collaborator

yrabbit commented Sep 30, 2024

yrabbit: Thanks, that is worth considering. Is mode 2 literally ALU mode 2 or does it mean something different?

As far as I remember, yes. You can see what LUT contents we use for which mode:

apicula/apycula/chipdb.py

Lines 347 to 380 in 4f87247

# ADD INIT="0011 0000 1100 1100"
# add 0 add carry
add_alu_mode(mode, bel.modes, lut, "0", "0011000011001100")
# SUB INIT="1010 0000 0101 1010"
# add 0 add carry
add_alu_mode(mode, bel.modes, lut, "1", "1010000001011010")
# ADDSUB INIT="0110 0000 1001 1010"
# add 0 sub carry
add_alu_mode(mode, bel.modes, lut, "2", "0110000010011010")
add_alu_mode(mode, bel.modes, lut, "hadder", "1111000000000000")
# NE INIT="1001 0000 1001 1111"
# add 0 sub carry
add_alu_mode(mode, bel.modes, lut, "3", "1001000010011111")
# GE
add_alu_mode(mode, bel.modes, lut, "4", "1001000010011010")
# LE
# no mode, just swap I0 and I1
# CUP
add_alu_mode(mode, bel.modes, lut, "6", "1010000010100000")
# CDN
add_alu_mode(mode, bel.modes, lut, "7", "0101000001011111")
# CUPCDN
# The functionality of this seems to be the same with SUB
# add_alu_mode(mode, bel.modes, lut, "8", "1010000001011010")
# MULT INIT="0111 1000 1000 1000"
#
add_alu_mode(mode, bel.modes, lut, "9", "0111100010001000")
# CIN->LOGIC INIT="0000 0000 0000 0000"
# nop 0 nop carry
# side effect: clears the carry
add_alu_mode(mode, bel.modes, lut, "C2L", "0000000000000000")
# 1->CIN INIT="0000 0000 0000 1111"
# nop 0 nop carry
add_alu_mode(mode, bel.modes, lut, "ONE2C", "0000000000001111")
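
To read these INIT strings it can help to evaluate them as a plain 4-input truth table. The following is a minimal sketch, not apicula code: `lut4`, `truth_table` and `MODES` are made-up names, and it assumes the string is written MSB first with the selected bit index being D*8 + C*4 + B*2 + A.

```python
# Illustrative helper, not part of apicula.
# Assumption: INIT is written MSB first, and the selected bit index is
# D*8 + C*4 + B*2 + A (A is the least significant select input).
MODES = {
    "ADD": "0011000011001100",
    "SUB": "1010000001011010",
    "C2L": "0000000000000000",
}

def lut4(init: str, a: int, b: int, c: int, d: int) -> int:
    """Return the LUT4 output bit for the given inputs."""
    index = (d << 3) | (c << 2) | (b << 1) | a
    return int(init[15 - index])

def truth_table(init: str, c: int = 1):
    """List (a, b, d, out) rows with C held constant."""
    return [(a, b, d, lut4(init, a, b, c, d))
            for d in (0, 1) for b in (0, 1) for a in (0, 1)]

if __name__ == "__main__":
    for name, init in MODES.items():
        print(name)
        for a, b, d, out in truth_table(init):
            print(f"  A={a} B={b} D={d} -> {out}")
```

Under that assumed ordering, with C held high the ADD string reduces to B XOR D and the SUB string to A XNOR D, which is at least consistent with what an adder and a subtractor need in front of the carry logic.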

@stacksmith
Author

stacksmith commented Sep 30, 2024

Oh snap! So the modes are not hardwired, and you actually stuff the LUTS! I was assuming that the FPGA was somehow dereferencing a hidden ROM... That is actually good news!

I am particularly interested in using all 4 inputs of the LUT as follows: two inputs into the adder, and a third input which can be muxed (with the fourth input), so I can get either the adder result or the other input. All while using the carry to increment (and preferably suppressing the carry when muxing the non-adder input). I think it's doable, although I haven't constructed the bitmap yet.

This would make a perfect Program Counter, for instance, capable of running sequentially, adding an offset, or loading an address (returning from a subroutine, for instance).

Is there no way to use the carry chain with regular LUT configurations? Is the ALU mode a simplification for 'regular people' who feel figuring out a LUT is too hard?

@yrabbit
Collaborator

yrabbit commented Oct 1, 2024

Is there no way to use the carry chain with regular LUT configurations? Is the ALU mode a simplification for 'regular people' who feel figuring out a LUT is too hard?

Well, I'm one of those people. But there are gurus who, if you wake them up in the middle of the night, will draw any LUT with their eyes closed.

Note that it was not by chance that I pointed to a specific place in gowin_pack where you can substitute your INIT: the fuses that switch two adjacent LUTs to ALU mode are set a little higher up in that code. Without them you won't get a carry at all, but once they are set you must take into account all the logic that is connected as in the picture in the documentation (the one with many AND, XOR, etc.). So these are two different configurations: LUT vs ALU.

@stacksmith
Author

stacksmith commented Oct 1, 2024

yrabbit: Thanks. I just need a couple of clarifications (forgive my ignorance):

  • are you suggesting that I add_alu_mode.. my own mode with the LUT the way I like?
  • what about the code immediately below, which seems to limit access to only 3 inputs? Can I just add the missing 'I2':f"C{alu_idx}", right in there?
            bel.portmap = {
                'COUT': f"COUT{alu_idx}",
                'CIN': f"CIN{alu_idx}",
                'SUM': f"F{alu_idx}",
                'I0': f"A{alu_idx}",
                'I1': f"B{alu_idx}",
                'I3': f"D{alu_idx}",
            }
  • I'm not familiar with the sequence of operation of the toolchain. After a modification here in chipdb.py, does anything need to be recompiled in the toolchain itself, or do I just run the usual makefile on my verilog?

Or should I change gowin_pack.py, replacing the lines indicated with place_lut(..)... Would that allow me to use LUT input names or the 3 ALU inputs from chipdb.py? And how would I get to this from verilog then?

Again, apologies for what may be dumb questions. I really appreciate your help.

@yrabbit
Collaborator

yrabbit commented Oct 1, 2024

The point is that it's relatively easy to experiment with the contents of the LUT, and that's what the mechanism I suggested does.
Adding an input is a completely different story: we would have to change the routing mechanism, which lives in nextpnr, and even before that we would have to change the script that translates 'our' chip database into one that nextpnr understands.

I don't know your level of familiarity with the nextpnr sources, but, for example, individual ALUs are connected into clusters. If you have no problems with this piece of code, then of course you can add an input:

https://github.com/YosysHQ/nextpnr/blob/master/himbaechel/uarch/gowin/pack.cc#L1033-L1217

@stacksmith
Author

Just to be clear, you are saying that I can change the packer to modify LUT contents, but then I still have only 3 inputs to work with?

If I start that way and replace the lines indicated with a place_lut call, do I just create an ALU instance, but add an .INIT?

@yrabbit
Collaborator

yrabbit commented Oct 1, 2024

Just to be clear, you are saying that I can change the packer to modify LUT contents, but then I still have only 3 inputs to work with?

yes

If I start that way and replace the lines indicated with a place_lut call, do I just create an ALU instance, but add an .INIT?

yes, yosys may screw up, but you can always put this parameter directly in JSON after nextpnr.

To a first approximation, if you're serious about ALU, the steps are roughly as follows:

  • change the file <path to yosys>/gowin/cells_sim.v so that the ALU has another input;
  • compile yosys, check for errors on a test example;
  • modify apycula/chipdb.py;
  • generate a new apicula chip database;
  • in nextpnr sources change file https://github.com/YosysHQ/nextpnr/blob/master/himbaechel/uarch/gowin/gowin_arch_gen.py so that your new input will be included in nextpnr chip database;
  • modify nextpnr's pack.cc;
  • ... if you got here and nextpnr generates the JSON you need, then fixing gowin_pack if necessary is a small thing ;)

@stacksmith
Author

Thank you so much! I have enough to work with now.

I am kind of serious -- here is a rare opportunity of a small change which is compatible, yet makes a hidden part of the circuit available to those who are up to the challenge.

With a full LUT and carry, this FPGA is almost as good as Xilinx... Well, they have a separate carry function generator but with a bit of ingenuity, you can make very interesting counters and ALUs.

I will work on the easy way and the JSON, and get a handle on the codebase, and see how feasible this is -- and maybe bug you some more later!

@stacksmith
Author

@pepijndevos -- do you by any chance have a trick that allows you to use the LUT in ALU mode, with all 4 inputs? How did you do what you described in the last paragraph?

@pepijndevos
Member

You can only use the full lut if the bottom 4 bits happen to align with what you need in the second lut since they share those bits. So outside of super crafty hacks and lucky chances you can really only use 3 inputs.

What I did can be achieved by simply adding a new ALU mode with the contents as described.

It could be worthwhile trying to understand the packer. It takes the yosys ALU primitive and, iirc, breaks it down into the flag and LUT primitives that gowin_pack actually deals with. So you might bypass the ALU primitive and directly create the constituent LUTs with all four inputs.

Fwiw I have played around a bit with trying to coax the alu into new and useful tricks but wasn't able to come up with something substantially more useful than the Gowin modes.

@stacksmith
Author

@pepijndevos: What do you mean by 'the second lut'? Can you give me a few verilog lines that would instantiate what you are talking about? Like, what do I connect the 4th input to?

You mention above that you can program it any way, just not using the vendor's primitive. Which primitive can I use?

@yrabbit -- the simple way above did not work, as there seem to be other ALU cells being placed in my simple verilog test. I had to identify the specific ALU (by using a fake ALU_MODE "SPECIAL", which I replace with 0 and pass to place_lut). Seems to do something different from addition, now I have to figure out exactly what!

@yrabbit
Collaborator

yrabbit commented Oct 1, 2024

Well, there is little point in describing all the features at once, before you've gotten your feet wet, figuratively speaking :)
nextpnr analyzes how you use the carry and adds ALUs that set the initial CIN value for the entire chain and/or bring COUT out into the regular routing fabric. Also, do not forget that ALUs always work in pairs: an even and an odd one switch to either LUT or ALU mode at the same time, since physically there is only one fuse responsible for this. nextpnr is forced to take this into account and pads your ALUs up to an even count if you have managed to use an odd number of them.
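
The pairing rule can be pictured with a toy model. This is only a sketch of the rule as described here, not nextpnr's actual code; `alu_pairs` is a made-up name, and it assumes that ALUs 2k and 2k+1 within a cell share one mode fuse.

```python
# Toy model of the pairing rule described above, not nextpnr's actual code.
# Assumption: ALUs 2k and 2k+1 in a cell share one LUT/ALU mode fuse.

def alu_pairs(used):
    """Return (pairs switched to ALU mode, ALU indices added as padding)."""
    pairs = sorted({i // 2 for i in used})
    covered = {2 * p for p in pairs} | {2 * p + 1 for p in pairs}
    padding = sorted(covered - set(used))
    return pairs, padding

if __name__ == "__main__":
    # Three ALUs in use: pairs 0 and 1 switch modes, ALU 3 is padding.
    print(alu_pairs([0, 1, 2]))
```

This is why a simple test design can end up with more ALU cells in the placed netlist than were instantiated.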

@stacksmith
Author

stacksmith commented Oct 2, 2024

At this point I've isolated the test to the _pnr.json file containing a normal ALU test case. I manually change ALU_MODE to "SPECIAL" and add an "INIT", and let the doctored gowin_pack set ALU_MODE back to 0 and call place_lut(..). I am able to insert the init 0011000011001100 and simulate an adder. I am trying to figure out how it works, and why those 4 zeros are there in bits 11:8.
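
The manual JSON edit could be scripted roughly like this. It is a sketch under assumptions: the netlist is taken to follow the Yosys-style JSON layout (modules, then cells, then parameters), the cell type is assumed to be "ALU", and `doctor_alu` and the cell names are hypothetical. "SPECIAL" is only the marker used in this thread, so the result is meaningful only to a gowin_pack patched to recognise it.

```python
import json

# Sketch of hand-editing a nextpnr output netlist (Yosys-style JSON).
# "SPECIAL" is the marker used in this thread; stock gowin_pack does not
# know it, so this only makes sense together with a patched packer.

def doctor_alu(netlist: dict, cell_name: str, init: str) -> dict:
    """Tag one ALU cell with a custom INIT, leaving other cells alone."""
    if len(init) != 16 or set(init) - {"0", "1"}:
        raise ValueError("INIT must be 16 characters of 0/1")
    for module in netlist["modules"].values():
        cell = module["cells"].get(cell_name)
        if cell is not None and cell["type"] == "ALU":
            cell["parameters"]["ALU_MODE"] = "SPECIAL"
            cell["parameters"]["INIT"] = init
            return netlist
    raise KeyError(f"no ALU cell named {cell_name!r}")

if __name__ == "__main__":
    # Tiny stand-in for a real *_pnr.json file; the cell names are made up.
    demo = {"modules": {"top": {"cells": {
        "alu0": {"type": "ALU", "parameters": {"ALU_MODE": "0"}},
        "lut0": {"type": "LUT4", "parameters": {"INIT": "0" * 16}},
    }}}}
    doctor_alu(demo, "alu0", "0011000011001100")
    print(json.dumps(demo["modules"]["top"]["cells"]["alu0"], indent=2))
```

Selecting the cell by name keeps the other ALUs that nextpnr inserts in their stock modes.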

I'm not even dealing with nextpnr for now.

The terminology is very confusing: Gowin's own ALU module in prim_syn.v shows (SUM, COUT, I0, I1, I3, and CIN). How does that map onto the LUT inputs? Where does I2 go? And what about the A, B, C and, confusingly, D inputs in the doc? If I0 is A, my counter should not work (I feed the flop output into I0 and use CIN to increment; according to the doc, A doesn't even count). It seems there is some remapping in various places which I can't find.

Also, nextpnr-gowin does not work with normal location constraints such as "INS_LOC "lu" R14C4[0][A]", so I can't fix my alu to work with it. I've been using nextpnr-himbaechel, which seems to do things differently.

@yrabbit
Collaborator

yrabbit commented Oct 2, 2024

I2 = 1 and there is some renaming for mode 0:

https://github.com/YosysHQ/nextpnr/blob/b3b239289332395d4ea0a687b14faf841a499415/himbaechel/uarch/gowin/pack.cc#L1154-L1165

LUT inputs <=> alu inputs:

apicula/apycula/chipdb.py

Lines 381 to 387 in 4f87247

bel.portmap = {
    'COUT': f"COUT{alu_idx}",
    'CIN': f"CIN{alu_idx}",
    'SUM': f"F{alu_idx}",
    'I0': f"A{alu_idx}",
    'I1': f"B{alu_idx}",
    'I3': f"D{alu_idx}",

and do not use nextpnr-gowin, please.

@stacksmith
Author

Thank you! I was using mode 0, and the remapping was driving me crazy!

@pepijndevos
Member

Okay before you go overboard let's work out your program counter.

alu_logic

Starting from regular add

add(0)    0011000011001100  A:-  B:I0 C:1 D:I1 CIN:0

You can see this is adding B to D with C tied high and A unused. High C selects the first and third nibble, leaving the second and fourth nibble unused, with the fourth nibble being used for the lut2 in the picture, selecting B always.

So we can tell D selects the high or low byte and B selects the first or second pair, forming B XOR D. We want to maintain this, but can look into using A and C.

We can use C to select A. Step 1 is to put 0011 in the second nibble, so that C is now completely ignored. For C=0 we then still have the XOR, with the lut2 nibble serving double duty. Then we want to change the C=1 nibbles to just select A, so 1010. For a total of

1010001110101100
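
The same string can be derived mechanically from the intended function, which is a handy cross-check when constructing other bitmaps. A minimal sketch, with `build_init` and `mux_or_xor` as made-up names, and assuming INIT is written MSB first and indexed by D*8 + C*4 + B*2 + A:

```python
# Illustrative sketch; `build_init` is a made-up helper, not apicula API.
# Assumption: INIT is written MSB first and indexed by D*8 + C*4 + B*2 + A.

def build_init(fn) -> str:
    """Build a 16-character INIT string from fn(a, b, c, d) -> 0 or 1."""
    bits = []
    for index in range(15, -1, -1):  # emit the MSB first
        a = index & 1
        b = (index >> 1) & 1
        c = (index >> 2) & 1
        d = (index >> 3) & 1
        bits.append(str(fn(a, b, c, d) & 1))
    return "".join(bits)

def mux_or_xor(a, b, c, d):
    """C=1 passes A through; C=0 keeps the adder's B XOR D."""
    return a if c else (b ^ d)

if __name__ == "__main__":
    print(build_init(mux_or_xor))
```

With that ordering the script prints 1010001110101100, matching the value worked out by hand.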

Your question of whether you can ignore the carry has a simple answer: no. You can see in the figure that it's hardwired and bypasses the LUT. Having a fast carry chain without passing through the LUT is the entire point.

So that poses a problem for our sum output which will always have cin xored into it. What's more, our current mux implementation still drives cout from A+D which will then get xored into the next alu. Can we at least stop outputting cin? It's complicated...

@pepijndevos
Member

The carry section is a mindfuck because it's asymmetric but somehow works out. The way to think about it is that the lut4 output into the and gates is a mux. It either selects the carry in or the lut2, which just selects B. Since the lut4 is xor it selects the lut2 when both inputs are zero or one, in which case we want cout=A=D (the sum is 0 or 3) and in case the lut4 is 1, we select cin and xor that into the sum for a sum of 1 or 2.

So we control the lut2 but it's no good since if our lut4 outputs a 1, it is ignored completely.
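
The carry behaviour described here can be captured in a small behavioural model. This is a simplification based only on the description in this thread, not on the hardware or on apicula code: it assumes INIT is MSB first and indexed by D*8 + C*4 + B*2 + A, that the lut2 is the lowest nibble indexed by B*2 + A, that SUM is the lut4 output XOR CIN, and that COUT is CIN when the lut4 outputs 1 and the lut2 output otherwise. All function names are made up.

```python
# Simplified behavioural model of one ALU cell as described in this thread.
# Assumptions: INIT is MSB first, indexed by D*8 + C*4 + B*2 + A; the
# "lut2" is the lowest nibble of INIT, indexed by B*2 + A.

def lut4(init, a, b, c, d):
    return int(init[15 - ((d << 3) | (c << 2) | (b << 1) | a)])

def lut2(init, a, b):
    return int(init[15 - ((b << 1) | a)])

def alu_cell(init, a, b, c, d, cin):
    """Return (sum, cout) for one cell."""
    sel = lut4(init, a, b, c, d)
    s = sel ^ cin                              # SUM always has CIN xored in
    cout = cin if sel else lut2(init, a, b)    # carry mux driven by the lut4
    return s, cout

def add(init, x, y, width=4):
    """Ripple `width` cells: x feeds B, y feeds D, C is tied high, A is 0."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = alu_cell(init, 0, (x >> i) & 1, 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

if __name__ == "__main__":
    ADD = "0011000011001100"
    print(add(ADD, 5, 6))
```

Under these assumptions the ADD string behaves as a ripple-carry adder, the all-zero C2L string passes CIN to SUM while clearing the carry, and the ONE2C string forces COUT to 1, which lines up with the comments in the chipdb.py mode table.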

You could try to get really crafty but in the end it seems to always come down to this: if you want to use a hardwired carry chain for things other than adding don't use a hardwired carry chain.

For example you could try to fish out the Lut output before the xor but at the point you'd need another mux to select the right one so you gain nothing over the straightforward solution.

Or you could break your brain over a mode that uses A to select B or C so your lut2 sees the mux select. Say for example if A=1 lut4=0 so sum=cin. But now there isn't anywhere for C to go so that's no good. We could make the lut2 A AND B so then we can mux B to cout. But now we broke the full adder because we're adding C+D and now the carry is wrong. And okay we can mux B to the cout and then what, we've just performed a bit shift I guess??

So idk feel free to play around with it, it's fun to think through, but I haven't found anything more interesting than the one at the bottom of the docs.

@pepijndevos
Member

So I guess maybe some things you could do are

  • make an alu with a reset that outputs all zero
  • make like an add/shift where sum=cin, cout=A

@stacksmith
Author

stacksmith commented Oct 2, 2024

@pepijndevos: appreciate your writeup, and will work through it. Immediately have a question: you say "step 1 is put 0011 into the second nybble", then show the number as 1010_0011_1010_1100... so by second you mean bits 11:8?

What is your tooling to test these things? Do you just modify the json?

I'm coming from Xilinx, where I used the carry chain for all kinds of things. For instance, I would make an SRL16 shift register from a LUT, load a single bit into it, and push it out of COUT to create a pulse generator. I would stack two of them with mutually prime loop size, and use the carry from both to detect when both hit a 1, to replace really long counters in a single slice. With 5 Spartan3 slices I could generate a VGA signal from 100MHz -- hsync, vsync, front/back porch, start-scanline pulse. It seemed like it was impossible to get carry to do the right thing, but somehow there always was a way!

@pepijndevos
Member

pepijndevos commented Oct 2, 2024

Yeah I filled the zeros. In this case I was counting the human way, left to right starting at 1.

I used to make ipython notebooks to try things out or yea just whatever it takes. Modify the json, modify the packer, modify nextpnr.

I think it might indeed be possible to build a shift register, which I kinda hinted at at the end of my ramblings.

@stacksmith
Author

stacksmith commented Oct 2, 2024

@pepijndevos, so the SUM output is post the XOR? Is there a way to get at the LUT output?

The shift register you mention -- it would not cycle the contents of the LUT INIT, like Xilinx does, at least I don't see how. Without a register, feeding back in would be more like a ring oscillator going as fast as it can. With a register, you are back at one bit per register. Am I missing something? Oh, I think I see, you can do a multi-bit 'combinatorial' shift via carry without registers, but it doesn't really buy much...I don't think you can use the registers for anything else, can you?

I've read all I could about the so-called 'ram-based shift registers' in Gowin docs, but can't really make sense of them. They create encrypted IP, and seem to imply that you can use LUTs (as in memory?), registers, or blockram, but I don't quite see how it's anything other than what you can do with a few lines of obvious verilog.

Maybe there is some way to shift an SSRAM cell and I would love to see how. Xilinx was very generous exposing the shift machinery (which is probably used to configure the LUTs).
