Skip to content

Latest commit

 

History

History
576 lines (471 loc) · 23.4 KB

readme.md

File metadata and controls

576 lines (471 loc) · 23.4 KB

Local Register Allocator for 8086

Table of contents

What is this?

Motivation

Scope

Basic algorithm

8086-specific approach

Resources

What is this?

This is an implementation of a greedy bottom-up local register allocator for the intel 8086 CPU. It may be used as part of a simple compiler. In particular, it can be used in a C compiler whose int is 16-bit and char is 8-bit.

To make things interesting and interactive it comes with a code generator capable of generating 8086 assembly code (compilable into DOS .COM programs) from trees representing expressions with 16-bit integer ALU operations, constants and memory loads and stores. A few extra instructions generated by the code check the results of expression evaluation in the output register and the output memory locations, if any.

Thus you can actually execute the generated code and tweak things around to see how your changes affect code generation.

Sample assembly output from the register allocator/code generator:

;         7    (vr4)
;     mul     (vr5)
;         5    (vr3)
; add     (vr6)
;         3    (vr1)
;     mul     (vr2)
;         2    (vr0)
; ----
; Regs needed (approximately): 3
; --------
    ;                           ; vr0
    mov  ax, 2                  ; vr0
    mov  cx, 3                  ; vr1
    mul  cx                     ; vr2
    mov  cx, 5                  ; vr3
    mov  dx, 7                  ; vr4
    xchg ax, cx                 ; vr5
    mul  dx                     ; vr5
    add  cx, ax                 ; vr6
    ;                           ; vr6
    cmp  cx, 41                 ; vr6
    jne  failure                ; vr6

N.B. This register allocator will not allocate registers perfectly in all circumstances. It will occasionally generate unnecessary register moves and exchanges. Bear in mind, optimal register allocation is an NP-complete problem.

Motivation

The intel 8086 architecture has a number of instructions with fixed input and/or output registers and it's not entirely trivial to connect the ins and outs of these instructions. There's obviously a long history of implementing compilers for this and similar architectures, but it's a bit odd that it's hard to find a conceptually simple algorithm like this. The "Dragon" book and many other sources either consider architectures without instructions with such fixed register operands or solve the problem using graph coloring, which for certain reasons may be impractical.

The intel 8086 architecture survives to this day in the modern intel Pentium CPUs that are compatible descendants of the old 8086 and with it survive some of the instructions with fixed in/out register operands. So, the problem persists to this day, although there are fewer limitations to deal with when using newer 32-bit and 64-bit x86 instructions and registers today.

8086 registers by special uses in expression evaluation (pardon the mixture of C and assembly operators, instructions and syntax and ASCII art):

         +-- can be freely used in most ALU instructions as src/dst
         |   +-- can be used as address to access memory
         |   |     +-- has individual 8-bit subregisters
         |   |     |   +-- can be sign-extended with cbw
         |   |     |   |      +-- can be shifted
         |   |     |   |      |     +-- can be shift count
         v   v     v   v      v     v
   +-<&|^~  [r] 8bit cbw  <<dst <<cnt  *dst *src  /dvd /dvsr  /quo /rem
ax       +         +   +      +        fix+ fix+  fix+        fix+
bx       +   +     +          +                +           +
cx       +         +             fix+          +           +
dx       +         +          +                +                   fix+
si       +   +                +                +           +
di       +   +                +                +           +
                                          ^    ^     ^     ^     ^    ^
                         can be product --+    |     |     |     |    |
                         can be multiplicand --+     |     |     |    |
                                   can be dividend --+     |     |    |
                                          can be divisor --+     |    |
                                               can be quotient --+    |
                                                   can be remainder --+

Scope

Outside of the scope of this work are:

  • data types other than 8-bit bytes and 16-bit words (e.g. floats, structs)
  • conditional/ternary operators like in C/C++
  • function calls and calling conventions (however, a simple implementation is included)
  • use of immediate and memory operands directly in ALU instructions (we use separate load/store instructions instead)

These are left as an exercise to the reader.

Basic algorithm

Before we get to the 8086 specifics, let's describe the algorithm for an architecture, where there aren't instructions with fixed register operands. We'll be building on top of it.

Let's also consider a case, where our expressions consist of only binary operators (unary operators will be a trivial specialization), which are realized as 3-register instructions in the CPU, that is, 2 input registers and 1 output register, much like in the MIPS architecture (e.g. sub r2, r3, r4 will do r2 = r3 - r4;).

Suppose we have an expression tree like this:

              op
              -
      op              op
      +               |
  op      op      op      op
  ^       *       /       %
nm  nm  nm  nm  nm  nm  nm  nm
1   2   3   4   5   6   7   8

op and nm represent operator and numeric nodes in the tree.

Our hardware register allocation will be done recursively from the root towards the leaves:

AllocHRegs(node):
  recursively call AllocHRegs() for both input/child nodes
  Ensure() that both input/child values are in registers
  Free() both input registers
  Allocate() one output register
  generate an instruction using these 3 register operands

For the numeric leaf node there will only be the last two steps: Allocate() for the output register and the generation of an instruction to load an integer constant into that register.

The utility functions will be:

Ensure(node):
  if node's value has no location yet:
    Allocate() a register for it and return the register
  else if node's value is already in a register:
    return that register
  else: /* the node's value must have been spilled onto the stack */
    Allocate() a register
    generate a pop instruction to pop a value from
      the stack into this register
    return this register

and

Allocate(node):
  if there's a vacant register among the N registers the allocator manages:
    mark it as holding the node
    mark the node as being in this register
    return the register
  else:
    from the N registers managed by the allocator find the register whose
      value won't be needed the longest, that is, whose parent/using node
      is the most distant
    mark the node that that register holds as spilled
    generate a push instruction to push the node's value from the
      register onto the stack
    mark the register as holding the argument node passed into Allocate()
    mark the node as being in this register
    return the register

and

Free(hardware_reg):
  mark the node this register holds as not having a location
  mark the register as holding no node

Using this algorithm we will arrive at this code generated for the above tree (let's replicate the tree here):

              op
              -
      op              op
      +               |
  op      op      op      op
  ^       *       /       %
nm  nm  nm  nm  nm  nm  nm  nm
1   2   3   4   5   6   7   8

mov  hr0, 1
mov  hr1, 2
xor  hr0, hr0, hr1
mov  hr1, 3
mov  hr2, 4
mul  hr1, hr1, hr2
add  hr0, hr0, hr1
mov  hr1, 5
mov  hr2, 6
idiv hr1, hr1, hr2
mov  hr2, 7
mov  hr3, 8
irem hr2, hr2, hr3
or   hr1, hr1, hr2
sub  hr0, hr0, hr1

This code needs 4 registers (hr0 through hr3).

Spilling

If we restrict the number of registers that the allocator can manage to 3, we'll get this instead (let's replicate the tree one more time):

              op
              -
      op              op
      +               |
  op      op      op      op
  ^       *       /       %
nm  nm  nm  nm  nm  nm  nm  nm
1   2   3   4   5   6   7   8

mov  hr0, 1
mov  hr1, 2
xor  hr0, hr0, hr1
mov  hr1, 3
mov  hr2, 4
mul  hr1, hr1, hr2
add  hr0, hr0, hr1
mov  hr1, 5
mov  hr2, 6
idiv hr1, hr1, hr2
mov  hr2, 7
push hr0
mov  hr0, 8
irem hr0, hr2, hr0
or   hr0, hr1, hr0
pop  hr1
sub  hr0, hr1, hr0

Note the appearance of the push and pop instructions in the above code and the disappearance of register hr3.

The allocator runs out of registers once it has loaded 7 into hr2 and is about to load 8 into another register. At this point there are 3 partial results occupying all 3 registers:

hr0 = result of +
hr1 = result of /
hr2 = 7

We spill hr0, the result of +, onto the stack to make hr0 available for loading of 8. We choose hr0 to spill because its parent, -, is the most distant (| and %, the parents of / and 7, are closer). When we finally need the result of +, we pop it from the stack.

The relative distance of parent nodes can be obtained by simple recursive assignment of unique numbers to all nodes:

                     op
                     -
         op                      op
         +                       |
   op          op          op          op
   ^           *           /           %
nm    nm    nm    nm    nm    nm    nm    nm
1     2     3     4     5     6     7     8

0  2  1  6  3  5  4  14 7  9  8  13 10 12 11 <- node number / virtual reg

Each node then will have this unique number (we may refer to it as the virtual register) and we can either follow the node's parent link to find the parent's unique number or we may simply store the parent's unique number in the child node instead of storing there the link.

And so in this example of ours, 14 is the largest of the three numbers (14, 13 and 12), which is how we decide to kick the hr0 value out onto the stack.

Spilling the register that has the most distant use guarantees that unspilling will occur in the exact opposite order (LIFO), which lets us use the stack with push and pop instructions. Specifically, with operators being at most binary (not ternary and so on), we'll never spill two sibling nodes, that is, we'll never have to distinguish equally distant nodes.

When implementing this algorithm we'll need a global array of N pointers to the expression tree nodes, where N is the number of the registers that the register allocator manages. (This array is known as NodeFromHReg[] in our code.) By examining the contents of this array we'll know if a particular hardware register is occupied by the result of some node. The nodes themselves will carry the location of their results (that is, it can be one of these N hardware registers or the stack or nowhere yet). IOW, with such a setup we'll always be able to tell what's where, by examining the array or a node and we'll be modifying the array and nodes appropriately to reflect the current allocations throughout the process.

Evaluation order of subexpressions

When evaluating a binary operator (e.g. a + b) we should first evaluate the child/subexpression that uses more registers, then the one that uses fewer. This leads to more effective register use and fewer spills. The logic is simple: before you can evaluate the binary operator, you need to hold the result of one of its children in a register while evaluating the other child. So, if the children need, let's say, 1 and 2 registers each to evaluate, then evaluating first then second will need 3 registers, whereas evaluating them in the opposite order will need 2 registers.

When assigning unique node numbers (virtual register numbers) as described in the preceding section, we can assign them in this order and we can then handle sibling nodes in this order by looking at and comparing their unique numbers.

See Ershov number and Sethi-Ullman algorithm.

2-register instructions

Not all CPU architectures offer 3-register ALU instructions like the MIPS does. The x86 architecture's ALU instructions are 2-register. That is, the output overwrites one of the inputs.

This affects the previously described AllocHRegs() function a little. It has to be able to allocate a specific output register through the Alloc() function such that Alloc() would allocate the same register as one of the inputs. So, for example, if the inputs to the add instruction are in, say, ax and cx, the output will be in ax if we generate add ax, cx (or the other way around, the output will be in cx if we generate add cx, ax; keep in mind that not all binary operators are symmetric).

8086-specific approach

The 8086-specific idea is:

  • When an instruction at a node needs its inputs in specific registers, it requests its child nodes to produce their outputs in desired registers (this is done recursively through the familiar AllocHRegs() function). However, they may be unable to satisfy the requests due to their own limitations (that is, they can never produce the outputs elsewhere naturally) or because the desired registers already hold something useful. And so the child nodes do what they can and leave it to their parent to do the rest when they can't satisfy the request.
  • Thankfully, it's always possible to swap values in a few registers with the xchg instruction to satisfy the needs on the input side and such swaps aren't needed too often.

So the AllocHRegs(), Ensure() and Allocate() functions introduced earlier need to receive an additional parameter, the desired register. It should be possible to use this parameter to request a specific register, any register or a register from a specific subset of all allocatable registers.

Currently we're using this enumeration to specify a register:

enum HReg : int
{
  // Specific regs begin
  HReg0, HRegAX = HReg0,
  HReg1, HRegCX = HReg1,
  HReg2, HRegDX = HReg2,
  HReg3, HRegBX = HReg3,
  HReg4, HRegSI = HReg4,
  HReg5, HRegDI = HReg5,
  // Specific regs end
  HRegCnt,
  // Constants representing multiple-choice desired regs:
  HRegAny = HRegCnt, // unspecified, any reg at all is OK
  HRegNotCX, // prefer regs other than cx (for shifts)
  HRegNotDXNotAX, // prefer regs other than dx and ax (for (i)div)
  HRegNotDXNotCXNotAX, // prefer regs other than dx, cx and ax
  HRegByte, // prefer those with individual byte components: ax, cx, dx, bx
  HRegByteNotCX, // prefer those with individual byte components except cx: ax, dx, bx
  HRegAddr, // prefer those that can be a memory operand: bx, si, di
};

The Allocate() function will try to allocate the requested/desired register first, but if it fails to, it'll return a different one and the caller will need to deal with it (e.g. use the xchg instruction).

Shifts

The 8086 shift instructions (shl, shr, sar) take the shift count from the fixed register cl (lower half of cx) if the count isn't a simple 1. The value that they shift can be anywhere else.

Hence the implementation of AllocHRegs() for shift instructions should pass its own desired register parameter (possibly modified to exclude cx) to the left child's AllocHRegs() and Ensure() while it should pass HRegCX as the desired register to the right child's AllocHRegs() and Ensure(). Btw, when both children need the same number of registers to compute we should prefer to handle the right/count child first, to have higher chances of allocating cx for the count.

       desired register may pass modified to left child
       |
       v
shift dst, cl
       |    |
       |    v
       v    desired reg = cx
       desired reg excludes cx

Ensure() may fail to allocate and return HRegCX for the right child. When this happens, we either move the right child node from that other register to cx (if cx is vacant) or we exchange the registers between that node and the one currently residing in cx (when cx isn't vacant). For this we update the node(s) and the global array of pointers to nodes (NodeFromHReg[]) and generate the mov or xchg instruction. We should also remember that Ensure() may allocate cx for the left child, in which case we need to properly update the left register.

         ^
         |
  shift dst, cl
         ^    ^
         |    |
case 1: ?x   cx  Nothing to do
case 2: cx   ?x  Swap left and right
case 3: ?x   ?x  Swap right and cx (or move right to cx)

(i)mul

The (i)mul instruction multiplies ax by another input and outputs the product to a pair of registers, dx:ax (most significant bits:least significant bits). That is, the product is twice as wide as either multiplicand. However, for our purposes, for 16-bit multiplicands whose product isn't expected to need more than 16 bits, we can treat dx as a clobbered register.

(i)mul therefore disregards its caller's desired register since (i)mul always outputs to ax. (i)mul, being symmetric, also asks both its child/input nodes to produce their results in ax in the hopes that at least one would end up in ax.

        desired reg is ignored because output is in ax
        |
        v
(i)mul ax, src
        |   |
        |   v
        v   desired reg = ax
        desired reg = ax

If either of (i)mul's inputs ends up in dx, it's kept there, since it can be conveniently clobbered. However, if none of the inputs is in dx, dx may need to be explicitly preserved (if dx holds some other partial result that we don't want (i)mul to clobber).

To preserve dx we exchange the registers between one of the input nodes and the node in dx. IOW, we reduce this to having one of the inputs in dx, which is safe to clobber.

So, through at most 2 register exchanges we can make one input in ax and the other, if needed, in dx.

N.B. the 16 least significant bits of a product of two 16-bit integers are the same for both signed and unsigned multiplication and so on the x86 we can use mul and imul interchangeably if we're interested in just those 16 bits of the product.

(i)div

Computing quotients and remainders with (i)div is overall trickier.

The instruction divides a pair of registers, dx:ax (most significant bits: least significant bits), that is, a 32-bit integer, by a 16-bit input and the outputs are in dx (the remainder) and ax (the quotient). When we need to divide 16-bit integers we zero- or sign-extend the dividend from 16 bits to 32 bits, filling dx with the extension. dx being part of the dividend naturally prevents us from using dx for the divisor.

But first, we need the dividend in ax, for which we pass HRegAX as the desired register into AllocHRegs() and Ensure() for the left input node. We use HRegNotDXNotAX (meaning any register but dx or ax) as the desired register for the right input node.

         desired reg is ignored because output is in ax or dx
         |
         v
(i)div dx:ax, src
           |   |
           |   v
           v   desired reg excludes dx and ax
           desired reg = ax

Then through at most one xchg we can make sure that the dividend is indeed in ax.

If we're unlucky and Ensure() (or the xchcg) gives us dx for the divisor or dx contains some other partial result, we Allocate() another register (and immediately Free() it, IOW, we just find an available register, spilling if necessary) and move to it whatever node is in dx, thereby freeing dx from anything useful.

Finally, when we get to allocate the output register for (i)div, we allocate ax if we're interested in the quotient or we allocate dx if we're interested in the remainder.

N.B. when computing how many registers an (i)div-based division/remainder node needs, we factor in the above restriction around dx. This raises the minimum from 2 to 3 registers, but otherwise the computation remains the same as for e.g. add. The order of evaluation and the node numbering described earlier should take this into account.

Loads

To load a value from memory we need the memory address in one of 3 suitable registers: bx, si, di. So, the child/input node supplying the address will get its AllocHRegs() and Ensure() called with HRegAddr as the desired register. If we're unlucky to get the address in one of those registers, we can use xchg, as usual, to bring the address into e.g. bx.

A 16-bit value can be loaded into any 16-bit register, but an 8-bit value can only be loaded into individual 8-bit subregisters, e.g. al, bl, cl, dl (the desired register can be set to HRegByte when we'd like a register with 8-bit subregisters). Further, if we intend to sign-extend an 8-bit value from memory to 16 bits with the cbw instruction, we need to load the value into al.

On top of that we want to load the value into the desired register if we can do it directly. So, before we call Allocate() to allocate the output register we're going to examine the desired register value and see if it's available and suitable. Again, if Allocate() doesn't give us a register with 8-bit subregisters when we need one, we may use xchg to get us ax/al for the output register.

    desired reg may be ignored and 
    chosen from regs that have 8-bit subregs
    |
    v
mov r, [r]
        |
        v
        desired reg can address memory

Stores

Stores are a bit simpler than loads, but they have the same fundamental requirements of the memory address being in one of the 3 suitable registers (bx, si, di) and the 8-bit value being in one of the 4 suitable registers (ax/al, bx/bl, cx/cl, dx/dl) and we may need to mov or xchg to satisfy these requirements.

         desired reg may be ignored
         |
         v
mov [r], r
     |   |
     |   v
     v   desired reg may need to have 8-bit subregs
     desired reg can address memory

N.B. the store node's result (besides the direct effect of writing to memory) can be the value that's being stored similar to how we can write a = b = c; in C/C++ to use the result of one assignment (b = c) as a value for another assignment (a = ...).

Resources

  • See Ershov number, Sethi-Ullman algorithm.
  • Engineering a Compiler 2nd ed. by Keith D. Cooper and Linda Torczon. In particular, section 13.3.2 "Bottom-Up Local Register Allocation".
  • Compiler Construction by William M. Waite and Gerhard Goos. In particular, section 10.2.2 "Targeting".
  • Compilers Principles, Techniques, and Tools (AKA the "Dragon" book) by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman.