CPU Multipliers

I’ve been working on my own RISC-V CPU to learn SpinalHDL and to get some hands on experience with formal verification.

While doing that, I started thinking about the 32x32 bit multiplier that’s part of some of today’s RISC-V cores.

There are different ways to go about this:

  • No HW Multiplier

    If you’re really area constraints, you simply don’t implement it at all, and you let the C compiler call a multiplier library function.

  • Iterative Multiplier

    If you want things to be a little faster, you implement an iterative multiplier which iterates through all the bits of one of the operands and conditionally adds a shifted version of the other operand. The picorv32 CPU code for that is here. And here’s the equivalent VexRiscV code.

    In most cases, this kind of multiplier will do one bit at a time, but that’s not really necessary. In the picorv32 code, you can specify this by setting the desired STEP_AT_ONCE parameter, or the mulUnrollFactor of the VexRiscV.

  • Parallel Multiplier

    If you really want performance, you uses a parallel multiplier which takes in that 2 operands completely and spits out a result. The number of cycles between input and output is often more than 1, but on a reasonably efficient and pipelined CPU (and in the absense of pipeline stalls), this can result in 1 result per clock cycle.

    The picrov32 CPU has the fast_mul module and the VexRiscV has the MulPlugin for that.

When you’re using a PicoRV32 or VexRiscv RISC-V CPU core, you can freely select which version to use.

I was mostly interested in the third option.

RISC-V Multiply Instructions

The RISC-V 32-bit instruction set has 4 multiply instructions:

  • MUL rd, rs1, rs2

    This multiplies operands rs1 and rs2 and stores the lower 32 bits of the result in rd.

    Note that there are no signed/unsigned variants: it doesn’t matter whether rs1 and rs2 are signed or unsigned because the result is the same. You can quickly see this on a 2-bit x 2-bit example:

      Unsigned: 2'b11 x 2'b11 =  3 x  3 =  9 = 4'b1001 -> 2'b01
      Signed:   2'b11 x 2'b11 = -1 x -1 =  1 = 4'b0001 -> 2'b01
    
      Unsigned: 2'b01 x 2'b11 =  1 x  3 =  3 = 4'b0011 -> 2'b11
      Signed:   2'b01 x 2'b11 =  1 x -1 = -1 = 4'b1111 -> 2'b11
    

    When the result is truncated to the same number of bits as the size of the operands, the outcome is identical.

  • MULH rd, rs1, rs2

    multiplies 2 signed operands rs1 and rs2 and stores the upper 32 bits in rd.

  • MULHU rd, rs1, rs2

    multiplies 2 unsigned operands rs1 and rs2 and stores the upper 32 bits in rd.

  • MULHSU rd, rs1, rs2

    multiplies signed operand rs1 and unsigned r operands2 and stores the upper 32 bits in rd.

For RISC-V, if you want to get the 64-bit result of a 32x32 bit multiplication, you need to use 2 instructions: MUL, and one of the MULH variants.

RTL Implementation on the PicoRV32

On the PicoRV32, when you strip away all the extraneous book keeping, the key part of the implementation is straightforward:

    rd <= $signed(rs1) * $signed(rs2);
    assign pcpi_rd = shift_out ? rd >> 32 : rd;

Of note is that rs1 and rs2 are 33-bit, not 32-bit, sized. As shown above in my 2x2 bit example, signed and unsigned operands require different multiplier behavior if you’re interested in the full result. One can avoid having different 3 different multipliers (one for each MULH variant) by expanding the operands with 1 bit additional MSB. This MSB is set to 0 for unsigned operands or a copy of the sign bit for signed operands.

The code above requires a 33-bit x 33-bit to 64-bit multiplier, with a multiplexer at the end to select the upper or lower 32 bits.

RTL Implementation on the VexRiscv

The relevant code can be found in MulPlugin.scala

    execute plug new Area {
      ...
      val aULow = a(15 downto 0).asUInt
      val bULow = b(15 downto 0).asUInt
      val aSLow = (False ## a(15 downto 0)).asSInt
      val bSLow = (False ## b(15 downto 0)).asSInt
      val aHigh = (((aSigned && a.msb) ## a(31 downto 16))).asSInt
      val bHigh = (((bSigned && b.msb) ## b(31 downto 16))).asSInt
      insert(MUL_LL) := aULow * bULow
      insert(MUL_LH) := aSLow * bHigh
      insert(MUL_HL) := aHigh * bSLow
      insert(MUL_HH) := aHigh * bHigh
    }

    memory plug new Area {
      insert(MUL_LOW) := S(0, MUL_HL.dataType.getWidth + 16 + 2 bit) + (False ## input(MUL_LL)).asSInt + (input(MUL_LH) << 16) + (input(MUL_HL) << 16)
    }


    writeBack plug new Area {
      val result = input(MUL_LOW) + (input(MUL_HH) << 32)

      switch(input(INSTRUCTION)(13 downto 12)){
        is(B"00"){
          output(REGFILE_WRITE_DATA) := input(MUL_LOW)(31 downto 0).asBits
        }
        is(B"01",B"10",B"11"){
          output(REGFILE_WRITE_DATA) := result(63 downto 32).asBits
        }
      }
    }

This looks more convoluted, because the multiplication has been been split up into 4 smaller multipliers. If we ignore the signed operations for a moment, we can split up a 32x32=64 operation into 4 16x16=32 operations:

    a[31:0] = { a1[15:0], a0[15:0] }
    b[31:0] = { b1[15:0], b0[15:0] }

    result = a * b = (a1<<16 + a0)(b1<<16 + b0) = a0*a0 + (a1*b0)<<16 + (a0*b1)<<16 + (a1*b1)<<32
                                                = a0*a0 + (a1*b0 + a0*b1)<<16       + (a1*b1)<<32

That’s what’s happening above.

The motivation of doing it this way is to spread the multiplication over multiple pipeline stages: the 4 16x16 multiplies happen in the execute stage, the addition of the 3 multiplications that contribute to the lower 32-bits happen in the memory stage. And the final addition (into the result signal) for the upper 32-bits happens in the writeBack stage.

Similar to the PicoRV32, 3 of the 4 16x16 multiplications are actually 17x17 bits to support both signed and unsigned operations. Most FPGAs from Xilinx and Intel/Altera have hard 18x18 multiplier macros in their DSP blocks, so it makes no difference to use 16x16 or 17x17 operations: they’re all mapped onto the same 18x18 hardware block anyway.

Multiplier Mapping onto FPGA

On relatively modern Altera FPGAs such as the Cyclone V series, there are 2 18x18 multipliers per DSP. In the best possible case, we’d only need 2 DSPs.

If we assume that Altera knows best how to best map larger multipliers onto smaller ones, then we can use the PicoRV32 mapping as an example on how to do things right.

After synthesis, what we see is that Quartus maps the logic to 3 DSPs instead of 2.

  • One DSP is used in (18x18 + 18x18) mode, where the output of two multipliers are added to eachother. This is the (a1*b0 + a0*b1) term that is implemented in discrete form on the VexRiscV.
  • Two DSPs are used in 18x18-only mode.

For the VexRiscv, the 4 multipliers are mapped to 4 DSPs.

Optimizing for MUL but not MULH*

In the vast majority of use cases, your C code will consist of int * int operations, where 2 32-bit integers are multiplied together and stores in a 32-bit integer as well. In other words: you’ll be using the MUL instruction.

If DSP resources are tight but you still want a fast multiplication for the most common operation, it makes sense then to optimize for that, and incur a performance penalty for MULH.

With a[31:0] = { a1[15:0],a0[15:0] } = a1a0 and b[31:0] = { b1[15:0],b0[15:0] } = b1b0, the full multiplication looks like this sum:

a*b=
        a0b0
      a1b0
      a0b1
    a1b1
   +========

   (a1b1)<<32 + (a1b0+a0b1)<<16 + a0b0

Note that each term axbx is 32 bits wide in the notation above, so you have 16 overlapping bits in these two sums: (a1b0+a0b1)<<16 + a0b0 and (a1b1)<<32 + (a1b0+a0b1)<<16.

If we only care about the bottom 32 bits of the result (MUL), it looks like this:

        |a0b0
      a1|b0
      a0|b1
    a1b1|
   +========

At the very minimum, we need 3 multiplications:

a0b0: 16x16 = 32 bits
a1b0: 16x16 = truncated to lower 16 bits
a0b1: 16x16 = truncated to lower 16 bits

And then we sum these 3 terms for the final 32 bit result.

So that’s the smallest fast implementation: 1 16x16=32 multiplier and 2 16x16=16 multipliers.

For an FPGA, there is no difference between a 16x16=16 or 16x16=32 multiplier so there is no savings on the DSP side. However, the final adder that sums the 3 terms together will only need to 32 bits. That’s very likely to be result in area and timing savings.

Even if we only optimize for MUL, we may still want to support MULH in hardware. If an acceptable trade-off is for this MULH to take 2 steps instead of 1, then it makes sense to first calculate the MUL with 3 16x16=32 bits operations in the first step, and calculate a 50-bit sum.

Then, in a second step, take the upper 18 bits [49:32] of this sum and add them to the 32-bit result of a1*b1.

We can reuse one of the 16x16 multipliers that were used in the first step, so the number of 16x16 multipliers is still 3 instead of the original 4.

For FPGAs that have hard 16x16 (or slightly larger) multipliers, that’s about the best we can do if we need a single step MUL.

But when using ASICs or FPGAs that don’t have hard multipliers (e.g. the Lattice iCE40 series), we can do better.

Reducing Logic Even More

Let’s imagine that we have 8x8 instead of 16x16 multipliers, and that

a[31:0] = { a3[7:0],a2[7:0],a1[7:0],a0[7:0] } = a3a2a1a0 and b[31:0] = { b3[7:0],b2[7:0],b1[7:0],b0[7:0] } = b3b2b1b0

A full 32x32 multiplication now looks like this:

a*b=
                a0b0
              a1b0
            a2b0
          a3b0
              a0b1
            a1b1
          a2b1
        a3b1
            a0b2
          a1b2
        a2b2
      a3b2
          a0b3
        a1b3
      a2b3
    a3b3

Rearranged:

            |    a0b0
            |  a1b0
            |  a0b1
            |a2b0
            |a1b1
            |a0b2
          a3|b0
          a2|b1
          a1|b2
          a0|b3
            |
        a3b1|
        a2b2|
        a1b3|
      a3b2  |
      a2b3  |
    a3b3    |

For a 32x32=32 result, we once again only care about terms with bits to the right of the vertical line, which means that the factors below the break are entirely not needed.

Using 16x16 multiplications, we could reduce the number of multiplications from 4 to 3 (-25%). When using 8x8 multiplications, we can reduce them from 16 down to 10 (-37.5%).

And if we still want to calculate MULH in 2 steps, then the result of the first step is 43 bits of which we can discard the lower 32 bits and roll over 11 bits to the second step.

We can arrange it still different:

            |    a0b0
            |  a1b0
            |  a0b1
            |a1b1       < swapped
            |a2b0       <
            |a0b2
          a3|b0
          a2|b1
          a1|b2
          a0|b3
            |
        a3b1|
        a2b2|
        a1b3|
      a3b2  |
      a2b3  |
    a3b3    |

The first 4 terms are a 16x16=32 multiplier, so we can implement this as 1 16x16=32bit multiplier and 6 8x8=16bit multipliers. If these kind of multipliers are available as a hard macro, this could be a solution.

Getting Ridiculous

If we can go from 16x16 to 8x8 multipliers, then we can also go even smaller, like 4x4=8, right?

Yes, that’s possible. But it’s not very useful. For one, there are no FPGAs that have these kind of small hard multipliers, so the exercise is academic. And second, when you start going this small, the timing path will long have moved from the multiplier to the adders that sum all these tiny terms together.

But for completeness, there’s how it works out:

a[31:0]=a7a6a5a4a3a2a1a0
b[31:0]=b7b6b5b4b3b2b1b0

a*b =
                    |            a0b0
                    |          a1b0
                    |        a2b0
                    |      a3b0
                    |    a4b0
                    |  a5b0
                    |a6b0
                  a7|b0
                    |          a0b1
                    |        a1b1
                    |      a2b1
                    |    a3b1
                    |  a4b1
                    |a5b1
                  a6|b1
                    |        a0b2
                    |      a1b2
                    |    a2b2
                    |  a3b2
                    |a4b2
                  a5|b2
                    |      a0b3
                    |    a1b3
                    |  a2b3
                    |a3b3
                  a4|b3
                    |    a0b4
                    |  a1b4
                    |a2b4
                  a3|b4
                    |  a0b5
                    |a1b5
                  a2|b5
                    |a0b6
                  a1|b6
                  a0|b7

                a6b2|
              a7b2  |
                a7b1|
                a5b3|
              a6b3  |
            a7b3    |
                a4b4|
              a5b4  |
            a6b4    |
          a7b4      |
                a3b5|
              a4b5  |
            a5b5    |
          a6b5      |
        a7b5        |
                a2b6|
              a3b6  |
            a4b6    |
          a5b6      |
        a6b6        |
      a7b6          |
                a1b7|
              a2b7  |
            a3b7    |
          a4b7      |
        a5b7        |
      a6b7          |
    a7b7            |

The terms are already separated into those that contribute to the lower 32 bits and those that do not.

36 out of 64 4x4=8 multipliers matter, and 28 do not. A reduction of 43.75% compared to the single step 32x32=64 bit multiplier.

As before, we could reduce some of those 4x4 terms into one 16x16=32 multiplier, and a number of 8x8=16 bit multipliers.

It should be obvious by now that smaller multiplier building blocks will result in a lower amount of logic. Taken to its logical conclusion, you can do this with 1x1 multipliers: the number of multipliers decreases towards 50% of the ones needed for a full 64 bit result, however the number of adders will go up.

If you’d want to code things up like this in RTL, using small multipliers building blocks is definitely not the best approach: there are much better ways to build fast multipliers. Looks up “Booth encoding” and “Wallace Tree.” Those are outside the scope of this post, and mostly irrelevant for FPGAs, since they require a lot of wiring and inefficient to map onto FPGA logic.

Floating Point Multiplier Multiplier Section

Finally, let’s have a quick look at floating point multipliers. Single precision FP32 has one sign bit, 8 exponent bits, and 23 fraction bits. The 23 fraction bits have an implied MSB that is set to 1, so the significand is really 24 bits.

To multiply 2 FP32 numbers, you’d think you need a 24x24=24 multiplier, but that’s not the case: since the implied MSB is set 1, you will always get a result for which the upper bits are set. So you need the upper 24 bits, or a few bits below that!

With a[23:0] = { a2[7:0],a1[7:0],a0[7:0] } = a2a1a0 and b[23:0] = { b2[7:0],b1[7:0],b0[7:0] } = b2b1b0, we get this:

              |  a0b0
              |a1b0
              |a0b1
            a2|b0
            a1|b1
            a0|b2
          a2b1|
          a1b2|
        a2b2  |

In this case, we’re only interested in the left side of the vertical bar, but that makes no difference in terms of the amount of logic that’s needed because the terms on the right side also influence the bits on the left.

So, no matter what, we need 9 8x8 multipliers, with the option to reduce to 1 16x16 multiplier and 5 8x8, as seen below:

              |  a0b0
              |a1b0
              |a0b1
            a1|b1
              |
            a2|b0
            a0|b2
          a2b1|
          a1b2|
        a2b2  |