SpinalHDL Automated Operand Latency Matching

Imagine you’re doing the following calculation:

q = a + b + c

However, all your math functional blocks only accept 2 operands. No problem, just split it up:

p = a + b
q = p + c

That’s easy to do when you write C or maybe use high-level synthesis language (HLS).

But when you write RTL and when each operation take a number of cycles, you need to be careful: if your pipeline accepts a new input each clock cycle, you need to make sure that the operands for all functional blocks are aligned!

For example: if the adder take 2 clock cycles, the result p will emerge after 2 clock cycles and you need to delay c by 2 clock cycles to align it to ‘p’ before you can apply both operands to the adder that calculats ‘q’.

My parameterizable precision fpxx library has tons of flexilibity in terms of pipeline depth. The FpxxAdd block can be configured to take between zero and 5 pipeline stages. You select the amount depending on your clock frequency needs.

The adder is declared like this:

case class FpxxAddConfig(
    pipeStages      : Int = 1

class FpxxAdd(c: FpxxConfig, addConfig: FpxxAddConfig = null) extends Component {

    def pipeStages      = if (addConfig == null) 1 else addConfig.pipeStages

And here is how to instantiate an adder with 5 pipeline stages:

    val fp_op = new FpxxAdd(config, FpxxAddConfig(pipeStages = 5))

Other blocks have similar levels of configurability.

Larger functional blocks such as ray/sphere intersection have tons of different operations that are both cascaded sequentially and working in parallel.

Sphere Intersection Pipeline

Even if the latencies through each core math block were fixed, it’d still be a real pain to ensure that all operands to all blocks were correctly latency aligned.

Comes to the rescue: the SpinalHDL LatencyAnalysis function!

Its function is a simple as it is brilliant: given a set of signals that are connected to each other through a string of combinatorial and sequential logic, it returns the minimum number of clock cycles to travel through all nodes.

The fpxx library has a op_vld input result_vld output for each core operation, which ultimately strings all operations together, from the input of the pipeline to the output.

Now check out this helper function:

object MatchLatency {

    // Match arrival time of 2 signals with _vld
    def apply[A <: Data, B <: Data](common_vld: Bool, a_vld : Bool, a : A, b_vld : Bool, b : B) : (Bool, A, B) = {

        val a_latency = LatencyAnalysis(common_vld, a_vld)
        val b_latency = LatencyAnalysis(common_vld, b_vld)

        if (a_latency > b_latency) {
            (a_vld, a, Delay(b, cycleCount = a_latency - b_latency) )
        else if (b_latency > a_latency) {
            (b_vld, Delay(a, cycleCount = b_latency - a_latency), b )
        else {
            (a_vld, a, b)

This function accepts a common/root valid and 2 valid/value pairs A and B. It calculates the latency the common valid to the 2 A and B pairs and then inserts pipeline delays for either the A or the B pair so that they are now aligned to each other.

Here’s an example how this is used in the code:

    val (common_dly_vld, tca_tca_dly, c0r0_c0r0_dly) = MatchLatency(
                                                tca_tca_vld,   tca_tca,
                                                c0r0_c0r0_vld, c0r0_c0r0)

    val d2_vld = Bool
    val d2     = Fpxx(c.fpxxConfig)

    val u_d2 = new FpxxSub(c.fpxxConfig, Constants.fpxxAddConfig)
    u_d2.io.op_vld <> common_dly_vld
    u_d2.io.op_a   <> c0r0_c0r0_dly
    u_d2.io.op_b   <> tca_tca_dly

    u_d2.io.result_vld <> d2_vld
    u_d2.io.result     <> d2

Thanks to MatchLatency and LatencyAnalysis, tca_tca_dly and c0r0_c0r0_dly are now latency aligned, with a common_dly_vld as their common valid signal!

I can change the pipeline depth of each math operation at will, and the the whole pipeline adjust automatically, inserting delays as needed.

LatencyAnalysis is absolutely brilliant and is a life saver when you design a pipeline that, in my case, ended up to have more than 100 stages.

Further improvements are possible: right now, I have a separate _vld signal for the operations. A more canonical SpinalHDL way would be to wrap operands in a generic Flow object. But that’s an improvement for later.