Pipelining in a polynomial

Question

I suggest you first implement this without pipelining, as a single-cycle design, and then break it into multiple stages.

The design depends on several factors:

How many resources do you want to allocate? (This affects area and power.)
What is your clock cycle time? Since multipliers are slow and expensive, you don't want to chain many of them back to back.
What is your throughput? Do you want a result every clock cycle, or a result every N clock cycles? (The latter lets you share resources.)
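
To make the resource-sharing option concrete, here is a rough behavioral sketch in Python (not HDL). It assumes Horner's rule, which the answer above does not prescribe; it is just one common way to reuse a single multiplier. The function name and the cycle counter are made up for illustration. Each loop iteration stands for one clock cycle in which the one multiplier and one adder are used once.

```python
def horner_shared_multiplier(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x^n with one multiply-add per simulated cycle.

    coeffs is [a0, a1, ..., an]. Returns (result, cycles_used).
    Assumption: Horner's rule, chosen here only to illustrate resource sharing.
    """
    acc = coeffs[-1]              # load the highest coefficient into the accumulator
    cycles = 0
    for a in reversed(coeffs[:-1]):
        acc = acc * x + a         # the single shared multiplier, then one adder
        cycles += 1
    return acc, cycles


result, cycles = horner_shared_multiplier([1, 2, 3, 4, 5, 6], 2)
print(result, cycles)  # 321 5
```

So a degree-5 polynomial costs one multiplier but five cycles per result, instead of one result per cycle.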

Here is an example. Let's assume you can tolerate the delay of only two consecutive multipliers per clock cycle, and you want a throughput of one polynomial per clock cycle. Your pipeline structure could look like this:

stage 1: inputs: {a5,...,a0,x}

Combinational circuit:

{a5,...,a0,x}---------------------->{a5,...,a0,x}
                        |-->------->x^2
            x->[mult]->x^2->[mult]->x^3

stage 2: inputs: {a5,...,a0,x,x^2,x^3}

Combinational circuit:
{a5,...,a0,x,x^2,x^3}------------------------>{a5,...,a0,x,x^2,x^3}
                                  |-->------->x^4
                    x^3->[mult]->x^4->[mult]->x^5

stage 3: inputs: {a5,...,a0,x,x^2,x^3,x^4,x^5}

Combinational circuit:
(a0,x^0)->[mult]->a0x^0--\      
(a1,x^1)->[mult]->a1x^1--\ 
(a2,x^2)->[mult]->a2x^2-->[sum]-> a0x^0+a1x^1+...+a5x^5      
(a3,x^3)->[mult]->a3x^3--/      
(a4,x^4)->[mult]->a4x^4--/      
(a5,x^5)->[mult]->a5x^5--/

Notice that we are using a lot of multipliers (ten in total: two each in stages 1 and 2, and six in stage 3) in order to achieve a throughput of one result per cycle.
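
If it helps to check the timing before writing any HDL, the three stages can be modeled in software. The following is only a behavioral sketch in Python under my own assumptions: the names `stage1`, `stage2`, `stage3` and `run_pipeline` are invented, `None` stands for an empty pipeline slot, and one loop iteration stands for one clock cycle.

```python
def stage1(regs):
    # Two multipliers in series: x -> x^2 -> x^3
    a, x = regs
    x2 = x * x
    x3 = x2 * x
    return (a, x, x2, x3)


def stage2(regs):
    # Two multipliers in series: x^3 -> x^4 -> x^5
    a, x, x2, x3 = regs
    x4 = x3 * x
    x5 = x4 * x
    return (a, x, x2, x3, x4, x5)


def stage3(regs):
    # Six parallel multipliers (one level deep) followed by the sum
    a, x, x2, x3, x4, x5 = regs
    powers = (1, x, x2, x3, x4, x5)
    return sum(ai * p for ai, p in zip(a, powers))


def run_pipeline(inputs):
    """Feed one (coeffs, x) pair per cycle; return a list of (cycle, result)."""
    r1 = r2 = None                        # registers after stage 1 and stage 2
    results = []
    stream = list(inputs) + [None, None]  # two extra cycles to drain the pipeline
    for cycle, item in enumerate(stream, start=1):
        # Update from the last stage backwards so each stage reads the
        # register value written in the previous cycle.
        if r2 is not None:
            results.append((cycle, stage3(r2)))
        r2 = stage2(r1) if r1 is not None else None
        r1 = stage1(item) if item is not None else None
    return results


print(run_pipeline([([1, 2, 3, 4, 5, 6], 2), ([1, 1, 1, 1, 1, 1], 3)]))
# [(3, 321), (4, 364)]
```

The first result appears in cycle 3 (the latency), and after that one result comes out every cycle.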

If you can't have more than one multiplier in series per stage, you need to break stages 1 and 2 into two stages each, which gives five stages in total.
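
As a rough check of that deeper variant, here is a Python behavioral sketch with one multiplier in series per stage. Everything in it is hypothetical naming on my part (`make_stages`, `run`, the dictionary keys); it only models which values are registered between stages and when results appear.

```python
def make_stages():
    # Each stage takes a dict of registered values and is at most one multiplier deep.
    def s1(r): return {**r, "x2": r["x"] * r["x"]}
    def s2(r): return {**r, "x3": r["x2"] * r["x"]}
    def s3(r): return {**r, "x4": r["x3"] * r["x"]}
    def s4(r): return {**r, "x5": r["x4"] * r["x"]}

    def s5(r):
        # Six parallel multipliers followed by the sum
        powers = (1, r["x"], r["x2"], r["x3"], r["x4"], r["x5"])
        return sum(a * p for a, p in zip(r["a"], powers))

    return [s1, s2, s3, s4, s5]


def run(stages, inputs):
    """Feed one {"a": coeffs, "x": value} per cycle; return (cycle, result) pairs."""
    regs = [None] * (len(stages) - 1)   # registers between consecutive stages
    out = []
    stream = list(inputs) + [None] * len(regs)   # extra cycles to drain
    for cycle, item in enumerate(stream, start=1):
        if regs[-1] is not None:
            out.append((cycle, stages[-1](regs[-1])))
        # Shift from the back so each stage reads last cycle's register value.
        for i in range(len(regs) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](item) if item is not None else None
    return out


print(run(make_stages(), [{"a": [1, 2, 3, 4, 5, 6], "x": 2}]))
# [(5, 321)]
```

The throughput is still one result per cycle; only the latency grows from three cycles to five, and more values have to be carried in pipeline registers.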