I suggest you first try to do this without pipelining, in one clock cycle, and then break it into multiple stages.
The design depends on several factors:
- How many resources do you want to allocate? (This affects area and power.)
- What is your clock cycle time? Since multipliers are slow and expensive, you don't want to chain many of them back to back within one cycle.
- What throughput do you need? Do you want one result per clock cycle, or one result every N clock cycles? (The latter lets you share resources, e.g. reuse the same multiplier.)
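To make the resource-sharing option concrete, here is a rough behavioural sketch in Python (not HDL). It assumes a single physical multiplier that does one multiply per clock cycle and a separate adder for the accumulation; the function name `eval_shared_multiplier` and the cycle accounting are my own assumptions for illustration.

```python
def eval_shared_multiplier(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x^n with ONE multiplier.

    coeffs is [a0, a1, ..., an]. Returns (result, multiplier_cycles).
    Assumption: each multiply occupies the shared multiplier for one cycle;
    additions happen in a separate adder and are not counted.
    """
    acc = coeffs[0]   # a0 * x^0 needs no multiplier
    power = x         # x^1 is just the input
    cycles = 0
    for k, a in enumerate(coeffs[1:], start=1):
        if k > 1:
            power = power * x   # one cycle: x^k = x^(k-1) * x
            cycles += 1
        acc += a * power        # one cycle: a_k * x^k
        cycles += 1
    return acc, cycles


result, cycles = eval_shared_multiplier([1, 2, 3, 4, 5, 6], 2)
print(result, cycles)
```

For a degree-5 polynomial this needs 9 multiplier cycles per result, so you trade throughput for area.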
Here is an example: let's assume you can tolerate the delay of only two consecutive multipliers per clock cycle, and you want a throughput of one polynomial per clock cycle. Your pipeline structure can look like this:
stage 1: inputs: {a5,...,a0,x}
Combinational circuit:
{a5,...,a0,x}---------------------->{a5,...,a0,x}
              |-->------->x^2
x->[mult]->x^2->[mult]->x^3
stage 2: inputs: {a5,...,a0,x,x^2,x^3}
Combinational circuit:
{a5,...,a0,x,x^2,x^3}------------------------>{a5,...,a0,x,x^2,x^3}
                |-->------->x^4
x^3->[mult]->x^4->[mult]->x^5
stage 3: inputs: {a5,...,a0,x,x^2,x^3,x^4,x^5}
Combinational circuit:
(a0,x^0)->[mult]->a0x^0--\
(a1,x^1)->[mult]->a1x^1--\
(a2,x^2)->[mult]->a2x^2-->[sum]-> a0x^0+a1x^1+...+a5x^5
(a3,x^3)->[mult]->a3x^3--/
(a4,x^4)->[mult]->a4x^4--/
(a5,x^5)->[mult]->a5x^5--/
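The three stages above can be checked with a small cycle-by-cycle model in Python. This is only a behavioural sketch, not HDL: the names `stage1`, `stage2`, `stage3`, `run_pipeline` and the dict-based pipeline registers are assumptions I made for illustration, and I assume stages 2 and 3 multiply by x to get the next power.

```python
def stage1(d):
    x = d["x"]
    x2 = x * x      # first multiplier
    x3 = x2 * x     # second multiplier, chained in the same cycle
    return {**d, "x2": x2, "x3": x3}

def stage2(d):
    x4 = d["x3"] * d["x"]   # first multiplier
    x5 = x4 * d["x"]        # second multiplier, chained
    return {**d, "x4": x4, "x5": x5}

def stage3(d):
    powers = [1, d["x"], d["x2"], d["x3"], d["x4"], d["x5"]]
    # one multiplier per term, then the adder tree
    return sum(a * p for a, p in zip(d["a"], powers))

def run_pipeline(inputs):
    """inputs: list of (coeffs, x). Returns the output seen on each clock cycle."""
    reg1 = reg2 = None          # pipeline registers after stage 1 and stage 2
    outputs = []
    for item in list(inputs) + [None, None]:   # two bubbles to flush the pipe
        # All stages read the OLD register values, then registers update together.
        out = stage3(reg2) if reg2 is not None else None
        new_reg2 = stage2(reg1) if reg1 is not None else None
        new_reg1 = stage1({"a": item[0], "x": item[1]}) if item is not None else None
        reg1, reg2 = new_reg1, new_reg2
        outputs.append(out)
    return outputs

print(run_pipeline([([1, 2, 3, 4, 5, 6], 2), ([1, 1, 1, 1, 1, 1], 1)]))
```

In this model the first result shows up on the third cycle, and after that a new result comes out every cycle as long as you keep feeding inputs.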
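Notice that we are using a lot of resources to achieve a throughput of one result per cycle: ten multipliers as drawn (the a0*x^0 one is trivial and can be dropped, since x^0 = 1), plus pipeline registers that carry the coefficients and the powers of x from stage to stage.
If you can't have more than one multiplier in series per stage, you need to break stages 1 and 2 into two stages each, which gives five stages in total.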