Question

In an university course about configurable embedded systems (on ZYNQ-7010) we were recently implementing a (naive) low-pass image filter that would apply a 1-dimensional gaussian kernel (0.25*[1 2 1]) to data coming from block RAM.

We decided to cache (i.e. queue) three pixels and then operate on them on-line in the data output process. Our first approach was to have three process variables and have them roll over in a

pixel[k-2] := pixel[k-1];
pixel[k-1] := pixel[k];
pixel[k]   := RAM(address);

fashion; The following being the full process:

process (clk25)
    -- queue
    variable pixelMinus2  : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelMinus1  : std_logic_vector(11 downto 0) := (others => '0');
    variable pixelCurrent : std_logic_vector(11 downto 0) := (others => '0');

    -- temporaries
    variable r : unsigned(3 downto 0);
    variable g : unsigned(3 downto 0);
    variable b : unsigned(3 downto 0);
begin
    if clk25'event and clk25 = '1' then
        pixelMinus2  := pixelMinus1;
        pixelMinus1  := pixelCurrent;
        pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

        IF slv_reg0(3) = '0' THEN 
            -- bypass filter for debugging
            dob <= pixelCurrent;
        ELSE
            -- colors are 4 bit each in a 12 bit vector
            -- division by 4 is done by right shifting by 2
            r := (
                          ("00" & unsigned(pixelMinus2(11 downto 10)))
                        + ("00" & unsigned(pixelMinus1(11 downto 10)))
                        + ("00" & unsigned(pixelMinus1(11 downto 10)))
                        + ("00" & unsigned(pixelCurrent(11 downto 10)))
                    );

            g :=  (
                          ("00" & unsigned(pixelMinus2(7 downto 6)))
                        + ("00" & unsigned(pixelMinus1(7 downto 6)))
                        + ("00" & unsigned(pixelMinus1(7 downto 6)))
                        + ("00" & unsigned(pixelCurrent(7 downto 6)))
                    );

            b :=  (
                          ("00" & unsigned(pixelMinus2(3 downto 2)))
                        + ("00" & unsigned(pixelMinus1(3 downto 2)))
                        + ("00" & unsigned(pixelMinus1(3 downto 2)))
                        + ("00" & unsigned(pixelCurrent(3 downto 2)))
                    );

            dob <= std_logic_vector(r) & std_logic_vector(g) & std_logic_vector(b);
        END IF;
    end if;
end process;

However this turned out to be horribly wrong; Synthesis would take ages and result in an estimated LUT usage of approximately 130% of the device's capability.

We later changed the implementation to using signals instead of variables and this resolved all problems; The hardware behaved as expected and LUT usage went down to some percent.

My question is what's causing the problem here when using variables as, from our understanding, it should work like that.

Was it helpful?

Solution 2

When a variable is used for pixelCurrent in the process, then the value is updated and available immediately, where the value of a signal is not ready until the next cycle.

So when a variable is use, this line implements a RAM with asynchronous read based on addrb:

pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

Where an assign to a signal will implements a RAM with synchronous read, where the value read from the RAM is not available until next cycle.

The typical FPGA technologies has dedicated hardware for RAMs with synchronous read, but RAMs with asynchronous are made with combinatorial logic (look up tables / LUT).

So the huge amount of LUTs that appears when using a variable for pixelCurrent is because the synthesis tool tries to map the RAM with asynchronous read into LUTs, which typically requires a huge amount of LUTs and makes the resulting RAM very slow.

In the pipelined design it sounds like the asynchronous RAM read is not required, so if pixelCurrent is a signal, a synchronous RAM is used instead and the synthesis tool will map the RAM to an internal RAM hardware block, with code like:

pixelMinus2  := pixelMinus1;
pixelMinus1  := pixelCurrent;
pixelCurrent <= RAM(to_integer(UNSIGNED(addrb)));

OTHER TIPS

Signals, being the means of inter-process communication, have assignment semantics carefully designed to avoid race conditions and hazards. See this Q&A and this link to "VHDL's crown jewel" for the gory details.

Therefore when you assign pixelCurrent (signal)

pixelCurrent <= RAM(to_integer(UNSIGNED(addrb)));

the assignment doesn't happen until the process suspends (which for RTL code is typically when the process exits and at the sensitivity list), and the result is not available within this process until it next wakes up at if rising_edge(clk25). So this creates a pipeline register.

Variables within a VHDL process act like variables in a process in any other imperative language (C etc) - once updated, their new value is immediately available.

Therefore the following:

pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

IF slv_reg0(3) = '0' THEN 
    -- bypass filter for debugging
    dob <= pixelCurrent;

propagates the NEW value of pixelCurrent into the rest of the process, generating a HUGE design which tries to accomplish EVERYTHING within a single clock cycle.

There are two solutions : my preferred one is to use signals for pipeline registers, because you can describe the pipeline in its most natural manner (with the first stage first).

The second solution, using variables as pipeline registers - ironically you already partially adopt this solution -

pixelMinus2  := pixelMinus1;
pixelMinus1  := pixelCurrent;
pixelCurrent := RAM(to_integer(UNSIGNED(addrb)));

is to describe the pipeline BACKWARDS so that the assignment to a variable comes after the last use of its value.

Simply move these three assignments after the big IF slv_reg0(3) and your variable version should work.

Having verified that both approaches generate the same hardware, pick whichever approach you think leads to the clearest (easiest to understand) design.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top