You want to implement a shift register to feed data in serially. In general, you should use case statements only for control logic and not for data flow. Treating the data vector as an addressable array is also possible, but it synthesizes to a decoder, which will become a timing problem at this size and isn't strictly necessary for just moving bits into place.
    signal data : std_logic_vector(127 downto 0);
    ...
    sreg: process(clock, reset)
    begin
      if reset = '1' then
        data <= (others => '0');
      elsif rising_edge(clock) then
        if shift_en = '1' then
          data <= data(126 downto 0) & a; -- Shift left
          -- data <= a & data(127 downto 1); -- Shift right
        end if;
      end if;
    end process;
You will have to decide how to control what you do when the shift register is filled. Either implement a counter to count off 128 shifts or use another control signal that starts the next stage of processing when shifting is done.
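As a rough illustration of the counter option, here is a minimal sketch. The entity name shift_counter, the done output, and the count signal are all hypothetical names made up for this example; only clock, reset, and shift_en come from the code above. It counts enabled shifts and holds done high once 128 have been seen, until the next reset.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: counts enabled shifts and flags when 128 are done.
entity shift_counter is
  port (
    clock    : in  std_logic;
    reset    : in  std_logic;
    shift_en : in  std_logic;
    done     : out std_logic  -- assumed name: high after 128 shifts
  );
end entity;

architecture rtl of shift_counter is
  signal count : natural range 0 to 128;
begin
  cnt: process(clock, reset)
  begin
    if reset = '1' then
      count <= 0;
    elsif rising_edge(clock) then
      -- Saturate at 128 so the counter never leaves its declared range
      if shift_en = '1' and count < 128 then
        count <= count + 1;
      end if;
    end if;
  end process;

  -- Stays high until the next reset
  done <= '1' when count = 128 else '0';
end architecture;
```

If you need to load several words back to back, you would clear the counter from your control logic instead of relying on reset; that depends on how the next stage consumes the data.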
On the test bench side you have a lot more flexibility in how you drive signals since there are no concerns about synthesis results. You generally have two options: write synchronous processes similar in style to the DUT or use wait statements to manage the order of signalling without implementing the synchronous mechanisms needed in synthesizable code.
    constant CPERIOD : delay_length := 10 ns;
    ...
    stim: process is
      variable data : std_logic_vector(127 downto 0);
    begin
      -- Initialize signal drivers
      a <= '0';
      shift_en <= '0';
      reset <= '1', '0' after CPERIOD * 2;
      wait until falling_edge(clock);
      wait for CPERIOD * 2;
      data := X"CAFEBABECAFED00D8BADF00DDEADBEEF";
      -- Shift data in from left to right (MSB first)
      shift_en <= '1';
      for i in data'range loop
        a <= data(i);
        wait for CPERIOD;
      end loop;
      shift_en <= '0';
      wait for CPERIOD;
      wait; -- Stop process from restarting
    end process;
Note: Driving the stimulus on the falling edge of the clock is a lazy technique that dodges any issues with delta cycle ordering when you drive on the same edge as the receiver. It is not always appropriate when you want to represent accurate signal timing, but it guarantees you won't have to wrestle with the simulation engine processing events in a different order than you intended. Definitely don't do it in synthesizable code (unless you're experimenting with domino logic).
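For comparison, here is a sketch of the other test bench style mentioned earlier: a clocked stimulus process written like the DUT and driving on the rising edge. Everything beyond the signals already used above is an assumption for illustration (the tb_sync_stim entity, the PATTERN constant, the finished flag, the clock generator, and the inlined copy of the shift register). Because the stimulus and the register update on the same edge, the register samples each value one clock after it is driven, which is why the check waits for finished rather than counting cycles.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity tb_sync_stim is
end entity;

architecture sim of tb_sync_stim is
  constant CPERIOD : delay_length := 10 ns;
  -- Assumed test pattern, same value as in the wait-based process
  constant PATTERN : std_logic_vector(127 downto 0) :=
    X"CAFEBABECAFED00D8BADF00DDEADBEEF";

  signal clock    : std_logic := '0';
  signal reset    : std_logic := '1';
  signal a        : std_logic := '0';
  signal shift_en : std_logic := '0';
  signal data     : std_logic_vector(127 downto 0);
  signal finished : boolean := false;  -- hypothetical end-of-test flag
begin
  -- Clock generator that stops once the stimulus is finished
  clk_gen: process
  begin
    while not finished loop
      clock <= '0'; wait for CPERIOD / 2;
      clock <= '1'; wait for CPERIOD / 2;
    end loop;
    wait;
  end process;

  reset <= '1', '0' after CPERIOD * 2;

  -- Copy of the shift register, inlined to keep the sketch self-contained
  sreg: process(clock, reset)
  begin
    if reset = '1' then
      data <= (others => '0');
    elsif rising_edge(clock) then
      if shift_en = '1' then
        data <= data(126 downto 0) & a; -- Shift left
      end if;
    end if;
  end process;

  -- Clocked stimulus: presents one bit per rising edge, MSB first
  stim: process(clock)
    variable idx : integer range -1 to 127 := 127;
  begin
    if rising_edge(clock) then
      if reset = '1' then
        idx := 127;
        shift_en <= '0';
      elsif idx >= 0 then
        a        <= PATTERN(idx);
        shift_en <= '1';
        idx      := idx - 1;
      else
        -- The register samples the last bit on this same edge
        shift_en <= '0';
        finished <= true;
      end if;
    end if;
  end process;

  check: process
  begin
    wait until finished;
    assert data = PATTERN
      report "shift register contents mismatch" severity error;
    wait;
  end process;
end architecture;
```

This style takes more code than the wait-based process, but its timing matches what a real synchronous driver would produce.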