VHDL beginner - what's going wrong wrt timing in this circuit?

Question 1

Did your timing report indicate that you had a timing problem? It looks to me like you were just rolling through the segment values extremely fast. No matter how well you design for higher clock speeds, you're rotating cur_anode every clock cycle, and therefore your display will change accordingly. If your clock is too fast, the display will change much faster than a human would be able to read it.
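One common way to slow the display down without touching the clock itself is a periodic clock enable. The following is a minimal sketch, not taken from your code: the entity name RefreshTick, the port TICK and the generic WIDTH are all hypothetical. Assuming a 100 MHz clock (an assumption, check your board), the default WIDTH of 17 gives one pulse roughly every 1.31 ms, so eight digits would refresh at about 95 Hz.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper, not part of the original design: produces a
-- one-clock-wide enable pulse every 2**WIDTH cycles of CLK.
entity RefreshTick is
    generic (WIDTH : positive := 17);    -- assumed value; tune for your clock
    port (
        CLK  : in  std_logic;
        TICK : out std_logic := '0'
    );
end entity;

architecture rtl of RefreshTick is
    constant ALL_ONES : unsigned(WIDTH - 1 downto 0) := (others => '1');
    signal   count    : unsigned(WIDTH - 1 downto 0) := (others => '0');
begin
    process (CLK)
    begin
        if rising_edge(CLK) then
            count <= count + 1;          -- wraps to zero on overflow
            if count = ALL_ONES then     -- last count before the wrap
                TICK <= '1';
            else
                TICK <= '0';
            end if;
        end if;
    end process;
end architecture;
```

In the display process you would then rotate the anode only inside `if TICK = '1' then ... end if;`, still under `rising_edge(CLK)`, so everything stays in a single clock domain.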

Some other suggestions:

You should split your single process into separate clocked and unclocked processes. It's not that what you're doing won't end up synthesizing (obviously), but it's unconventional, and may lead to unexpected results.
Your initialization on cur_seg won't really do anything, as it's always driven (combinationally) by your process. It's not a problem - just wanted to make sure you were aware.
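To illustrate the clocked/unclocked split, here is a minimal sketch of an anode ring counter written as two processes. The entity AnodeRing and the signal next_anode are hypothetical names introduced for this example, and the active-low reset pattern "11111110" is an assumption about your display.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for illustration; port names are assumptions.
entity AnodeRing is
    port (
        CLK   : in  std_logic;
        ANODE : out std_logic_vector(7 downto 0)
    );
end entity;

architecture two_process of AnodeRing is
    -- Assumed active-low anodes: exactly one '0' circulates.
    signal cur_anode  : std_logic_vector(7 downto 0) := "11111110";
    signal next_anode : std_logic_vector(7 downto 0);
begin
    -- Unclocked (combinational) process: computes the next state only.
    NEXT_STATE:
    process (cur_anode)
    begin
        next_anode <= cur_anode(6 downto 0) & cur_anode(7);  -- rotate left
    end process;

    -- Clocked process: the only place where state is registered.
    STATE_REG:
    process (CLK)
    begin
        if rising_edge(CLK) then
            cur_anode <= next_anode;
        end if;
    end process;

    ANODE <= cur_anode;
end architecture;
```

Here the initial value on cur_anode does matter, because cur_anode is a register rather than a combinationally driven signal.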

Question 2

Well there are two parts to this.

Your segments appeared so dim because you are running each digit at a 1/8 duty cycle and switching faster than the segments have time to react (on every clock pulse you change which digit is lit, and then you stop driving it on the next pulse).

By increasing the period, you moved the segments from a transient current (segments need time to ramp up) to a steady-state current: driving each digit for longer than the segments' inherent response time lets the current reach the intended level. Hence the brightness increase.

One other thing about your code. You may be aware of this, but when you latch with your clock there, the variable labeled cur_anode has already been advanced and actually represents the NEXT anode. You also latch ANODE and SEGMENT to the current anode and segment respectively. Just pointing out that the name cur_anode may be a misnomer (and is confusing, because it's usually the NEXT one).

Question 3

Keeping in mind Paul Seeb's and fru1bat's answers on clock speed, Paul's comment on the NEXT anode, fru1bat's suggestion to separate clocked and unclocked processes, and your note that you had 8 ROMs, there are alternative architectures.

Your architecture with a ring counter for ANODE and multiple ROMs happens to be optimal for speed, which, as both Paul and fru1bat note, isn't needed. Instead you can optimize for area.

Because the clock speed is either external or controlled by adding a periodically supplied enable, it isn't addressed in the area optimization:

architecture foo of BCDTo7SegDriver is
    signal digit:   natural range 0 to 7;            -- 3 bit binary counter
    signal bcd:     std_logic_vector (3 downto 0);   -- input to ROM
begin

UNLABELED:
    process (CLK) 
    begin
        if rising_edge(CLK) then

            if digit = 7 then       -- integer/unsigned "+" result range 
                digit <= 0;         -- not tied to digit range in simulation
            else
                digit <= digit + 1;
            end if;

        SEGMENT_REG:
            SEGMENT <= BCD_TO_DEC7(bcd);  -- single ROM look up

        ANODE_REG:
            for i in ANODE'range loop
                if digit = i then
                    ANODE(i) <= '0';
                else
                    ANODE(i) <= '1';
                end if;
            end loop;
        end if;        
    end process;

BCD_MUX:    
    with digit select 
        bcd <= VAL(3 downto 0)   when 0,
               VAL(7 downto 4)   when 1,
               VAL(11 downto 8)  when 2,
               VAL(15 downto 12) when 3,
               VAL(19 downto 16) when 4,
               VAL(23 downto 20) when 5,
               VAL(27 downto 24) when 6,
               VAL(31 downto 28) when 7;

end architecture;

This trades off a 32 bit register (cur_val), an 8 bit ring counter (cur_anode) and seven copies of the ROM implied by function BCD_TO_DEC7 for a three bit binary counter.

In truth, the argument over whether or not you should be using separate sequential (clocked) and combinatorial (unclocked) processes is somewhat reminiscent of Lilliput and Blefuscu going to war over endianness.

Separate processes generally simulate a little more efficiently because they don't share sensitivity lists. You could also note that all concurrent statements have process or block statement equivalents. There's also nothing in this design that can take particular advantage of variables, which can make simulation more efficient but imply a single process. (Shared variables aren't supported by XST.)

I haven't verified that this will synthesize, but after reading through the 14.1 version of the XST user guide I think it should. If not, you can convert digit to a std_logic_vector with a length of 3.

The + 1 for digit will get optimized: an incrementer is smaller than a full adder.
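If the integer version gives the synthesis tool trouble, the counter could look something like the sketch below. It keeps the count as a 3-bit numeric_std unsigned internally and only converts to std_logic_vector at the port, which is a slight variation on the conversion mentioned above. The entity name DigitCounter and the port DIGIT are hypothetical, used only to make the example self-contained.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-alone entity; in the real design this counter
-- would live inside the driver architecture.
entity DigitCounter is
    port (
        CLK   : in  std_logic;
        DIGIT : out std_logic_vector(2 downto 0)
    );
end entity;

architecture rtl of DigitCounter is
    signal count : unsigned(2 downto 0) := (others => '0');
begin
    process (CLK)
    begin
        if rising_edge(CLK) then
            count <= count + 1;   -- 3-bit unsigned wraps from 7 to 0
        end if;
    end process;

    DIGIT <= std_logic_vector(count);
end architecture;
```

Because a 3-bit unsigned wraps on its own, the explicit `if digit = 7` test is no longer needed. The choices in the selected signal assignment would then have to become "000" through "111" (plus a `when others` branch), or you could select on `to_integer(count)` and keep the integer choices.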