As noted in the earlier comments, a parallel tree gains the advantage as you widen the comparator. For a comparator as narrow as 8 bits, however, routing delays can dominate, and the Cyclone II will perform better using its carry chains (see section 2-2 of the Cyclone II device handbook), since they connect directly to neighboring LEs. This is why serial logic can outperform parallel here.
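For reference, here is a minimal sketch of an 8-bit comparator written behaviorally, so the synthesis tool is free to choose the implementation. The entity name cmp8 and its port names are made up for illustration. I'd expect a comparison like this to be a candidate for the carry chain on a Cyclone II, but I haven't checked what the fitter actually does with it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; names are illustrative, not taken from your design.
entity cmp8 is
  port (
    a, b   : in  std_logic_vector(7 downto 0);
    a_lt_b : out std_logic
  );
end entity cmp8;

architecture rtl of cmp8 is
begin
  -- Unsigned magnitude compare; the structure (carry chain or tree)
  -- is left to the synthesis tool.
  a_lt_b <= '1' when unsigned(a) < unsigned(b) else '0';
end architecture rtl;
```

Checking the fitter's resource and timing reports for this against your hand-built tree would show which one the tool actually makes faster at this width.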
As for rising_edge, you've written a mix of two conventions. Before rising_edge was standard, the same function was written as clk'event and clk='1'; since rising_edge already defines the new state as '1', there's no need to test it again. Testing for a high level alone, on the other hand, produces not a D flip-flop but a transparent latch, a rarely desired function that most FPGAs are not optimized for, and the synthesis tools tend to warn about it.
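To put the three forms side by side, here is a small sketch. The entity edge_vs_level and its signal names are invented for the comparison and are not from your code. The first two processes should each infer a D flip-flop, while the third, which tests only the level, describes a transparent latch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity; names are illustrative only.
entity edge_vs_level is
  port (
    clk     : in  std_logic;
    d       : in  std_logic;
    q_new   : out std_logic;  -- rising_edge form
    q_old   : out std_logic;  -- older clk'event form
    q_latch : out std_logic   -- level test only
  );
end entity edge_vs_level;

architecture rtl of edge_vs_level is
begin
  -- Current convention: rising_edge already implies the new value is '1'.
  p_new : process (clk)
  begin
    if rising_edge(clk) then
      q_new <= d;
    end if;
  end process p_new;

  -- Older convention: an event on clk, with clk now at '1'.
  p_old : process (clk)
  begin
    if clk'event and clk = '1' then
      q_old <= d;
    end if;
  end process p_old;

  -- Level test alone: q_latch follows d for as long as clk is high,
  -- which is a transparent latch rather than a flip-flop.
  p_latch : process (clk, d)
  begin
    if clk = '1' then
      q_latch <= d;
    end if;
  end process p_latch;
end architecture rtl;
```

Pick one of the first two forms and use it consistently. Combining rising_edge(clk) with an extra clk='1' test adds nothing.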
As for your timing results, without seeing the test method I can't read anything into the time you mention. Was it even a post-fitting simulation? It's rarely worth going to that extent for such a small function.