GCC optimization flags for matrix/vector operations

Question

First of all, I don't recommend using -ffast-math for the following reasons:

It has been proved that the performance actually degrades when using this option in most (if not all) cases. So "fast math" is not actually that fast.
This option breaks strict IEEE compliance on floating-point operations which ultimately results in accumulation of computational errors of unpredictable nature.
You may well get different results in different environments and the difference may be substantial. The term environment (in this case) implies the combination of: hardware, OS, compiler. Which means that the diversity of situations when you can get unexpected results has exponential growth.
Another sad consequence is that programs which link against the library built with this option might expect correct (IEEE compliant) floating-point math, and this is where their expectations break, but it will be very tough to figure out why.
Finally, have a look at this article.

For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). Extract:

-Ofast

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.

There is no such flag as -O4. At least I'm not aware of that one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3 and you should be definitely using it, not only to optimize math, but in release builds in general.

-funroll-loops is a very good choice for math routines, especially involving vector/matrix operations where the size of the loop can be deduced at compile-time (and as a result unrolled by the compiler).

I can recommend 2 more flags: -march=native and -mfpmath=sse. Similarly to -O3, -march=native is good in general for release builds of any software and not only math intensive. -mfpmath=sse enables use of XMM registers in floating point instructions (instead of stack in x87 mode).

Furthermore, I'd like to say that it's a pity that you don't want to modify your code to get better performance as this is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE Intrinsics, and Vectorization, the heavy-linear-algebra code can be orders of magnitude faster than without them. However, proper application of these techniques requires in-depth knowledge of their internals and quite some time/effort to modify (actually rewrite) the code.

Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization which can be enabled by -ftree-vectorize, but it is unnecessary since you are using -O3 (because it includes -ftree-vectorize already). The point is that you should still help GCC a little bit to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to make yourself familiar with them. So see the Vectorizable Loops section in the link above.

Finally, I recommend you to look into Eigen, the C++ template-based library which has highly efficient implementation of most common linear algebra routines. It utilizes all the techniques mentioned here so far in a very clever way. The interface is purely object-oriented, neat, and pleasing to use. The object-oriented approach looks very relevant to linear algebra as it usually manipulates the pure objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with such low level concepts (as SSE, Vectorization, etc.) yourself, but just enjoy solving your specific problem.