Why is my Strassen's Matrix Multiplication slow?

Question 1

One issue with Strassen's code is obvious - I don't have cutoff point, that switches to regular MM.

It's fair to say that recursing down to 1 point is the bulk of (if not the entire) problem. Trying to guess at other performance bottlenecks without addressing this is almost moot due to the massive performance hit that it brings. (In other words, you're comparing Apples to Oranges.)

As discussed in the comments, cache alignment could have an effect, but not to this scale. Furthemore, cache alignment would likely hurt the regular algorithm more than the Strassen algorithm since the latter is cache-oblivious.

void strassen(int **a, int **b, int **c, int tam) {

    // trivial case: when the matrix is 1 X 1:
    if (tam == 1) {
            c[0][0] = a[0][0] * b[0][0];
            return;
    }

That's far too small. While the Strassen algorithm has a smaller complexity, it has a much bigger Big-O constant. For one, you have function call overhead all the way down to 1 element.

This is analogous to using merge or quick sort and recursing all the way down to one element. To be efficient you need to stop the recursion when the size gets small and fall back to the classic algorithm.

In quick/merge sort, you'd fall back to a low-overhead O(n^2) insertion or selection sort. Here you would fall back to the normal O(n^3) matrix multiply.

The threshold which you fall back the classic algorithm should be a tunable threshold that will likely vary depending on the hardware and the ability of the compiler to optimize the code.

For something like Strassen multiplication where the advantage is only O(2.8074) over the classic O(n^3), don't be surprised if this threshold turns out to be very high. (thousands of elements?)

In some applications there can be many algorithms each with decreasing complexity but increasing Big-O. The result is that multiple algorithms become optimal at different sizes.

Large integer multiplication is a notorious example of this:

Grade-school Multiplication: O(N^2) optimal for < ~100 digits*
Karatsuba Multiplication: O(N^1.585) faster than above at ~100 digits*
Toom-Cook 3-way: O(N^1.465) faster than Karatsuba at ~3000 digits*
Floating-point FFT: O(> N log(N)) faster than Karatsuba/Toom-3 at ~700 digits*
Schönhage–Strassen algorithm (SSA): O(N log(n) loglog(n)) faster than FFT at ~ a billion digits*
Fixed-width Number-Theoretic Transform: O(N log(n) faster than SSA at ~ a few billion digits?*

_{*Note these example thresholds are approximate and can vary drastically - often by more than a factor of 10.}

Question 2

So, there may be more problems that this, but your first problem is that you're using arrays of pointers to arrays. And since you're using array sizes that are powers of 2, this is an especially big performance hit over allocating the elements contiguously and using integer division to fold the long array of numbers into rows.

Anyway, that's my first guess as to a problem. As I said, there may be more, and I'll add to this answer as I discover them.

Edit: This likely only contributes a small amount to the problem. The problem is likely the one Luchian Grigore refers to involving cache line contention issues with powers of two.

I verified that my concern is valid for the naive algorithm. The time for the naive algorithm goes down by almost 50% if the array is contiguous instead. Here is the code for this (using a SquareMatrix class that is C++11 dependent) on pastebin.