The main challenge in implementation of arbitrary-size matrix-matrix multiplication is not the use of SIMD, but reuse of cached data. The paper Anatomy of High-Performance Matrix Multiplication by Goto and Van de Geijn is a must-read if you want to implement cache-friendly matrix-matrix multiplication, and it also discusses the choice of kernels to be SIMD-friendly. After reading this paper expect to achieve 50% of machine peak on matrix-matrix multiplication after two weeks of efforts.
However, if the purpose of this work is not pure learning, I strongly recommend to use a highly optimized library. On x86 your best options are OpenBLAS (BSD-licensed, supports dynamic CPU dispatching), BLIS (BSD-licensed, easily portable to new processors), and Intel MKL (commercial, supports dynamic CPU dispatching on Intel processors). For performance reasons it is better to avoid ATLAS unless you target a very exotic architecture which is not supported by other libraries.