Instead of using implicit aligned loads/stores like this:
__m128* a = (__m128*) (data1+1); // <-- +1 (misaligned: address is not a multiple of 16)
__m128* b = (__m128*) (data2);
__m128* c = (__m128*) (data3);
*c = _mm_add_ps(*a, *b); // dereferencing *a emits an aligned load (movaps) and faults
use explicit aligned/unaligned loads/stores as appropriate, e.g.:
__m128 va = _mm_loadu_ps(data1+1); // <-- +1 (NB: use unaligned load)
__m128 vb = _mm_load_ps(data2);
__m128 vc = _mm_add_ps(va, vb);
_mm_store_ps(data3, vc);
This compiles to the same number of instructions, but it won't crash, and you have explicit control over which loads and stores are aligned and which are unaligned.
Note that recent CPUs pay relatively small penalties for unaligned loads, but on older CPUs the hit can be 2x or worse.