According to perf (see https://perf.wiki.kernel.org/index.php/Main_Page), most of the time in your code is spent in the loop-control instructions (compare + jump) associated with:
for (j = 0; j < NR_ARRAYS; j++)
       │    DATATYPE sum(DATATYPE **ptr, int size) {
       │        DATATYPE sum = 0;
       │        int i, j;
       │        for (i = 0; i < size; i += STEP_SIZE) {
       │            for (j = 0; j < NR_ARRAYS; j++) {
       │                sum += ptr[j][i];
  2.83 │60:   mov    (%rdx),%rdi
  4.37 │      add    $0x8,%rdx
  5.50 │      add    (%rdi,%r8,1),%rcx
       │
       │    DATATYPE sum(DATATYPE **ptr, int size) {
       │        DATATYPE sum = 0;
       │        int i, j;
       │        for (i = 0; i < size; i += STEP_SIZE) {
       │            for (j = 0; j < NR_ARRAYS; j++) {
 86.29 │      cmp    %r12,%rdx
       │    ↑ jne    60
  0.10 │      add    $0x40,%r8
As a consequence, this loop overhead dominates the profile and you don't see the influence of bad alignment.
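
To make this concrete, here is a minimal sketch of the profiled function next to a hand-unrolled variant that executes no compare + jump per array. The values of DATATYPE, NR_ARRAYS and STEP_SIZE are assumptions, since your definitions are not shown here (an 8-byte type with STEP_SIZE 8 would match the `add $0x40,%r8` stride in the annotation; NR_ARRAYS is a guess). `sum_unrolled` and `make_arrays` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed values: the original definitions are not shown.
 * An 8-byte DATATYPE with STEP_SIZE 8 gives the 0x40-byte stride
 * seen in the annotation; NR_ARRAYS is a guess. */
typedef long DATATYPE;
#define NR_ARRAYS 4
#define STEP_SIZE 8

/* The unrolled variant below hard-codes four arrays. */
static_assert(NR_ARRAYS == 4, "sum_unrolled assumes NR_ARRAYS == 4");

/* The function from the perf annotation, logic unchanged. */
DATATYPE sum(DATATYPE **ptr, int size) {
    DATATYPE sum = 0;
    int i, j;
    for (i = 0; i < size; i += STEP_SIZE) {
        for (j = 0; j < NR_ARRAYS; j++) {
            sum += ptr[j][i];
        }
    }
    return sum;
}

/* Hypothetical variant: inner loop unrolled by hand, so the only
 * loop-control instructions left belong to the outer loop. */
DATATYPE sum_unrolled(DATATYPE **ptr, int size) {
    DATATYPE *a = ptr[0], *b = ptr[1], *c = ptr[2], *d = ptr[3];
    DATATYPE total = 0;
    int i;
    for (i = 0; i < size; i += STEP_SIZE) {
        total += a[i] + b[i] + c[i] + d[i];
    }
    return total;
}

/* Hypothetical helper: allocates NR_ARRAYS arrays of `size` elements
 * and sets element i of array j to i + j. Returns 0 on success and
 * -1 if an allocation fails (already allocated arrays are freed). */
int make_arrays(DATATYPE **ptr, int size) {
    int i, j;
    for (j = 0; j < NR_ARRAYS; j++) {
        ptr[j] = malloc((size_t)size * sizeof(DATATYPE));
        if (ptr[j] == NULL) {
            while (j-- > 0) {
                free(ptr[j]);
            }
            return -1;
        }
        for (i = 0; i < size; i++) {
            ptr[j][i] = i + j;
        }
    }
    return 0;
}
```

Called from a small driver and built with something like `gcc -O2 -g`, both functions can be profiled with `perf record` and inspected with `perf annotate`. Whether the unrolled version actually makes the alignment effect visible is something to measure; the sketch only removes the per-array compare + jump.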