On Sandy Bridge/Ivy Bridge, 256b AVX loads and stores are cracked into two 128b operations [as Peter Cordes notes, these aren't quite µops, but the operation occupies the port for two cycles] in the load/store execution units, so there's no reason to expect the version using those instructions to be much faster.
Why is it slower? Two possibilities come to mind:
First, for base + index + offset addressing, the latency of a 128b load is 6 cycles, whereas the latency of a 256b load is 7 cycles (Table 2-8 in the Intel Optimization Manual). Although your benchmark should be bound by throughput rather than latency, the longer latency means that the processor takes longer to recover from any hiccups (pipeline bubbles, branch mispredictions, interrupt servicing, ...), and that does have some impact.
Second, in Section 11.6.2 of the same document, Intel suggests that the penalty for crossing a cache line or page boundary may be larger for 256b loads than for 128b loads. If your loads are not all 32-byte aligned, this may also explain the slowdown you are seeing with the 256b load/store operations:
Example 11-12 shows two implementations for SAXPY with unaligned addresses. Alternative 1 uses 32-byte loads and alternative 2 uses 16-byte loads. These code samples are executed with two source buffers, src1, src2, at a 4-byte offset from 32-byte alignment, and a destination buffer, DST, that is 32-byte aligned. Using two 16-byte memory operations in lieu of 32-byte memory access performs faster.