The MSC compiler has never supported more than 4 bytes of alignment for parameters on the x86 stack, and there is no workaround.
You can verify this yourself by compiling,
struct A { __declspec(align(4)) int x; };
void foo(A a) {}
versus,
// won't compile (error C2719), alignment guarantee can't be fulfilled
struct A { __declspec(align(8)) int x; };
void foo(A a) {}
versus,
// __m128d is naturally 16-byte aligned, so again this won't compile
struct A { __m128d x; };
void foo(A a) {}
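As a rough, portable way to see which of these structs exceed the limit, the sketch below swaps in standard `alignas` for `__declspec(align(...))` and a 16-byte-aligned pair of doubles for `__m128d`. The names `A4`, `A8`, `A16`, `kX86StackParamAlign`, and `fits_x86_stack_param` are made up for illustration, and the 4-byte limit is simply the one stated above, not something the code queries from the compiler.

```cpp
#include <cassert>
#include <cstddef>

// Portable stand-ins for the three structs above (assumption: alignas
// models __declspec(align(N)) closely enough for this comparison).
struct A4  { alignas(4)  int x; };
struct A8  { alignas(8)  int x; };
struct A16 { alignas(16) double v[2]; };  // same size and alignment as __m128d

// Assumed limit: 32-bit MSC aligns by-value stack parameters to 4 bytes.
constexpr std::size_t kX86StackParamAlign = 4;

// True if T's required alignment fits the assumed stack parameter alignment.
template <typename T>
constexpr bool fits_x86_stack_param() {
    return alignof(T) <= kX86StackParamAlign;
}

static_assert(fits_x86_stack_param<A4>(),   "4-byte aligned struct fits");
static_assert(!fits_x86_stack_param<A8>(),  "8-byte aligned struct does not fit");
static_assert(!fits_x86_stack_param<A16>(), "16-byte aligned struct does not fit");
```

Only the first struct stays within the limit, which matches the one case that compiles as a by-value parameter.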
MSC excuses itself with the following statement from its documentation,
"You cannot specify alignment for function parameters."
And you cannot specify the alignment because the MSC authors wanted to reserve the freedom to decide on it themselves,
The x86 compiler uses a different method for aligning the stack. By default, the stack is 4-byte aligned. Although this is space efficient, you can see that there are some data types that need to be 8-byte aligned, and that, in order to get good performance, 16-byte alignment is sometimes needed. The compiler can determine, on some occasions, that dynamic 8-byte stack alignment would be beneficial—notably when there are double values on the stack.
The compiler does this in two ways. First, the compiler can use link-time code generation (LTCG), when specified by the user at compile and link time, to generate the call-tree for the complete program. With this, it can determine regions of the call-tree where 8-byte stack alignment would be beneficial, and it determines call-sites where the dynamic stack alignment gets the best payoff. The second way is used when the function has doubles on the stack, but, for whatever reason, has not yet been 8-byte aligned. The compiler applies a heuristic (which improves with each iteration of the compiler) to determine whether the function should be dynamically 8-byte aligned.
Thus as long as you use MSC with the 32-bit platform toolset, this issue is unavoidable.
The x64 ABI has been explicit about the alignment, defining that non-trivial structures or structures over certain sizes are passed as a pointer parameter. This is elaborated in Section 3.2.3 of the ABI, and MSC had to implement this to be compatible with the ABI.
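To make the "passed as a pointer parameter" idea concrete, here is a hand-written sketch of what a by-value call conceptually turns into under such a convention. It is not compiler output; `Vec2d`, `sum_via_pointer`, and `call_sum` are hypothetical names, and the 16-byte struct only stands in for `__m128d`.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a 16-byte aligned aggregate such as a struct holding __m128d.
struct alignas(16) Vec2d { double v[2]; };

// What the callee effectively receives: a pointer to a properly aligned copy,
// so no alignment demand is placed on the parameter area itself.
double sum_via_pointer(const Vec2d* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % alignof(Vec2d) == 0);
    return p->v[0] + p->v[1];
}

// What the caller effectively does for a by-value `sum(Vec2d value)`:
// make an aligned temporary in its own frame and pass its address.
double call_sum(const Vec2d& value) {
    Vec2d temp = value;             // local copy, aligned to 16 by alignas
    return sum_via_pointer(&temp);  // only a pointer crosses the call
}
```

Because only an 8-byte pointer is passed, the alignment of the struct becomes the caller's problem to solve in its own frame, which is why the x86 restriction does not carry over.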
Path 1: Use another Windows compiler toolchain: GCC or ICC.
Path 2: Move to a 64-bit platform MSC toolset
Path 3: Reduce your use cases to std::atomic<T> with T=__m128d, because it will be possible to skip the stack and pass the variable in an XMM register directly.