Question

In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

// b is a 64-byte aligned array of double
__m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
__m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
__m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
__m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);

I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU actually issues 4 separate loads from the L1 cache. Although at this point I'm not limited by L1 latency or bandwidth, I'm very interested to know if there is a way to load the 4 doubles with one 256-bit load (or two 128-bit loads) and shuffle them into 4 YMM registers. I looked through the Intel Intrinsics Guide but couldn't find a way to accomplish the shuffling required. Is that possible?

(If the premise that the CPU doesn't combine the 4 consecutive loads is actually wrong, please let me know.)


Solution 2

In my matrix multiplication code I only have to use the broadcast once per kernel, but if you really want to load four doubles in one instruction and then broadcast them to four registers, you can do it like this:

#include <stdio.h>
#include <immintrin.h>

int main() {
    double in[] = {1,2,3,4};
    double out[4];
    __m256d x4 = _mm256_loadu_pd(in);                   // {1,2,3,4}
    __m256d t1 = _mm256_permute2f128_pd(x4, x4, 0x0);   // {1,2,1,2}
    __m256d t2 = _mm256_permute2f128_pd(x4, x4, 0x11);  // {3,4,3,4}
    __m256d broad1 = _mm256_permute_pd(t1,0);           // {1,1,1,1}
    __m256d broad2 = _mm256_permute_pd(t1,0xf);         // {2,2,2,2}
    __m256d broad3 = _mm256_permute_pd(t2,0);           // {3,3,3,3}
    __m256d broad4 = _mm256_permute_pd(t2,0xf);         // {4,4,4,4}

    _mm256_storeu_pd(out,broad1);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out,broad2);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out,broad3);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out,broad4);   
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
}

Edit: Here is another solution based on Paul R's suggestion.

__m256d t1 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256d t2 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);

Other tips

TL;DR: It's almost always best to just do four broadcast-loads using _mm256_set1_pd(). This is very good on Haswell and later, where vbroadcastsd ymm,[mem] doesn't require an ALU shuffle uop, and it's usually also the best option for Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction).

It also means you don't need to care about alignment at all, beyond natural alignment for a double.

The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.


(It's usually best to use set1(x), not _mm256_broadcast_sd(&x); if the AVX2-only register-source form of vbroadcastsd isn't available, the compiler can choose to store -> broadcast-load or to do two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
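
As a concrete sketch of that recommendation, here are the four broadcasts from the question rewritten with set1 (reusing the same b and k, and assuming <immintrin.h> is included):

// Each line compiles to one vbroadcastsd ymm, [mem] on Haswell and later
// (handled entirely by a load port); on SnB/IvB it's a load + port-5 shuffle,
// same cost as _mm256_broadcast_sd.
__m256d b0 = _mm256_set1_pd(b[4*k+0]);
__m256d b1 = _mm256_set1_pd(b[4*k+1]);
__m256d b2 = _mm256_set1_pd(b[4*k+2]);
__m256d b3 = _mm256_set1_pd(b[4*k+3]);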

If you're really bottlenecked on load-port resource-conflicts or throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (vbroadcastf128/_mm256_broadcast_pd) and two in-lane shuffles (vpermilpd or vunpckl/hpd (_mm256_shuffle_pd)).

Or with AVX2: load 32B and use 4 _mm256_permute4x64_pd shuffles to broadcast each element into a separate vector.
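
A minimal sketch of that AVX2 variant, again with the question's b and k (each immediate replicates one of the four elements):

__m256d v  = _mm256_loadu_pd(&b[4*k]);
__m256d b0 = _mm256_permute4x64_pd(v, 0x00);   // element 0 in all four lanes
__m256d b1 = _mm256_permute4x64_pd(v, 0x55);   // element 1
__m256d b2 = _mm256_permute4x64_pd(v, 0xaa);   // element 2
__m256d b3 = _mm256_permute4x64_pd(v, 0xff);   // element 3

As the Haswell asm below notes, compilers should turn the 0x00 case into vbroadcastsd ymm, xmm.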


Source Agner Fog's insn tables (and microarch pdf):

Intel Haswell and later:

vbroadcastsd ymm,[mem] and other broadcast-load insns are single-uop instructions that are handled entirely by a load port (the broadcast happens "for free").

The total cost of doing four broadcast-loads this way is 4 instructions. Fused domain: 4 uops. Unfused domain: 4 uops for p2/p3. Throughput: two vectors per cycle.

Haswell only has one shuffle unit, on port5. Doing all your broadcast-loads with load+shuffle will bottleneck on p5.

Maximum broadcast throughput is probably with a mix of vbroadcastsd ymm,m64 and shuffles:

## Haswell maximum broadcast throughput with AVX1
vbroadcastsd    ymm0, [rsi]
vbroadcastsd    ymm1, [rsi+8]
vbroadcastf128  ymm2, [rsi+16]     # p23 only on Haswell, also p5 on SnB/IvB
vunpckhpd       ymm3, ymm2,ymm2
vunpcklpd       ymm2, ymm2,ymm2
vbroadcastsd    ymm4, [rsi+32]     # or vaddpd ymm0, [rdx+something]
#add             rsi, 40

Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.

AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.

## Haswell maximum broadcast throughput with AVX2
vmovups         ymm3, [rsi]
vbroadcastsd    ymm0, xmm3         # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
vpermpd         ymm1, ymm3, 0b01_01_01_01   # NASM syntax for 0x55
vpermpd         ymm2, ymm3, 0b10_10_10_10
vpermpd         ymm3, ymm3, 0b11_11_11_11
vbroadcastsd    ymm4, [rsi+32]
vbroadcastsd    ymm5, [rsi+40]
vbroadcastsd    ymm6, [rsi+48]
vbroadcastsd    ymm7, [rsi+56]
vbroadcastsd    ymm8, [rsi+64]
vbroadcastsd    ymm9, [rsi+72]
vbroadcastsd    ymm10,[rsi+80]    # or vaddpd ymm0, [rdx + whatever]
#add             rsi, 88

AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused-domain: 12 uops, leaving 2 uops per 4 cycles beyond this.

I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like vbroadcastf128.


Intel SnB/IvB:

vbroadcastsd ymm, m64 is 2 fused-domain uops: p5 (shuffle) and p23 (load).

vbroadcastss xmm, m32 and movddup xmm, m64 are single-uop load-port-only. Interestingly, vmovddup ymm, m256 is also a single-uop load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits for unaligned 32B-loads, though. gcc defaults to using movups / vinsertf128 for unaligned 32B loads with -mtune=sandybridge / -mtune=ivybridge.

4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (which are also needed for store-address generation). A bank conflict only happens when two loads in the same cycle hit the same bank in two different cache lines.

Multiple 2-uop instructions with no other instructions between is the worst case for the decoders if the uop-cache is cold, but a good compiler would mix in single-uop instructions between them.

SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since vinsertf128 ymm,ymm,xmm,i takes a p5 shuffle uop.

## Sandybridge maximum broadcast throughput: AVX1
vbroadcastsd    ymm0, [rsi]
add             rsi, 8

One vector per clock, saturating p5 but only using half the capacity of p23.

We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:

vbroadcastf128  ymm2, [rsi+16]     # 2 uops: p23 + p5 on SnB/IvB
vunpckhpd       ymm3, ymm2,ymm2    # 1 uop: p5
vunpcklpd       ymm2, ymm2,ymm2    # 1 uop: p5

Doing a 32B load and unpacking it with 2x vperm2f128 -> 4x vunpckh/lpd might help if stores are part of what's competing for p23.
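
In intrinsics, that idea would look roughly like this (untested sketch, same b and k as the question):

__m256d v     = _mm256_loadu_pd(&b[4*k]);
__m256d v0101 = _mm256_permute2f128_pd(v, v, 0x00);   // low 128b lane in both halves
__m256d v2323 = _mm256_permute2f128_pd(v, v, 0x11);   // high 128b lane in both halves
__m256d b0 = _mm256_unpacklo_pd(v0101, v0101);
__m256d b1 = _mm256_unpackhi_pd(v0101, v0101);
__m256d b2 = _mm256_unpacklo_pd(v2323, v2323);
__m256d b3 = _mm256_unpackhi_pd(v2323, v2323);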

Here is a variant built upon Z Boson's original answer (before edit), using two 128-bit loads instead of one 256-bit load.

__m256d b01 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+0]));
__m256d b23 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+2]));
__m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);
__m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);
__m256d b0000 = _mm256_permute_pd(b0101, 0);
__m256d b1111 = _mm256_permute_pd(b0101, 0xf);
__m256d b2222 = _mm256_permute_pd(b2323, 0);
__m256d b3333 = _mm256_permute_pd(b2323, 0xf);

In my case this is slightly faster than using one 256-bit load, possibly because the first permute can start before the second 128-bit load completes.


Edit: gcc compiles the two loads and the first 2 permutes into

vmovapd (%rdi),%xmm8
vmovapd 0x10(%rdi),%xmm4
vperm2f128 $0x0,%ymm8,%ymm8,%ymm1
vperm2f128 $0x0,%ymm4,%ymm4,%ymm2

Paul R's suggestion of using _mm256_broadcast_pd() can be written as:

__m256d b0101 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256d b2323 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);

which compiles into

vbroadcastf128 (%rdi),%ymm6
vbroadcastf128 0x10(%rdi),%ymm11

and is faster than doing two vmovapd+vperm2f128 (tested).

In my code, which is bound by vector execution ports instead of L1 cache accesses, this is still slightly slower than 4 _mm256_broadcast_sd(), but I imagine that L1 bandwidth-constrained code can benefit greatly from this.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow