What is the difference between MOVDQA and MOVNTDQA, and between VMOVDQA and VMOVNTDQ, for WB/WC-marked regions?

StackOverflow https://stackoverflow.com/questions/19035677


Question

What is the main difference between these instructions when using memory marked as WB (write back) versus WC (write combine): what is the difference between MOVDQA and MOVNTDQA, and what is the difference between VMOVDQA and VMOVNTDQ?

Is it right that for memory marked as WC, the [NT] instructions are no different from the usual ones (without [NT]), and that for memory marked WB, the [NT] instructions work with it as if it were WC memory?


Solution

Note: This answer primarily discusses NT stores. Peter's answer is more comprehensive.


You would typically use the NT (non-temporal) instructions when writing to memory-mapped I/O (e.g. a GPU) where the memory is strictly uncacheable and is always accessed directly.

With regular reads and writes the CPU will try to cache data and write out larger blocks to main memory when it needs to. With uncacheable regions (such as MMIO) the writes have to go directly to memory, and the CPU will not try to cache them. Using an NT instruction hints to the CPU that you are probably streaming a large amount of data (e.g. to a frame buffer), and it will try to combine those writes when it can fill an entire cache line.

The "non-temporal" part means that you're telling the CPU that you don't intend for the write to happen immediately but that it can be delayed, within reason, until enough NT instructions have been issued to fill the cache line.

As far as I understand, you can also use the NT instructions with regular write-back memory: the CPU will not attempt to cache those writes, but will still attempt to stream them out when it can fill a line. In the case of writing to WB memory, I'd say the application would be pretty specialized, and you would need to know that you can do a better job than the CPU at managing its cache. Also, the write is not going to happen immediately, so anything reading back afterwards could read stale data until the combined write is executed. You need to manage this with SFENCE instructions if you need to flush any outstanding combined writes.
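
For illustration, here's a minimal sketch (the function name, buffer, and size assumptions are mine, not from the answer) of streaming NT stores into an ordinary WB buffer with SSE intrinsics, with an SFENCE afterwards to flush any outstanding combined writes:

    #include <immintrin.h>
    #include <stddef.h>

    /* Fill a large buffer with NT stores. The buffer is ordinary (WB) memory;
     * the NT hint makes the stores bypass the cache and combine in line fill
     * buffers instead. Assumes dst is 16-byte aligned and count % 4 == 0. */
    void fill_streaming(float *dst, float value, size_t count)
    {
        __m128 v = _mm_set1_ps(value);
        for (size_t i = 0; i < count; i += 4)
            _mm_stream_ps(dst + i, v);   /* movntps: non-temporal store */
        _mm_sfence();                    /* flush outstanding combined writes */
    }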

Other tips

Beware of processor errata when using the non-temporal instructions, though, if you need them to be ordered against locked instructions or memory barriers (e.g. LOCK ADD, MFENCE).

Errata HSD162, BDM116 and SKL079 apply; refer to the Haswell/Broadwell/Skylake specification updates. Basically, a non-temporal MOVNTDQA load from WC memory can pass an earlier LOCKed instruction on Haswell/Broadwell, and you must use MFENCE to fix it. On Skylake it is broken the other way, so a non-temporal MOVNTDQA from WC memory can pass an earlier MFENCE, and the fix is to update the Skylake microcode...
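
As a rough sketch of the Haswell/Broadwell workaround described above (the function is illustrative and the __sync builtin is a GCC/Clang-specific choice, not from the errata text), the MFENCE goes between the locked instruction and the later NT load from WC memory:

    #include <immintrin.h>

    /* On Haswell/Broadwell, an MFENCE between a locked instruction and a later
     * NT load from WC memory keeps the load from passing the locked op
     * (errata HSD162 / BDM116). wc_src must be 16-byte aligned. */
    __m128i load_after_lock(volatile long *flag, void *wc_src)
    {
        __sync_fetch_and_add(flag, 1);                    /* emits a LOCKed RMW */
        _mm_mfence();                                     /* workaround fence */
        return _mm_stream_load_si128((__m128i *)wc_src);  /* movntdqa */
    }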

NT stores are useful on large blocks of WB memory

NT stores movntps / movntdq / etc (and their AVX forms vmovntps etc.) work well on WB memory, treating it like WC memory, overriding the memory-ordering semantics of the region and bypassing cache, building up a full 64-byte chunk of data in an LFB to send to memory when it's fully written. (But still maintaining cache-coherency with other cores.) And yes, normal stores on WC memory work like that, too.

If the LFB is evicted early, before it holds a full line of writes, it has to do a partial update of a DDR SDRAM block when the write request reaches a memory controller. The DRAM burst size is 64 bytes, the same as the cache-line size; that's not a coincidence.
(SSE2 maskmovdqu has an NT hint (unlike AVX vmaskmovps and so on), and causes the same problem; maybe it was efficient on early single-core CPUs and could get the memory controller to use byte-masking for writes, but it's just slow now.)
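
A minimal copy-loop sketch (the names and the alignment/size assumptions are mine) that issues four consecutive 16-byte NT stores per 64-byte chunk, so each line fill buffer is completely written before it's sent to memory and partial-line writes are avoided:

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy whole 64-byte chunks with NT stores. Assumes src/dst are 16-byte
     * aligned and bytes is a multiple of 64. */
    void copy_nt(char *dst, const char *src, size_t bytes)
    {
        for (size_t i = 0; i < bytes; i += 64) {
            __m128i a = _mm_load_si128((const __m128i *)(src + i));
            __m128i b = _mm_load_si128((const __m128i *)(src + i + 16));
            __m128i c = _mm_load_si128((const __m128i *)(src + i + 32));
            __m128i d = _mm_load_si128((const __m128i *)(src + i + 48));
            _mm_stream_si128((__m128i *)(dst + i),      a);
            _mm_stream_si128((__m128i *)(dst + i + 16), b);
            _mm_stream_si128((__m128i *)(dst + i + 32), c);
            _mm_stream_si128((__m128i *)(dst + i + 48), d);
        }
        _mm_sfence();   /* order the NT stores before any later normal stores */
    }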

If you want NT stores ordered wrt. normal stores, use sfence (_mm_sfence) after you're done with streaming (NT) stores to a big buffer, before a normal store of a flag or pointer that other cores might read. If you don't care about the order other cores see your NT stores in (because your code is single-threaded), that's unnecessary; the current core always sees its own stores in program order, even NT stores. And they will eventually make it to a memory-mapped file or whatever.
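
A hedged producer-side sketch of that pattern (the buffer, flag, and sizes are hypothetical): NT stores to the buffer, then sfence, then a normal store to a flag that other cores poll:

    #include <immintrin.h>
    #include <stddef.h>

    /* Fill buf with NT stores, then publish it. The sfence orders the
     * weakly-ordered NT stores before the flag store, so a reader that sees
     * ready != 0 also sees the completed buffer contents. Assumes buf is
     * 16-byte aligned; readers should use an acquire load on ready. */
    void publish(int *buf, size_t n, volatile int *ready)
    {
        __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i + 4 <= n; i += 4)
            _mm_stream_si128((__m128i *)(buf + i), zero);
        _mm_sfence();    /* make NT stores globally visible before the flag */
        *ready = 1;      /* normal store of the flag */
    }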


NT loads are quite different

The SSE4.1 NT load instruction, movntdqa, is only special on WC memory. On WB memory on existing CPUs, it's the same as movdqa, just a 16-byte alignment-required load, but costing an extra uop. (Same goes for the vmovntdqa AVX form for 16 or 32-byte operations.) The NT load hint is ignored on current CPUs, and the instruction is not architecturally allowed to override the memory-ordering semantics; WB memory is strongly ordered, only WC is weakly ordered allowing load-load reordering.

Perhaps that's because loads without HW prefetching would normally be disastrous, and HW prefetch only knows how to do normal prefetches, not NT prefetches like prefetchnta. prefetchnta minimizes cache pollution by bypassing L3 if possible, or, on CPUs with an inclusive L3 cache (client CPUs, and Xeon before SKX), by using only a single "way" in each set; it also bypasses L2 while prefetching into L1d, unless you're actually prefetching from WC memory. From WC memory, NT prefetch can actually prefetch into an LFB, IIRC. (NT loads from WC memory load into an LFB rather than cache, and later loads from the same line can pull data from it, if I'm remembering correctly.) See Difference between PREFETCH and PREFETCHNTA instructions for more details about SW prefetches.
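
A small illustrative sketch of prefetchnta via _mm_prefetch with _MM_HINT_NTA ahead of a read-once pass (the prefetch distance and stride are arbitrary, untuned values, not recommendations):

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum an array we only read once, prefetching ahead with the NTA hint to
     * reduce cache pollution. 256 bytes ahead, one prefetch per 64-byte line. */
    float sum_once(const float *a, size_t n)
    {
        float total = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if (i % 16 == 0)
                _mm_prefetch((const char *)(a + i) + 256, _MM_HINT_NTA);
            total += a[i];
        }
        return total;
    }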

Intel's whitepaper about copying from video RAM to main memory has some examples and details: https://web.archive.org/web/20120918010837/http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
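
Along those lines, here's a minimal sketch (names and assumptions are mine, not from the whitepaper) of copying from a WC-mapped region into ordinary memory with _mm_stream_load_si128 (movntdqa); it requires SSE4.1 and 16-byte-aligned pointers. IIRC the whitepaper stages the copy through a small cacheable bounce buffer; this sketch only shows the streaming-load side:

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy from a WC-mapped region (e.g. video RAM) into a normal WB buffer.
     * On WC memory the streaming-load hint lets the load fill from a line
     * fill buffer instead of going through the cache. Assumes 16-byte-aligned
     * pointers and bytes % 16 == 0. */
    void copy_from_wc(void *dst, void *wc_src, size_t bytes)
    {
        char *d = (char *)dst;
        char *s = (char *)wc_src;
        for (size_t i = 0; i < bytes; i += 16) {
            __m128i v = _mm_stream_load_si128((__m128i *)(s + i));  /* movntdqa */
            _mm_store_si128((__m128i *)(d + i), v);
        }
    }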


Regular loads from WC memory (like movdqu / movdqa or plain integer mov) do in theory allow load speculation, but Dr. McCalpin reports that on Sandybridge at least, you don't actually get much if any memory-level parallelism.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow