Where is the loop-carried dependency here?

Question 1

Let's just say there's more than just a couple of things preventing this loop from vectorizing...

Consider this:

int main(){
    byte  *source = new byte[1000];
    DWORD *dest   = new DWORD[1000];

    for (int i = 0; i < 200; ++i) {
        dest[(i*2*4096)+0] = (source[(i*8)+0]);
    }
    for (int i = 0; i < 200; ++i) {
        dest[i*2*4096] = source[i*8];
    }
    for (int i = 0; i < 200; ++i) {
        dest[i*8192] = source[i*8];
    }
    for (int i = 0; i < 200; ++i) {
        dest[i] = source[i];
    }
}

Compiler Output:

main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'

Let's break this down:

The first two loops are the same. So they give the original reason 1200 which is the loop-carried dependency.
The 3rd loop is the same as the 2nd loop. Yet the compiler gives a different reason 1203:

Loop body includes non-contiguous accesses into an array

Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:

Loop contains a non-vectorizable conversion operation (may be implicit)

So VC++ doesn't isn't smart enough to issue the SSE4.1 pmovzxbd instruction.

That's a pretty niche case, I wouldn't have expected any modern compiler to be able to do this. And if it could, you'd need to specify SSE4.1.

So the only thing that's out-of-the ordinary is why the initial loop reports a loop-carried dependency.
Well, that's a tough call... I would go so far to say that the compiler just isn't issuing the correct reason. (When it really should be non-contiguous access.)

Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. Your original loop has accesses grouped in chunks of 4 - which makes it contiguous enough to vectorize. But it's a long-shot to expect the compiler to be able to recognize that.

So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().

Your original loop:

for (int i = 0; i < count; ++i) {
    dest[(i*2*pitch)+0] = (source[(i*8)+0]);
    dest[(i*2*pitch)+1] = (source[(i*8)+1]);
    dest[(i*2*pitch)+2] = (source[(i*8)+2]);
    dest[(i*2*pitch)+3] = (source[(i*8)+3]);

    dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
    dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
    dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
    dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}

vectorizes as follows:

for (int i = 0; i < count; ++i) {
    __m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8));
    __m128i s1 = _mm_unpackhi_epi64(s0,s0);

    *(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0);
    *(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}

Disclaimer: This is untested and ignores alignment.

Question 2

From the MSDN documentation, a case in which an error 1203 would be reported

void code_1203(int *A)
{
    // Code 1203 is emitted when non-vectorizable memory references
    // are present in the loop body. Vectorization of some non-contiguous 
    // memory access is supported - for example, the gather/scatter pattern.

    for (int i=0; i<1000; ++i)
    {
        A[i] += A[0] + 1;       // constant memory access not vectorized
        A[i] += A[i*2+2] + 2;  // non-contiguous memory access not vectorized
    }
}

It really could be the computations at the indexes that mess with the auto-vectorizer. Funny the error code shown is not 1203, though.

MSDN Parallelizer and Vectorizer Messages