Question

Why in the world was _mm_crc32_u64(...) defined like this?

unsigned __int64 _mm_crc32_u64( unsigned __int64 crc, unsigned __int64 v );

The "crc32" instruction always accumulates a 32-bit CRC, never a 64-bit CRC (It is, after all, CRC32 not CRC64). If the machine instruction CRC32 happens to have a 64-bit destination operand, the upper 32 bits are ignored, and filled with 0's on completion, so there is NO use to EVER have a 64-bit destination. I understand why Intel allowed a 64-bit destination operand on the instruction (for uniformity), but if I want to process data quickly, I want a source operand as large as possible (i.e. 64-bits if I have that much data left, smaller for the tail ends) and always a 32-bit destination operand. But the intrinsics don't allow a 64-bit source and 32-bit destination. Note the other intrinsics:

unsigned int _mm_crc32_u8 ( unsigned int crc, unsigned char v ); 

The type of "crc" is not an 8-bit type, nor is the return type, they are 32-bits. Why is there no

unsigned int _mm_crc32_u64 ( unsigned int crc, unsigned __int64 v );

? The Intel instruction supports this, and that is the intrinsic that makes the most sense.

Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks. My guess is something like this:

#define CRC32(D32,S) __asm__("crc32 %0, %1" : "+xrm" (D32) : ">xrm" (S))

for GCC, and

#define CRC32(D32,S) __asm { crc32 D32, S }

for Visual Studio. Unfortunately I have little understanding of how constraints work, and little experience with the syntax and semantics of assembly-level programming.
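After some reading of the GCC documentation, something along these lines might be closer (unverified, and the constraints are guesswork: the "r" constraint assumes the destination has to be a general-purpose register, crc32q forces the 64-bit form of the instruction, and uint32/uint64 are the same typedefs used in my macros below):

static inline uint32 crc32_u64_asm (uint32 crc, uint64 v)
{
    uint64 tmp = crc;                                    // zero-extend the running CRC into a 64-bit register
    __asm__ ("crc32q %1, %0" : "+r" (tmp) : "rm" (v));   // 64-bit source operand, 64-bit destination register
    return (uint32) tmp;                                 // upper 32 bits are always zero after the instruction
}

The instruction still works on a 64-bit register internally, but the wrapper exposes only the 32-bit CRC, which is what I actually want.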

Small edit: note the macros I've defined:

#define GET_INT64(P) *(reinterpret_cast<const uint64* &>(P))++
#define GET_INT32(P) *(reinterpret_cast<const uint32* &>(P))++
#define GET_INT16(P) *(reinterpret_cast<const uint16* &>(P))++
#define GET_INT8(P)  *(reinterpret_cast<const uint8 * &>(P))++


#define DO1_HW(CR,P) CR =  _mm_crc32_u8 (CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR =  _mm_crc32_u16(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR =  _mm_crc32_u32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = (_mm_crc32_u64((uint64)CR, GET_INT64(P))) & 0xFFFFFFFF;

Notice how different the last macro statement is. The lack of uniformity is certainly an indication that the intrinsic has not been defined sensibly. While it is not necessary to put in the explicit (uint64) cast in the last macro, it is implicit and does happen. Disassembling the generated code shows code for both casts 32->64 and 64->32, both of which are unnecessary.

Put another way, it's _mm_crc32_u64, not _mm_crc64_u64, but they've implemented it as if it were the latter.

If I could get the definition of CRC32 above correct, then I would want to change my macros to

#define DO1_HW(CR,P) CR = CRC32(CR, GET_INT8 (P))
#define DO2_HW(CR,P) CR = CRC32(CR, GET_INT16(P))
#define DO4_HW(CR,P) CR = CRC32(CR, GET_INT32(P))
#define DO8_HW(CR,P) CR = CRC32(CR, GET_INT64(P))
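If inline assembly turns out to be too fragile to do portably, a fallback I'm considering is a set of overloaded inline functions built on the existing intrinsics. This is only a sketch (the name CRC32 and the use of my uint8..uint64 typedefs are mine, and on MSVC the 64-bit intrinsic exists only in 64-bit builds); it merely hides, rather than removes, the 32<->64 conversions the disassembly shows:

#include <nmmintrin.h>   // SSE4.2 CRC32 intrinsics (MSVC, and GCC with -msse4.2)

inline uint32 CRC32 (uint32 crc, uint8  v) { return _mm_crc32_u8 (crc, v); }
inline uint32 CRC32 (uint32 crc, uint16 v) { return _mm_crc32_u16(crc, v); }
inline uint32 CRC32 (uint32 crc, uint32 v) { return _mm_crc32_u32(crc, v); }
inline uint32 CRC32 (uint32 crc, uint64 v)
{
    // the intrinsic takes and returns 64 bits, but only the low 32 are meaningful
    return (uint32) _mm_crc32_u64 (crc, v);
}

With these, the four DO*_HW macros above stay uniform.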

Solution 2

Does anyone have portable code (Visual Studio and GCC) to implement the latter intrinsic? Thanks.

My friend and I wrote a C++ SSE intrinsics wrapper which contains the preferred usage of the crc32 instruction with a 64-bit source.

http://code.google.com/p/sse-intrinsics/

See the i_crc32() function. (Sadly, there are even more flaws in Intel's SSE intrinsic specifications for other instructions; see this page for more examples of flawed intrinsic design.)

OTHER TIPS

The 4 intrinsic functions provided really do allow all possible uses of the Intel-defined CRC32 instruction. The instruction always produces a 32-bit result because it is hard-coded to use a specific 32-bit CRC polynomial. However, the instruction allows your code to feed it input data 8, 16, 32, or 64 bits at a time. Processing 64 bits at a time should maximize throughput. Processing 32 bits at a time is the best you can do if restricted to a 32-bit build. Processing 8 or 16 bits at a time can simplify your code logic if the input byte count is odd or not a multiple of 4/8. A sketch that combines the 64-bit and 8-bit forms for a buffer of arbitrary length follows the example below.

#include <stdio.h>
#include <stdint.h>
#include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics; on MSVC <intrin.h> also works */

int main (int argc, char *argv [])
    {
    int index;
    uint8_t *data8;
    uint16_t *data16;
    uint32_t *data32;
    uint64_t *data64;
    uint32_t total1, total2, total3;
    uint64_t total4;
    uint64_t input [] = {0x1122334455667788, 0x1111222233334444};

    total1 = total2 = total3 = total4 = 0;
    data8  = (void *) input;
    data16 = (void *) input;
    data32 = (void *) input;
    data64 = (void *) input;

    for (index = 0; index < sizeof input / sizeof *data8; index++)
        total1 = _mm_crc32_u8 (total1, *data8++);

    for (index = 0; index < sizeof input / sizeof *data16; index++)
        total2 = _mm_crc32_u16 (total2, *data16++);

    for (index = 0; index < sizeof input / sizeof *data32; index++)
        total3 = _mm_crc32_u32 (total3, *data32++);

    for (index = 0; index < sizeof input / sizeof *data64; index++)
        total4 = _mm_crc32_u64 (total4, *data64++);

    printf ("CRC32 result using 8-bit chunks: %08X\n", total1);
    printf ("CRC32 result using 16-bit chunks: %08X\n", total2);
    printf ("CRC32 result using 32-bit chunks: %08X\n", total3);
    printf ("CRC32 result using 64-bit chunks: %08X\n", total4);
    return 0;
    }
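For input whose length is not a multiple of 8 bytes, the wider and narrower forms can be combined. Here is a rough sketch of that (my own, untested; it reuses the includes above plus <string.h> for memcpy):

#include <string.h>

uint32_t crc32_buffer (const void *buf, size_t len, uint32_t crc)
    {
    const uint8_t *p = buf;

    while (len >= 8)                          /* bulk of the data: 64-bit chunks */
        {
        uint64_t chunk;
        memcpy (&chunk, p, 8);                /* safe for unaligned input        */
        crc = (uint32_t) _mm_crc32_u64 (crc, chunk);
        p += 8;
        len -= 8;
        }

    while (len--)                             /* odd tail bytes, one at a time   */
        crc = _mm_crc32_u8 (crc, *p++);

    return crc;
    }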
Licensed under: CC-BY-SA with attribution