I found a solution myself, so here it is for the record:
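#include <stdint.h>
#include <string.h>

// Carry-less (GF(2)[x]) multiplication of A and B; C must hold 2*t words.
void f2m_intel_mult(
    uint32_t t, // length of arrays A and B
    uint32_t *A,
    uint32_t *B,
    uint32_t *C
)
{
    memset(C, 0, 2*t*sizeof(uint32_t));
    // View of the 64-bit product as two 32-bit words (x86 is little-endian).
    union{ uint64_t val; struct{uint32_t low; uint32_t high;} halfs;} prod;
    uint32_t i;
    uint32_t j;
    for(i=0; i<t; i++){
        for(j=0; j<t; j++){
            prod.halfs.low = A[i];
            prod.halfs.high = 0;
            // The immediate 0 selects the low 64 bits of both operands. The "i"
            // constraint needs a compile-time constant, so pass a literal.
            // B[j] is widened so the upper half of the 64-bit operand is zero.
            asm ("pclmulqdq %2, %1, %0;"
                : "+x"(prod.val)
                : "x"((uint64_t)B[j]), "i"(0)
            );
            C[i+j] = C[i+j] ^ prod.halfs.low;
            C[i+j+1] = C[i+j+1] ^ prod.halfs.high;
        }
    }
}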
I think pclmulqdq can also be fed full 64-bit operands (it multiplies two 64-bit values into a 128-bit product), but I couldn't figure out how to get that working with inline assembly. Does anybody know how?
Alternatively, the same multiplication can be written with intrinsics. (If you want the code, just ask.)
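For reference, an intrinsics version might look roughly like the sketch below. It uses `_mm_clmulepi64_si128` from `<wmmintrin.h>` and keeps the same contract as the function above. The name `f2m_intel_mult_intrin` is made up for this example, and the `target` attribute is a GCC/Clang-specific assumption so that the file builds without `-mpclmul`.

```c
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_cvtsi32_si128, _mm_storel_epi64 */
#include <wmmintrin.h>  /* PCLMUL: _mm_clmulepi64_si128 */

/* Hypothetical name; same contract as f2m_intel_mult: C must hold 2*t words.
   The target attribute is a GCC/Clang extension (assumption). */
__attribute__((target("sse2,pclmul")))
void f2m_intel_mult_intrin(uint32_t t, const uint32_t *A,
                           const uint32_t *B, uint32_t *C)
{
    memset(C, 0, sizeof(uint32_t) * 2 * t);
    for (uint32_t i = 0; i < t; i++) {
        /* Put A[i] in the low 32 bits; the rest of the register is zeroed. */
        __m128i a = _mm_cvtsi32_si128((int)A[i]);
        for (uint32_t j = 0; j < t; j++) {
            __m128i b = _mm_cvtsi32_si128((int)B[j]);
            /* Selector 0x00: low 64-bit lane of a times low 64-bit lane of b. */
            __m128i p = _mm_clmulepi64_si128(a, b, 0x00);
            uint64_t prod;
            _mm_storel_epi64((__m128i *)&prod, p);  /* low 64 bits of p */
            C[i + j]     ^= (uint32_t)prod;
            C[i + j + 1] ^= (uint32_t)(prod >> 32);
        }
    }
}
```

With other compilers, drop the attribute and enable PCLMUL through the compiler's own switch. Since the intrinsic works on 64-bit lanes, switching the arrays to `uint64_t` words and reading back both halves of the 128-bit result would be the natural next step.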
The calculation can also be optimized further with Karatsuba multiplication if the array size t is known in advance.
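To illustrate the Karatsuba idea, here is a minimal sketch for the fixed size t = 2. It needs three word multiplications instead of four, because in GF(2)[x] the middle term a0*b1 + a1*b0 equals (a0+a1)(b0+b1) + a0*b0 + a1*b1, with + being XOR. Both function names are made up for this example, and `clmul32` is only a portable shift-and-XOR stand-in for a single pclmulqdq so the snippet runs anywhere.

```c
#include <stdint.h>

/* Portable stand-in for one pclmulqdq on 32-bit inputs (assumption:
   in real code this would be the instruction or the intrinsic). */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int k = 0; k < 32; k++)
        if ((b >> k) & 1u)
            r ^= (uint64_t)a << k;
    return r;
}

/* Hypothetical helper: one Karatsuba level for t = 2.
   A and B hold 2 words each, C receives 4 words. */
void f2m_karatsuba2(const uint32_t A[2], const uint32_t B[2], uint32_t C[4])
{
    uint64_t lo  = clmul32(A[0], B[0]);                /* a0*b0 */
    uint64_t hi  = clmul32(A[1], B[1]);                /* a1*b1 */
    uint64_t mid = clmul32(A[0] ^ A[1], B[0] ^ B[1]);  /* (a0+a1)(b0+b1) */
    mid ^= lo ^ hi;                                    /* a0*b1 + a1*b0 */

    /* Result = lo + mid*x^32 + hi*x^64, combined with XOR. */
    C[0] = (uint32_t)lo;
    C[1] = (uint32_t)(lo >> 32) ^ (uint32_t)mid;
    C[2] = (uint32_t)hi ^ (uint32_t)(mid >> 32);
    C[3] = (uint32_t)(hi >> 32);
}
```

For larger t the same split can be applied recursively to the two halves of each array. Whether it pays off depends on t, since the extra XORs compete with the multiplications saved.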