Using MATLAB R2013a on a Tesla C2070, I see this:
A = gpuArray.randn(1000);
tic; [l,u,p]=lu(A); toc
Elapsed time is 0.016663 seconds.
That is about 2x faster than the same factorization on my CPU. As the matrix size increases, the speedup grows, peaking at about 5x on my machine. This is typical when comparing a high-end (albeit slightly old) GPU with a decent 6-core CPU.
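If you want to reproduce the comparison across sizes, something like the following rough sketch may help. The matrix sizes and variable names (`sizes`, `speedup`, `tCpu`, `tGpu`) are my own illustrative choices, not the exact benchmark I ran. It calls `wait` on the device before reading the timer, since GPU work can still be in flight when `toc` is reached. The first GPU call may also include some one-time startup cost, so treat the smallest size with caution.

```matlab
% Sketch: compare CPU and GPU LU timings over a few matrix sizes.
% The sizes below are assumptions chosen for illustration only.
sizes   = [1000 2000 4000];
speedup = zeros(size(sizes));
gpu     = gpuDevice;                   % handle to the currently selected GPU

for k = 1:numel(sizes)
    n    = sizes(k);
    Acpu = randn(n);                   % matrix in host memory
    Agpu = gpuArray(Acpu);             % same data copied to the GPU

    % Time the factorization on the CPU
    tic; [l, u, p] = lu(Acpu); tCpu = toc;

    % Time it on the GPU, synchronizing before starting and before stopping
    wait(gpu);
    tic; [l, u, p] = lu(Agpu); wait(gpu); tGpu = toc;

    speedup(k) = tCpu / tGpu;
    fprintf('n = %d: CPU %.4f s, GPU %.4f s, speedup %.1fx\n', ...
            n, tCpu, tGpu, speedup(k));
end
```

The numbers you get will depend heavily on your particular CPU and GPU, and on whether you use double or single precision.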