It seems like you are being very roundabout here. Won't this do the same thing?
output = np.frombuffer(data,'b').reshape(-1,3)[:,1:].flatten().view('i2')
This would save some time by not zero-filling a temporary array, skipping the bitshift, and avoiding some unnecessary data moves. I haven't actually benchmarked it yet, though, and I expect the savings to be modest.
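To make the slicing concrete, here is a minimal sketch on three hand-built samples. The sample values and the `samples`/`expected` names are mine, chosen for illustration, and the result assumes a little-endian machine:

```python
import numpy as np

# Three signed 24-bit samples (illustrative values, not from the question).
samples = [0x123456, -1, 0x000100]

# Pack them as consecutive little-endian 3-byte integers.
data = b"".join(s.to_bytes(3, "little", signed=True) for s in samples)

# View the bytes as rows of 3, drop the least significant byte of each row,
# copy the remaining bytes into a contiguous buffer, and reinterpret each
# pair of bytes as one native-endian int16.
output = np.frombuffer(data, 'b').reshape(-1, 3)[:, 1:].flatten().view('i2')

# Dropping the low byte is the same as an arithmetic right shift by 8 bits.
expected = [s >> 8 for s in samples]
print(output.tolist(), expected)
```

The `flatten()` call is what makes the final `view('i2')` legal: the sliced array is not contiguous, so it has to be copied before its bytes can be reinterpreted.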
Edit: I have now performed the benchmark. For len(data) of 12 million, I get 80 ms for your version and 39 ms for mine, so pretty much exactly a factor of 2 speedup. Not a very big improvement, as expected, but then your starting point was already pretty fast.
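If you want to repeat the measurement, something along these lines should work. Note that `padded_shift` is only my guess at a pad-and-shift approach of the kind described above, not the question's actual code, and the random test data and function names are made up for this sketch. Timings will differ between machines, and a little-endian CPU is assumed:

```python
import timeit
import numpy as np

def padded_shift(data):
    # Assumed stand-in for the roundabout approach: copy each 3-byte sample
    # into the upper bytes of a zero-filled 4-byte slot, then shift down.
    raw = np.frombuffer(data, 'b').reshape(-1, 3)
    padded = np.zeros((raw.shape[0], 4), dtype='b')
    padded[:, 1:] = raw
    return (padded.view('i4').ravel() >> 16).astype('i2')

def sliced(data):
    # The one-liner from this answer.
    return np.frombuffer(data, 'b').reshape(-1, 3)[:, 1:].flatten().view('i2')

# 12 million random bytes, i.e. 4 million 24-bit samples.
rng = np.random.default_rng(0)
data = rng.integers(0, 256, 12_000_000, dtype=np.uint8).tobytes()

# Both versions must agree before their timings are worth comparing.
same = np.array_equal(padded_shift(data), sliced(data))

for fn in (padded_shift, sliced):
    best = min(timeit.repeat(lambda: fn(data), number=1, repeat=5))
    print(f"{fn.__name__}: {best * 1e3:.1f} ms")
```

Taking the minimum over a few repeats reduces the influence of other activity on the machine.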
Edit2: I should mention that I have assumed little endian here. However, the original question's code is also implicitly assuming little endian, so this is not a new assumption on my part.
(For big endian (data and architecture), you would replace 1: by :-1. If the data had a different endianness than the CPU, then you would also need to reverse the order of the bytes (::-1).)
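The endianness cases can be folded into one helper. This is a sketch only: the name `top16` and its `data_order` parameter are hypothetical, and `sys.byteorder` is used to detect the CPU's byte order:

```python
import sys
import numpy as np

def top16(data, data_order):
    """Keep the two most significant bytes of each 24-bit sample.

    data_order is "little" or "big" and describes the input bytes;
    the result is always a native-endian int16 array.
    """
    raw = np.frombuffer(data, 'b').reshape(-1, 3)
    # Little-endian data keeps its significant bytes last, big-endian first.
    kept = raw[:, 1:] if data_order == "little" else raw[:, :-1]
    # If the data's byte order differs from the CPU's, swap each byte pair.
    if data_order != sys.byteorder:
        kept = kept[:, ::-1]
    return kept.flatten().view('i2')

# Illustrative samples, packed in both byte orders.
samples = [0x123456, -1, 0x000100]
le = b"".join(s.to_bytes(3, "little", signed=True) for s in samples)
be = b"".join(s.to_bytes(3, "big", signed=True) for s in samples)

print(top16(le, "little").tolist(), top16(be, "big").tolist())
```

Both calls should print the same values, since each one yields the samples shifted right by 8 bits.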
Edit3: For even more speed, I think you will have to go outside Python. This Fortran function, which also uses OpenMP, gets me a factor of 2+ speedup compared to my version (so 4+ times faster than yours).
subroutine f(a,b)
    implicit none
    integer*1, intent(in)  :: a(:)
    integer*1, intent(out) :: b(size(a)*2/3)
    integer :: i
    !$omp parallel do
    do i = 1, size(a)/3
        b(2*(i-1)+1) = a(3*(i-1)+2)
        b(2*(i-1)+2) = a(3*(i-1)+3)
    end do
    !$omp end parallel do
end subroutine
Compile with FOPT="-fopenmp" f2py -c -m basj{,.f90} -lgomp. You can then import and use it in Python:
import numpy as np
import basj
def mine2(data): return basj.f(np.frombuffer(data,'b')).view('i2')
You can control the number of cores to use via the environment variable OMP_NUM_THREADS, but it defaults to using all available cores.