Speed up loading 24-bit binary data into 16-bit numpy array

Question 1

It seems like you are being very roundabout here. Won't this do the same thing?

output = np.frombuffer(data,'b').reshape(-1,3)[:,1:].flatten().view('i2')

This would save some time from not zero-filling a temporary array, skipping the bitshift and avoiding some unneceessary data moves. I haven't actually benchmarked it yet, though, and I expect the savings to be modest.

Edit: I have now performed the benchmark. For len(data) of 12 million, I get 80 ms for your version and 39 ms for mine, so pretty much exactly a factor 2 speedup. Not a very big improvement, as expected, but then your starting point was already pretty fast.

Edit2: I should mention that I have assumed little endian here. However, the original question's code is also implicitly assuming little endian, so this is not a new assumption on my part.

(For big endian (data and architecture), you would replace 1: by :-1. If the data had a different endianness than the CPU, then you would also need to reverse the order of the bytes (::-1).)

Edit3: For even more speed, I think you will have to go outside python. This fortran function, which also uses openMP, gets me a factor 2+ speedup compared to my version (so 4+ times faster than yours).

subroutine f(a,b)
        implicit none
        integer*1, intent(in)  :: a(:)
        integer*1, intent(out) :: b(size(a)*2/3)
        integer :: i
        !$omp parallel do
        do i = 1, size(a)/3
                b(2*(i-1)+1) = a(3*(i-1)+2)
                b(2*(i-1)+2) = a(3*(i-1)+3)
        end do
        !$omp end parallel do
end subroutine

Compile with FOPT="-fopenmp" f2py -c -m basj{,.f90} -lgomp. You can then import and use it in python:

import basj
def convert(data): return def mine2(data): return basj.f(np.frombuffer(data,'b')).view('i2')

You can control the number of cores to use via the environment variavble OMP_NUM_THREADS, but it defaults to using all available cores.

Question 2

Inspired by @amaurea's answer, here is a cython version (I already used cython in my original code, so I'll continue with cython instead of mixing cython + fortran) :

import cython
import numpy as np
cimport numpy as np

def binary24_to_int16(char *data):
    cdef int i
    res = np.zeros(len(data)/3, np.int16)
    b = <char *>((<np.ndarray>res).data)
    for i in range(len(data)/3):
        b[2*i] = data[3*i+1]
        b[2*i+1] = data[3*i+2]
    return res

There is a factor 4 speed gain :)