I have a function that calculates the pairwise correlation between all pairs of rows in a numpy array. It was working fine, but then I remembered that every so often I am going to have to deal with missing data. I used a masked array to try to solve this, but it slows down the computation by quite a bit. Any ideas for working around the masked functions? I think the real bottleneck is the np.ma.dot
function, and I have added some profiling I quickly mocked up with IPython. I should say that 5000 is on the lower end of the spectrum in terms of the number of rows these arrays are going to have; some might have upwards of 300,000. The code with the masked arrays looks about 20 times slower than the code without, but obviously with missing data the code without the masked arrays just produces a NaN every so often.
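The problem in miniature (a tiny illustration, not my real data): a plain dot product goes NaN as soon as one entry is missing, while the masked version effectively skips the missing entry by treating it as zero.

```python
import numpy as np

# One nan poisons a plain dot product; np.ma.dot (default strict=False)
# treats the masked entry as zero instead.
a = np.array([1.0, np.nan, 2.0])
b = np.array([3.0, 4.0, 5.0])

plain = np.dot(a, b)                      # nan
am = np.ma.masked_invalid(a)
masked = np.ma.dot(am, b)                 # 13.0: 1*3 + 0*4 + 2*5

print(plain, masked)
```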
First, a quick and dirty way to generate some sample data:
import numpy as np
from scipy import stats

genotypes = np.empty((5000, 200))
for i in xrange(5000):
    genotypes[i] = np.random.binomial(3, .333, size=200)
pValues = np.random.uniform(0, 1, 5000)
Then the functions to test:
def testMask(pValsArray, genotypeArray):
    nSNPs = len(pValsArray) - 1
    genotypeArray = np.ma.masked_array(genotypeArray, np.isnan(genotypeArray))
    chisq = np.sum(-2 * np.log(pValsArray))

    ms = genotypeArray.mean(axis=1)[:, None]
    datam = genotypeArray - ms
    datass = np.ma.sqrt(np.ma.sum(datam**2, axis=1))
    runningSum = 0
    for i in xrange(nSNPs):
        temp = np.ma.dot(datam[i:], datam[i].T)
        d = (datass[i:] * datass[i])
        rs = temp / d
        rs = np.absolute(rs)[1:]
        runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs)

    sigmaSq = 4*nSNPs + 2*runningSum
    E = 2*nSNPs
    df = (2*(E*E)) / sigmaSq
    runningSum = sigmaSq / (2*E)
    d = chisq / runningSum
    brownsP = stats.chi2.sf(d, df)
    return brownsP
def testNotMask(pValsArray, genotypeArray):
    nSNPs = len(pValsArray) - 1
    chisq = np.sum(-2 * np.log(pValsArray))

    ms = genotypeArray.mean(axis=1)[:, None]
    datam = genotypeArray - ms
    datass = np.sqrt(stats.ss(datam, axis=1))
    runningSum = 0
    for i in xrange(nSNPs):
        temp = np.dot(datam[i:], datam[i].T)
        d = (datass[i:] * datass[i])
        rs = temp / d
        rs = np.absolute(rs)[1:]
        runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs)

    sigmaSq = 4*nSNPs + 2*runningSum
    E = 2*nSNPs
    df = (2*(E*E)) / sigmaSq
    runningSum = sigmaSq / (2*E)
    d = chisq / runningSum
    brownsP = stats.chi2.sf(d, df)
    return brownsP
And some timing
%timeit testMask(pValues, genotypes)
1 loops, best of 3: 14.3 s per loop
%timeit testNotMask(pValues, genotypes)
1 loops, best of 3: 678 ms per loop
Add some missing data and go again:
randis = np.random.randint(0,5000, 10)
randjs = np.random.randint(0,200, 10)
for i, j in zip(randis, randjs):
    genotypes[i, j] = None
%timeit testMask(pValues, genotypes)
1 loops, best of 3: 14.2 s per loop
%timeit testNotMask(pValues, genotypes)
1 loops, best of 3: 654 ms per loop
and some profiling:
%prun
2559677 function calls in 15.045 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
9791 5.259 0.001 5.259 0.001 {method 'copy' of 'numpy.ndarray' objects}
4999 2.877 0.001 11.888 0.002 extras.py:949(dot)
9794 1.566 0.000 1.566 0.000 {numpy.core.multiarray.copyto}
14997 1.497 0.000 1.564 0.000 {numpy.core._dotblas.dot}
30007 0.559 0.000 0.559 0.000 {method 'reduce' of 'numpy.ufunc' objects}
94996 0.450 0.000 0.875 0.000 core.py:2751(_update_from)
864970 0.347 0.000 0.347 0.000 {getattr}
5000 0.346 0.000 0.802 0.000 core.py:1065(__call__)
1 0.240 0.240 15.045 15.045 <ipython-input-115-2aab1c8ea4c5>:1(testMask)
5000 0.196 0.000 0.196 0.000 core.py:771(__call__)
5001 0.147 0.000 0.551 0.000 core.py:917(__call__)
24996 0.143 0.000 0.609 0.000 core.py:2930(__getitem__)
54998 0.140 0.000 0.775 0.000 core.py:2776(__array_finalize__)
419985 0.126 0.000 0.126 0.000 {method 'update' of 'dict' objects}
1 0.093 0.093 0.111 0.111 core.py:5990(power)
339994 0.077 0.000 0.077 0.000 {isinstance}
50015 0.072 0.000 0.072 0.000 {numpy.core.multiarray.array}
60002 0.060 0.000 0.568 0.000 {method 'view' of 'numpy.ndarray' objects}
5000 0.060 0.000 0.199 0.000 core.py:2626(__new__)
14999 0.058 0.000 7.412 0.000 core.py:3341(filled)
25005 0.055 0.000 0.092 0.000 core.py:609(getdata)
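My reading of the profile: ndarray.copy and extras.py:949(dot) dominate because np.ma.dot (with the default strict=False) zero-fills a fresh copy of both operands on every call. Since datam never changes inside the loop, a hypothetical restructuring fills it once up front and runs plain np.dot after that; the shapes below are small stand-ins for the real 5000 x 200 array.

```python
import numpy as np

# np.ma.dot with strict=False (the default) computes np.dot on zero-filled
# copies of its arguments, so filling once outside the loop gives the same
# numbers without the per-call copies the profile is showing.
data = np.random.binomial(3, .333, size=(50, 20)).astype(float)
data[3, 7] = np.nan

dm = np.ma.masked_array(data, np.isnan(data))
datam = dm - dm.mean(axis=1)[:, None]
filled = datam.filled(0)                          # the one-time copy

i = 5
masked_way = np.ma.dot(datam[i:], datam[i].T)     # copies both operands each call
fast_way = np.dot(filled[i:], filled[i].T)        # straight BLAS, no copies

print(np.allclose(np.ma.getdata(masked_way), fast_way))  # prints True
```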
EDIT:
I tried perimosocordiae's answer but I am still getting nans. It looks like the mean, stats.ss and np.sqrt functions all propagate nan values.
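A minimal demo of the propagation: plain np.mean returns nan for any row containing a nan, so the centred row (and everything downstream) is already nan before np.nansum ever gets a chance to skip anything. A masked mean, or a manual nansum-over-count, ignores the missing value (np.nanmean would too, but it only appeared in NumPy 1.8).

```python
import numpy as np

row = np.array([1.0, 2.0, np.nan, 4.0])

plain = np.mean(row)                              # nan - plain mean propagates
masked = np.ma.masked_invalid(row).mean()         # 2.333..., nan ignored
manual = np.nansum(row) / np.sum(~np.isnan(row))  # 2.333..., works on numpy 1.7

print(plain, masked, manual)
```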
def fastNotMask(pValsArray, genotypeArray):
    nSNPs = len(pValsArray) - 1
    chisq = np.sum(-2 * np.log(pValsArray))

    ms = genotypeArray.mean(axis=1)[:, None]
    print ms
    datam = genotypeArray - ms
    print datam
    datass = np.sqrt(stats.ss(datam, axis=1))
    print datass
    runningSum = 0
    for i in xrange(nSNPs):
        temp = np.dot(datam[i:], datam[i].T)
        d = (datass[i:] * datass[i])
        rs = temp / d
        rs = np.absolute(rs)[1:]
        runningSum += 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs)
    print runningSum

    sigmaSq = 4*nSNPs + 2*runningSum
    E = 2*nSNPs
    df = (2*(E*E)) / sigmaSq
    runningSum = sigmaSq / (2*E)
    d = chisq / runningSum
    brownsP = stats.chi2.sf(d, df)
    return brownsP
Testing this with a small amount of output shows that the nans aren't being handled properly.
pValues = np.random.uniform(0, 1, 10)
genotypes = np.empty((10, 10))
for i in xrange(10):
    genotypes[i] = np.random.binomial(2, .5, size=10)
randis = np.random.randint(0, 10, 2)
randjs = np.random.randint(0, 10, 2)
for i, j in zip(randis, randjs):
    genotypes[i, j] = None
print fastNotMask(pValues, genotypes)
[[ 1.5]
[ 1.2]
[ 0.9]
[ 1.2]
[ 1.1]
[ 0.6]
[ nan]
[ 1.1]
[ nan]
[ 0.8]]
[[-0.5 -0.5 0.5 0.5 -0.5 -0.5 0.5 0.5 -0.5 0.5]
[-0.2 0.8 -0.2 -0.2 -0.2 -0.2 -0.2 0.8 -0.2 -0.2]
[-0.9 0.1 -0.9 1.1 0.1 -0.9 1.1 0.1 0.1 0.1]
[-0.2 -0.2 -0.2 -0.2 0.8 -0.2 -0.2 -1.2 0.8 0.8]
[-0.1 -0.1 -0.1 -0.1 -0.1 -0.1 0.9 -0.1 0.9 -1.1]
[-0.6 1.4 0.4 -0.6 -0.6 0.4 -0.6 -0.6 0.4 0.4]
[ nan nan nan nan nan nan nan nan nan nan]
[-0.1 0.9 -1.1 -1.1 0.9 -0.1 0.9 -1.1 0.9 -0.1]
[ nan nan nan nan nan nan nan nan nan nan]
[ 1.2 -0.8 -0.8 0.2 -0.8 0.2 1.2 0.2 0.2 -0.8]]
[ 1.58113883 1.26491106 2.21359436 1.8973666 1.70293864 2.0976177
nan 2.62678511 nan 2.36643191]
nan
nan
Did I miss something here? Could this be a version issue? I am using Python 2.7 and numpy 1.7.1.
Cheers for the help.
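EDIT 2:
Combining the observations above, here is a sketch of the direction I am experimenting with (the name fastMaskSketch is mine, and I have only checked it on small inputs, not the full 300,000-row case): do the nan-aware work once up front with a masked array, fill the centred data with zeros, and run the hot loop on plain ndarrays. Because np.ma.dot with strict=False zero-fills anyway, this should match testMask's numbers at roughly testNotMask's speed. Rows that are entirely missing still produce nan correlations, which np.nansum then skips.

```python
import numpy as np
from scipy import stats

def fastMaskSketch(pValsArray, genotypeArray):
    # nan-aware centring done once; the loop then sees plain ndarrays.
    nSNPs = len(pValsArray) - 1
    chisq = np.sum(-2 * np.log(pValsArray))
    g = np.ma.masked_array(genotypeArray, np.isnan(genotypeArray))
    datam = (g - g.mean(axis=1)[:, None]).filled(0)   # masked entries -> 0
    datass = np.sqrt(np.sum(datam**2, axis=1))
    runningSum = 0
    with np.errstate(invalid='ignore', divide='ignore'):
        for i in range(nSNPs):                        # xrange on Python 2
            rs = np.dot(datam[i:], datam[i].T) / (datass[i:] * datass[i])
            rs = np.absolute(rs)[1:]
            # nansum skips the nans left by all-missing (zero-norm) rows
            runningSum += 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs)
    sigmaSq = 4*nSNPs + 2*runningSum
    E = 2*nSNPs
    df = (2*(E*E)) / sigmaSq
    d = chisq / (sigmaSq / (2*E))
    return stats.chi2.sf(d, df)
```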