Question

I am attempting to organize a large numpy ndarray (sets of ~1mil entries of at most 16 dimensions) into two subgroups by two of the dimensions of the array.

Currently, I'm using itertools' groupby function, but the values it puts in my dictionary are itertools._grouper objects; my ndarray seems to be converted to a grouper object no matter what I do.

While I could write a custom groupby function to solve this, it seems like a fundamental gap in my Python ability (I'm very new to the language) that I don't know how to either prevent this conversion or convert the grouper object back into an ndarray with the correct fields. I need the ndarray because its fields must be preserved for later manipulation.

How would I fix the following code to either convert the returned groupby result fully back into an ndarray or prevent the conversion?

array = np.sort(array, order=['Front','Back','SecStruc'])
front_dict = dict((k,v) for k,v in groupby(array, lambda array : array['Front']))
for key in front_dict:
    front_dict[key] = dict((k,list(v)) for k,v in groupby(front_dict[key], 
    lambda array : front_dict[key]['Back']))

Thanks!

Solution

I think you might be able to use numpy.split for this. You can split an array into sub-arrays by doing something like:

import numpy as np

def findsplit(a):
    diff = a[1:] != a[:-1]
    edges = np.where(diff)[0]
    return edges + 1

array = np.array([0,0,0,1,1,1,1,2,2,3,4,4,4])
s = np.split(array, findsplit(array))
for a in s:
    print(a)
# [0 0 0]
# [1 1 1 1]
# [2 2]
# [3]
# [4 4 4]

To get the nested dictionaries you describe in your question, you could do something like:

byFront = np.split(array, findsplit(array['Front']))
front_dict = {}
for sameFront in byFront:
    back_dict = {}
    byBack = np.split(sameFront, findsplit(sameFront['Back']))
    for sameBack in byBack:
        back_dict[sameBack['Back'][0]] = sameBack
    front_dict[sameFront['Front'][0]] = back_dict
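
As a rough end-to-end check of the approach above, here is a minimal self-contained sketch. The sample records and their values are made up for illustration; only the field names 'Front', 'Back' and 'SecStruc' come from the question. It assumes the array is sorted on the grouping fields first, since findsplit only detects changes between neighbouring entries.

```python
import numpy as np

def findsplit(a):
    """Return the indices at which consecutive values of a sorted 1-D array change."""
    return np.flatnonzero(a[1:] != a[:-1]) + 1

# Hypothetical sample data; only the field names are taken from the question.
dtype = [('Front', 'i4'), ('Back', 'i4'), ('SecStruc', 'U1')]
array = np.array(
    [(2, 1, 'H'), (1, 2, 'E'), (1, 1, 'H'), (1, 2, 'C'), (2, 1, 'E')],
    dtype=dtype,
)

# Sorting puts equal 'Front' (and then 'Back') values next to each other.
array = np.sort(array, order=['Front', 'Back', 'SecStruc'])

front_dict = {}
for same_front in np.split(array, findsplit(array['Front'])):
    back_dict = {}
    for same_back in np.split(same_front, findsplit(same_front['Back'])):
        # Each piece is still a structured ndarray with all of its fields.
        back_dict[int(same_back['Back'][0])] = same_back
    front_dict[int(same_front['Front'][0])] = back_dict

group = front_dict[1][2]
print(type(group).__name__, group.dtype.names)
print(group['SecStruc'])
```

Converting the keys with int() is a choice made for this sketch so that the dictionary keys are plain Python integers; the original loop uses the numpy scalars directly.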

Other tips

Looks like you are almost there. list(v) is a list that can easily be converted to an array.

from itertools import groupby
import numpy as np
x=np.array([0,0,0,1,1,1,1,2,2,3,4,4,4])
{k:np.array(list(v)) for k,v in groupby(x)}

{0: array([0, 0, 0]),
 1: array([1, 1, 1, 1]),
 2: array([2, 2]),
 3: array([3]),
 4: array([4, 4, 4])}
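
The reason list(v) matters here is that each group returned by groupby is an iterator sharing the underlying data with the groupby object, so a group can only be read while it is the current one. The sketch below contrasts storing the groups directly (as the first line of the question's code does) with materializing them right away. The names stale and kept are illustrative only.

```python
from itertools import groupby
import numpy as np

x = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4, 4, 4])

# Storing the group iterators themselves: by the time the dict is built,
# groupby has already moved past the earlier groups.
stale = dict((k, v) for k, v in groupby(x))
print(list(stale[0]))  # nothing left to read for key 0

# Materializing each group while it is still current keeps its data.
kept = {int(k): np.array(list(v)) for k, v in groupby(x)}
print(kept[0])
```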

Or with a 2d array (grouping on the 1st column, and then on the last column).

x=np.array([[0,1,2],[1,2,3],[1,2,4],[1,0,4],[2,3,1]])
d={k:list(v) for k,v in groupby(x,lambda s:s[0])}
print(d)
# {0: [array([0, 1, 2])],
#  1: [array([1, 2, 3]), array([1, 2, 4]), array([1, 0, 4])],
#  2: [array([2, 3, 1])]}
for i in d.keys():
    d[i]={k:np.array(list(v)) for k,v in groupby(list(d[i]),lambda s:s[2])}
print(d)
# {0: {2: array([[0, 1, 2]])},
#  1: {3: array([[1, 2, 3]]), 4: array([[1, 2, 4], [1, 0, 4])},
#  2: {1: array([[2, 3, 1]])}}
print(d[1][4])
#  [[1 2 4]
#   [1 0 4]]

It doesn't matter much whether I use list(v) or np.array(list(v)) at either stage, provided you are only interested in iterating over the first dimension.


Using a structured array adapted from the numpy documentation:

from pprint import pprint
x = np.array([(1.5,2.5,(1.0,2.0)),(1.5,2.5,(2.0,4.0)),(3.,4.,(4.,5.)),(1.,3.,(2.,6.))],
        dtype=[('x','f4'),('y',np.float32),('value','f4',(2,2))])
d={k:list(v) for k,v in groupby(x,lambda s:s['x'])}
for i in d.keys():
    d[i]={k:list(v) for k,v in groupby(list(d[i]),lambda s:s['y'])}
pprint(d)
for dd in d[1.5][2.5]:
    print(dd)
print(d[1.5][2.5][0].dtype)
# [('x', '<f4'), ('y', '<f4'), ('value', '<f4', (2, 2))]
dd = np.array(d[1.5][2.5],dtype=x.dtype)
print(dd)
print(dd.dtype)
print(dd[0])
# (1.5, 2.5, [[1.0, 2.0], [1.0, 2.0]])
print(dd['value'])
# [[[ 1.  2.] [ 1.  2.]]
#  [[ 2.  4.] [ 2.  4.]]]

The structured array character of the 'innermost' elements is preserved. I only need to use np.array(...,dtype=x.dtype) if I want to turn a list of these arrays into one array (e.g. dd).

In d[1.5][2.5][0]['value'], 1.5 and 2.5 are dictionary keys, 0 is a list index, and 'value' is a structured array field name.


But is this use of groupby really needed? I can get that last 'value' with normal numpy indexing. And the 'rows' of x don't have to be sorted. With a very large array, speed and memory use could be important considerations.

I=(x['x']==1.5)&(x['y']==2.5)
print(x[I]['value'])
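
Applied to the field names from the question, the same masking idea might look like the sketch below. The records, the 'score' field and the select helper are hypothetical and only there to make the example runnable. Note that the input is deliberately left unsorted.

```python
import numpy as np

# Hypothetical, unsorted records; 'score' is an invented payload field.
dtype = [('Front', 'i4'), ('Back', 'i4'), ('score', 'f4')]
array = np.array(
    [(2, 1, 0.5), (1, 2, 1.5), (1, 1, 2.5), (1, 2, 3.5)],
    dtype=dtype,
)

def select(arr, front, back):
    """Return the records whose 'Front' and 'Back' fields match (hypothetical helper)."""
    mask = (arr['Front'] == front) & (arr['Back'] == back)
    return arr[mask]  # still a structured ndarray with every field intact

subset = select(array, 1, 2)
print(subset['score'])
```

The trade-off is that every lookup scans the whole array, so this suits a handful of selections better than enumerating every (Front, Back) group of a very large array.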
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow