Question

I don't know the exact technical terms for what I wish to do, so I'll try and demonstrate with an example:

I have two vectors of the same length, a and b, as below:

In [41]:a
Out[41]:
array([ 0.61689215,  0.31368813,  0.47680184, ...,  0.84857976,
    0.97026244,  0.89725481])

In [42]:b
Out[42]:
array([35, 36, 37, ..., 36, 37, 38])

a contains N floating-point numbers, and b contains N keys that take 10 distinct values: 35, 36, 37, ..., 43, 44.

I want a new matrix M with 10 columns, where the first column contains all the values in a whose corresponding key in b is 35, the second column contains all the values in a whose corresponding key in b is 36, and so on up to column 10 in M.

I hope this is clear. Thank you.

Solution 2

You can use pandas:

import numpy as np
import pandas as pd

a = np.random.rand(50)
b = np.random.randint(10, 15, 50)

s = pd.Series(a)
s.groupby(b).apply(pd.Series.reset_index, drop=True).unstack(level=0)

The output is:

          10        11        12        13        14
0   0.465079  0.041393  0.692856  0.634328  0.179690
1   0.934678  0.746048  0.060014  0.072626  0.824729
2   0.388190  0.510527  0.078662  0.077157  0.291183
3   0.972033  0.761159  0.017317  0.104768  0.278871
4   0.750713  0.430246  0.083407  0.262037  0.487742
5   0.216965  0.482364  0.820535  0.207008  0.276452
6   0.282038  0.607303  0.675856  0.994369  0.602059
7   0.897106  0.398808  0.312332  0.751388  0.878177
8   0.229121       NaN       NaN  0.061288  0.032066
9   0.810678       NaN       NaN       NaN  0.718237
10  0.571125       NaN       NaN       NaN  0.668292
11  0.410750       NaN       NaN       NaN  0.288145
12  0.984507       NaN       NaN       NaN       NaN
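
To see what the one-liner does step by step, here is a minimal sketch of an equivalent approach, assuming a reasonably recent pandas: number each value within its group with groupby(...).cumcount(), then pivot so that the keys become columns. The helper name group_into_columns and the column names "a", "b" and "row" are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def group_into_columns(a, b):
    """Return a DataFrame with one column per distinct key in b.

    Hypothetical helper for illustration; shorter groups are padded with NaN.
    """
    df = pd.DataFrame({"a": a, "b": b})
    # Position of each value within its own group: 0, 1, 2, ...
    df["row"] = df.groupby("b").cumcount()
    # Keys become the columns, within-group positions become the index.
    return df.pivot(index="row", columns="b", values="a")


a = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
b = np.array([36, 35, 36, 35, 36])
M = group_into_columns(a, b)
print(M)
```

Calling M.to_numpy() should give a plain NumPy matrix if the column labels are not needed.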

Other suggestions

itertools.groupby can be used to group values (after sorting). Use of numpy arrays is optional.

import numpy as np
import itertools
N=50
# a = np.random.rand(50)*100
a = np.random.randint(0,100,N) # int to make printing more compact
b = np.random.randint(35,45, N)

# make structured array to easily sort both arrays together
dtype = np.dtype([('a',float),('b',int)])
ab = np.ndarray(a.shape,dtype=dtype)
ab['a'] = a
ab['b'] = b
# ab = np.sort(ab,order=['b']) # sorts both 'b' and 'a'
I = np.argsort(b,kind='mergesort') # preserves order
ab = ab[I]

# now group, and extract lists of lists
gp = itertools.groupby(ab, lambda x: x['b'])
xx = [list(x[1]) for x in gp]
#print np.array([[y[0] for y in x] for x in xx]) # list of lists

def filled(x):
    # pad every sublist with NaN up to the length of the longest one
    M = max(len(z) for z in x)
    return np.array([z+[np.nan]*(M-len(z)) for z in x])
print(filled([[y[1] for y in x] for x in xx]).T)  # keys (b)
print(filled([[y[0] for y in x] for x in xx]).T)  # values (a)

producing:

[[ 35.  36.  37.  38.  39.  40.  41.  42.  43.  44.]
 [ 35.  36.  37.  38.  39.  40.  41.  42.  43.  44.]
 [ nan  36.  37.  nan  39.  40.  41.  42.  43.  44.]
 [ nan  36.  37.  nan  39.  40.  41.  42.  43.  44.]
 ...]

[[ 54.  69.  34.  28.  71.  53.  33.  19.  64.  56.]
 [ 90.  52.  11.   9.  50.  53.  25.  37.  69.  56.]
 [ nan  97.  31.  nan  69.  35.   2.  80.  91.  54.]
 [ nan  33.  87.  nan  47.  90.  81.  45.  86.  57.]
 ...]

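I am using argsort with kind='mergesort', which is a stable sort, to preserve the original order of a within the sublists. np.sort(ab, order=['b']) sorts lexically on both b and a (contrary to my expectations with the order parameter), because fields not named in order are still used to break ties.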
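
The stable-sort idea can also be carried through without itertools. The following is a minimal sketch, not the method used above: it assumes a and b are non-empty and of equal length, and the function name columns_by_key is made up for illustration. It uses kind="stable", which, like 'mergesort', keeps equal keys in their original order.

```python
import numpy as np


def columns_by_key(a, b):
    """Group the values of a by the keys in b.

    Hypothetical helper: returns (keys, matrix), where column j of the
    matrix holds the values whose key is keys[j], padded with NaN.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b)
    # A stable sort keeps values with equal keys in their original order.
    order = np.argsort(b, kind="stable")
    keys, counts = np.unique(b, return_counts=True)
    out = np.full((counts.max(), len(keys)), np.nan)
    # Column index of every sorted element ...
    col = np.repeat(np.arange(len(keys)), counts)
    # ... and its row, i.e. its position within its own group.
    starts = np.cumsum(counts) - counts
    row = np.arange(len(b)) - np.repeat(starts, counts)
    out[row, col] = a[order]
    return keys, out


keys, M = columns_by_key([1.0, 2.0, 3.0, 4.0, 5.0], [36, 35, 36, 35, 36])
print(keys)
print(M)
```

Returning keys alongside the matrix keeps track of which column belongs to which key.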

An alternative, using a Python dictionary, also preserves the order of a. It is probably slower on large arrays, but it hides fewer details:

import collections
d = collections.defaultdict(list)
for k,v in zip(b,a):
    d[k].append(v)
values = [d[k] for k in sorted(d.keys())]
print(filled(values).T)

Here's a way to do it without pandas (so you need to track the column labels manually):

import numpy as np
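from itertools import zip_longest  # named izip_longest in Python 2
from collections import defaultdict

a = np.random.rand(50)
b = np.random.randint(10, 15, 50)
d = defaultdict(list)

# collect the values of a under their key from b, in original order
for i, key_val in enumerate(b):
    d[key_val].append(a[i])

# one column per key, in sorted key order; shorter columns are padded with NaN
output = np.asarray(list(zip_longest(*(d[k] for k in sorted(d)),
                                     fillvalue=np.nan)))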

print (a)
print (b)
print (output)

This gives:

a:

array([ 0.98688273,  0.95584584,  0.91011945,  0.56402919,  0.86185936,
        0.09380343,  0.69290659,  0.97238284,  0.81297425,  0.73446398,
        0.25927151,  0.44622982,  0.20537961,  0.61665218,  0.90168399,
        0.58556404,  0.47017152,  0.32278718,  0.15044929,  0.07859976,
        0.26715756,  0.38281878,  0.30169241,  0.47785937,  0.15377038,
        0.93395325,  0.79099068,  0.92471442,  0.03154578,  0.0437627 ,
        0.31711433,  0.78550517,  0.77062104,  0.76002167,  0.1842867 ,
        0.52935392,  0.16038216,  0.46510856,  0.4311615 ,  0.73923847,
        0.45499238,  0.2630405 ,  0.67722848,  0.1391463 ,  0.50800704,
        0.50618842,  0.19540159,  0.38150066,  0.82831838,  0.3383787 ])

b:

array([14, 10, 13, 12, 12, 13, 13, 12, 11, 10, 10, 13, 14, 12, 11, 12, 14,
       12, 12, 14, 11, 10, 13, 13, 13, 10, 14, 11, 13, 11, 11, 11, 12, 10,
       11, 11, 14, 12, 12, 14, 13, 10, 11, 14, 13, 11, 10, 11, 12, 12])

output:

array([[ 0.95584584,  0.81297425,  0.56402919,  0.91011945,  0.98688273],
       [ 0.73446398,  0.90168399,  0.86185936,  0.09380343,  0.20537961],
       [ 0.25927151,  0.26715756,  0.97238284,  0.69290659,  0.47017152],
       [ 0.38281878,  0.92471442,  0.61665218,  0.44622982,  0.07859976],
       [ 0.93395325,  0.0437627 ,  0.58556404,  0.30169241,  0.79099068],
       [ 0.76002167,  0.31711433,  0.32278718,  0.47785937,  0.16038216],
       [ 0.2630405 ,  0.78550517,  0.15044929,  0.15377038,  0.73923847],
       [ 0.19540159,  0.1842867 ,  0.77062104,  0.03154578,  0.1391463 ],
       [        nan,  0.52935392,  0.46510856,  0.45499238,         nan],
       [        nan,  0.67722848,  0.4311615 ,  0.50800704,         nan],
       [        nan,  0.50618842,  0.82831838,         nan,         nan],
       [        nan,  0.38150066,  0.3383787 ,         nan,         nan]])
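
Tracking the column labels manually, as mentioned above, could look like the following minimal sketch. It assumes Python 3 (for zip_longest), and the function name group_with_labels is made up for illustration.

```python
from collections import defaultdict
from itertools import zip_longest

import numpy as np


def group_with_labels(a, b):
    """Hypothetical helper: return (labels, matrix).

    Column j of the matrix holds the values of a whose key in b is
    labels[j]; shorter columns are padded with NaN.
    """
    groups = defaultdict(list)
    for key, value in zip(b, a):
        groups[key].append(value)
    # Fix the column order explicitly instead of relying on dict order.
    labels = sorted(groups)
    rows = zip_longest(*(groups[k] for k in labels), fillvalue=np.nan)
    return labels, np.array(list(rows), dtype=float)


labels, M = group_with_labels([0.5, 0.25, 0.75], [11, 10, 11])
# Look up the column for a given key through the label list.
col = M[:, labels.index(11)]
print(labels)
print(col)
```

Keeping the sorted label list next to the matrix is what lets you find the column for a given key later.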
Licensed under: CC-BY-SA with attribution