Question

I would like to use in Python something akin to -- or better than -- R arrays. R arrays are tensor-like objects with a dimnames attribute, which makes it straightforward to subset tensors by name (strings). In numpy, recarrays allow for column names, and pandas allows flexible and efficient subsetting of 2-dimensional arrays. Is there something in Python that allows similar slicing and subsetting of ndarrays by name (or better, by any object that is hashable and immutable in Python)?


Solution

How about this quick and dirty mapping from lists of strings to indices? You could clean up the notation with callable classes.

import numpy as np

def make_dimnames(names):
    # one {name: index} dict per axis
    return [{n:i for i,n in enumerate(name)} for name in names]
def foo(d, *args):
    # look up the integer index of each name
    return [d[x] for x in args]

A = np.arange(9).reshape(3,3)
dimnames = [('x','y','z'),('a','b','c')]
Adims = make_dimnames(dimnames)
A[foo(Adims[0],'x','z'),foo(Adims[1],'b')]  # A[[0,2],[1]]
A[foo(Adims[0],'x','z'),slice(*foo(Adims[1],'b','c'))]  # A[[0,2],slice(1,2)]

Or does R do something more significant with the dimnames?

A class compresses the syntax a bit:

class bar(object):
    def __init__(self,dimnames):
        self.dd = {n:i for i,n in enumerate(dimnames)}
    def __call__(self,*args):
        return [self.dd[x] for x in args]
    def __getitem__(self,key):
        return self.dd[key]
d0, d1 = bar(['x','y','z']), bar(['a','b','c'])
A[d0('x','z'),slice(*d1('a','c'))]  # A[[0,2],slice(0,2)]

http://docs.scipy.org/doc/numpy/user/basics.subclassing.html covers subclassing ndarray, with a simple example of adding an attribute (which could be dimnames). Presumably extending the indexing to use that attribute shouldn't be hard.
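For reference, here is a minimal sketch of that kind of subclass, following the general pattern in the subclassing guide. The class name LabeledArray and the attribute name dimnames are my own choices for illustration, not anything numpy defines, and the indexing is left untouched here.

```python
import numpy as np

class LabeledArray(np.ndarray):
    """Hypothetical subclass: an ndarray that carries a dimnames attribute."""

    def __new__(cls, data, dimnames=None):
        # view-cast the input to our class, then attach the extra attribute
        obj = np.asarray(data).view(cls)
        obj.dimnames = dimnames
        return obj

    def __array_finalize__(self, obj):
        # runs for explicit construction, view casting and slicing
        if obj is None:
            return
        self.dimnames = getattr(obj, 'dimnames', None)

B = LabeledArray(np.arange(9).reshape(3, 3),
                 dimnames=[('x', 'y', 'z'), ('a', 'b', 'c')])
print(B.dimnames[0])      # ('x', 'y', 'z')
print(B[:2].dimnames[0])  # ('x', 'y', 'z')
```

Note that the attribute is simply copied to slices, so after B[:2] the names no longer match the shape; keeping them in sync would need extra work.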

Inspired by the use of __getitem__ in numpy/index_tricks, I've generalized the indexing:

class DimNames(object):
    def __init__(self, dimnames):
        self.dd = [{n:i for i,n in enumerate(names)} for names in dimnames]
    def __getitem__(self,key):
        if isinstance(key, tuple):
            # one entry per axis; translate each with that axis's dict
            return tuple([self.parse_key(k, self.dd[i]) for i,k in enumerate(key)])
        else:
            return self.parse_key(key, self.dd[0])
    def parse_key(self,key, dd):
        if key is None:
            return key
        if isinstance(key,int):
            return key
        if isinstance(key,str):
            return dd[key]
        if isinstance(key,tuple):
            return tuple([self.parse_key(k, dd) for k in key])
        if isinstance(key,list):
            return [self.parse_key(k, dd) for k in key]
        if isinstance(key,slice):
            return slice(self.parse_key(key.start, dd),
                         self.parse_key(key.stop, dd),
                         self.parse_key(key.step, dd))
        raise KeyError(key)

dd = DimNames([['x','y','z'], ['a','b','c']])

print(A[dd['x']])              # A[0]
print(A[dd['x','c']])          # A[0,2]
print(A[dd['x':'z':2]])        # A[0:2:2]
print(A[dd[['x','z'],:]])      # A[[0,2],:]
print(A[dd[['x','y'],'b':]])   # A[[0,1], 1:]
print(A[dd[:'z', :2]])         # A[:2,:2]

I suppose the next steps would be to subclass ndarray, add dd as an attribute, and override its __getitem__, simplifying the notation to A[['x','z'],'b':].
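A rough sketch of what that could look like is below. NamedArray, _dd and _parse are hypothetical names of mine, and I'm assuming keys contain only names, ints, lists and slices (no Ellipsis or np.newaxis, which would shift the axis positions). To sidestep stale names, indexing returns a plain ndarray rather than another NamedArray.

```python
import numpy as np

class NamedArray(np.ndarray):
    """Hypothetical ndarray subclass that accepts names inside index expressions."""

    def __new__(cls, data, dimnames):
        obj = np.asarray(data).view(cls)
        # one {name: position} dict per axis
        obj._dd = [{n: i for i, n in enumerate(names)} for names in dimnames]
        return obj

    def __array_finalize__(self, obj):
        # inherit the lookup tables when created from another NamedArray
        self._dd = getattr(obj, '_dd', None)

    def _parse(self, key, dd):
        # translate names to integer positions; pass everything else through
        if isinstance(key, str):
            return dd[key]
        if isinstance(key, list):
            return [self._parse(k, dd) for k in key]
        if isinstance(key, slice):
            return slice(self._parse(key.start, dd),
                         self._parse(key.stop, dd),
                         key.step)
        return key

    def __getitem__(self, key):
        if not isinstance(key, tuple):
            key = (key,)
        idx = tuple(self._parse(k, self._dd[i]) for i, k in enumerate(key))
        # index the base-class view, so the result is a plain ndarray
        return np.asarray(self)[idx]

A = NamedArray(np.arange(9).reshape(3, 3),
               [['x', 'y', 'z'], ['a', 'b', 'c']])
print(A[['x', 'z'], 'b':])   # same as indexing the plain array with [[0,2], 1:]
print(A['x', 'c'])           # 2
```

Returning a plain ndarray is the lazy choice; propagating correctly subsetted names to the result is where most of the real work would be.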

Licensed under: CC-BY-SA with attribution