Memoize a costly computation of a data frame

https://stackoverflow.com/questions/19786040

04-07-2022

Question

I have a costly computation that runs on pandas DataFrames, and I'd like to memoize it. I'm trying to figure out what I can use as the cache key.

In [16]: id(pd.DataFrame({1: [1,2,3]}))
Out[16]: 52015696

In [17]: id(pd.DataFrame({1: [1,2,3]}))
Out[17]: 52015504

In [18]: id(pd.DataFrame({1: [1,2,3]}))
Out[18]: 52015504

In [19]: id(pd.DataFrame({1: [1,2,3]})) # different results, won't work for my case
Out[19]: 52015440

In [20]: hash(pd.DataFrame({1: [1,2,3]})) # throws
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-3bddc0b20163> in <module>()
----> 1 hash(pd.DataFrame({1: [1,2,3]}))

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
     52     def __hash__(self):
     53         raise TypeError('{0!r} objects are mutable, thus they cannot be'
---> 54                               ' hashed'.format(self.__class__.__name__))
     55 
     56     def __unicode__(self):

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

Is it possible to do what I want, given that I'm sure that I'm not mutating the DataFrame that gets memoized?

Solution

If you don't need to distinguish frames by their index or column names, you can convert the DataFrame's values to a tuple of tuples, which is hashable:

>>> df1 = pd.DataFrame({1: [1,2,3]})
>>> df2 = pd.DataFrame({1: [1,2,3]})
>>> hash(tuple(tuple(x) for x in df1.values)) == hash(tuple(tuple(x) for x in df2.values))
True
>>> id(df1) == id(df2)
False

You can also use the map function instead of a generator expression:

tuple(map(tuple, df1.values))
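To connect this back to the question, here is a minimal sketch of a memoizing decorator that uses the tuple-of-tuples as its cache key. The names frame_key, memoize_frame and costly_sum are made up for illustration, and the sketch assumes the decorated function takes a single DataFrame that is not mutated afterwards.

```python
import functools

import pandas as pd


def frame_key(df):
    """Hashable snapshot of the values only (index and column names are ignored)."""
    return tuple(map(tuple, df.values))


def memoize_frame(func):
    """Cache func(df) results, keyed by the DataFrame's values (hypothetical helper)."""
    cache = {}

    @functools.wraps(func)
    def wrapper(df):
        key = frame_key(df)
        if key not in cache:
            cache[key] = func(df)  # run the costly computation only on a cache miss
        return cache[key]

    return wrapper


calls = []  # records every real (non-cached) computation


@memoize_frame
def costly_sum(df):
    calls.append(1)
    return int(df.values.sum())


first = costly_sum(pd.DataFrame({1: [1, 2, 3]}))
second = costly_sum(pd.DataFrame({1: [1, 2, 3]}))  # equal values, served from the cache
```

Building the key still walks the whole frame, so this only pays off when the computation is much more expensive than one pass over the data. Frames containing NaN will probably miss the cache, because NaN does not compare equal to itself.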

If you need to compare the index too, you can add it as a column. You can also keep the column names by building a namedtuple for each row:

>>> from collections import namedtuple
>>> from pprint import pprint
>>> df = pd.DataFrame({1: [1,2,3], 2:[3,4,5]})
>>> df['index'] = df.index
>>> df
   1  2  index
0  1  3      0
1  2  4      1
2  3  5      2
>>>
>>> dfr = namedtuple('row', map(lambda x: 'col_' + str(x), df.columns))
>>> res = tuple(map(lambda x: dfr(*x), df.values))
>>> pprint(res)
(row(col_1=1, col_2=3, col_index=0),
 row(col_1=2, col_2=4, col_index=1),
 row(col_1=3, col_2=5, col_index=2))
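One caveat: namedtuples compare and hash like plain tuples, so the field names make the rows easier to read but do not take part in the key. If column names and the index should distinguish two frames, a possible variation is to put them into the key explicitly. The function name full_key below is made up for illustration, and the sketch assumes the column and index labels are themselves hashable.

```python
import pandas as pd


def full_key(df):
    """Hashable key built from column labels, index labels and values (hypothetical helper)."""
    return (
        tuple(df.columns),            # column labels
        tuple(df.index),              # index labels
        tuple(map(tuple, df.values)), # row values, as in the answer above
    )


a = pd.DataFrame({1: [1, 2, 3]})
b = pd.DataFrame({1: [1, 2, 3]})
c = pd.DataFrame({2: [1, 2, 3]})  # same values as a, different column name

same = full_key(a) == full_key(b)
different = full_key(a) != full_key(c)
```

With this key, a and b share a cache entry while c gets its own, even though all three hold the same values.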

Hope it helps.

Licensed under: CC-BY-SA with attribution