Optimising Pandas multi-index lookup

https://stackoverflow.com/questions/22451240

15-06-2023

Question

I use pandas 0.12.0. Say multi_df is a pandas DataFrame with a MultiIndex, and I have a (long) list of tuples (index keys) named look_up_list. I want to perform an operation for each tuple in look_up_list that is present in multi_df's index.

Below is my code. Is there a faster way to achieve this? In reality, len(multi_df) and len(look_up_list) are quite large, so I need to optimise this line: [multi_df.ix[idx]**2 for idx in look_up_list if idx in multi_df.index].

In particular, line_profiler tells me that the conditional check, if idx in multi_df.index, takes a long time.

import pandas as pd
df = pd.DataFrame({'id' : range(1,9),
                    'code' : ['one', 'one', 'two', 'three',
                                'two', 'three', 'one', 'two'],
                    'colour': ['black', 'white','white','white',
                            'black', 'black', 'white', 'white'],
                    'texture': ['soft', 'soft', 'hard','soft','hard',
                                        'hard','hard','hard'],
                    'shape': ['round', 'triangular', 'triangular','triangular','square',
                                        'triangular','round','triangular']
                    },  columns= ['id','code','colour', 'texture', 'shape'])
multi_df = df.set_index(['code','colour','texture','shape']).sort_index()['id']

# define the list of index keys that I want to look up in multi_df
look_up_list = [('two', 'white', 'hard', 'triangular'),('five', 'black', 'hard', 'square'),('four', 'black', 'hard', 'round') ] 
# run a list comprehension
[multi_df.ix[idx]**2 for idx in look_up_list if idx in multi_df.index]

P.S: The actual operation in the list comprehension is not multi_df.ix[idx]**2, but something like: a_slow_function(multi_df.ix[idx]).

Solution

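Perhaps use multi_df.loc[look_up_list].dropna(). This looks up all the tuples in a single call instead of testing them one at a time; keys that are not in the index come back as NaN in this pandas version, and dropna() removes them.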

import pandas as pd
df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']
     }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

# define the list of index keys that I want to look up in multi_df
look_up_list = [('two', 'white', 'hard', 'triangular'), (
    'five', 'black', 'hard', 'square'), ('four', 'black', 'hard', 'round')]

subdf = multi_df.loc[look_up_list].dropna()
print(subdf ** 2)

yields

(two, white, hard, triangular)     9
(two, white, hard, triangular)    64
Name: id, dtype: float64
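
The same selection can also be sketched with a boolean mask built by MultiIndex.isin, which does not depend on how .loc treats labels that are missing from the index (newer pandas releases are stricter about that than 0.12 was). This is only an illustrative alternative, not part of the original answer; it assumes a pandas version in which MultiIndex.isin accepts a list of tuples.

```python
import pandas as pd

df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']
     }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

look_up_list = [('two', 'white', 'hard', 'triangular'),
                ('five', 'black', 'hard', 'square'),
                ('four', 'black', 'hard', 'round')]

# Boolean array: True where the row's full index tuple is in look_up_list.
mask = multi_df.index.isin(look_up_list)

# Keep only the matching rows; no NaN rows are created, so dropna() is
# not needed and the integer dtype is preserved.
subdf = multi_df[mask]
print(subdf ** 2)
```

Because no missing rows are introduced, the values should stay integers rather than being upcast to float64 as in the output above.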

Note:

multi_df as defined above is a Series, not a DataFrame. I don't think that affects the solution, though.
The code you posted above raises IndexingError: Too many indexers, so I'm guessing (a little bit) at the intention of the code.
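
If the real operation is a slow function rather than squaring, as the question's P.S. mentions, one possible approach is to filter to the matching rows first and then call the function once per matching key. The sketch below is an assumption about the intent: the body of a_slow_function is a hypothetical stand-in (the question does not show it), and grouping on all index levels is just one way to collect the rows that share a key.

```python
import pandas as pd

def a_slow_function(values):
    # Hypothetical stand-in for the expensive operation in the question;
    # here it just sums the squares of the values it receives.
    return (values ** 2).sum()

df = pd.DataFrame(
    {'id': range(1, 9),
     'code': ['one', 'one', 'two', 'three',
              'two', 'three', 'one', 'two'],
     'colour': ['black', 'white', 'white', 'white',
                'black', 'black', 'white', 'white'],
     'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                 'hard', 'hard', 'hard'],
     'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
               'triangular', 'round', 'triangular']
     }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(
    ['code', 'colour', 'texture', 'shape']).sort_index()['id']

look_up_list = [('two', 'white', 'hard', 'triangular'),
                ('five', 'black', 'hard', 'square'),
                ('four', 'black', 'hard', 'round')]

# Keep only the rows whose full index tuple appears in look_up_list.
matched = multi_df[multi_df.index.isin(look_up_list)]

# Group the remaining rows by every index level, so each group holds all
# rows for one key, then call the slow function once per key.
all_levels = list(range(matched.index.nlevels))
results = {key: a_slow_function(group)
           for key, group in matched.groupby(level=all_levels)}
print(results)
```

This way the slow function runs only for keys that actually exist in the index, and the per-tuple membership check in Python is avoided.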

Licensed under: CC-BY-SA with attribution