Question

I have written some code to do an Excel-style VLOOKUP across two pandas DataFrames and want to speed it up.

The structure of the data frames is as follows: dbase1_df.columns:
'VALUE', 'COUNT', 'GRID', 'SGO10GEO'

merged_df.columns:
'GRID', 'ST0', 'ST1', 'ST2', 'ST3', 'ST4', 'ST5', 'ST6', 'ST7', 'ST8', 'ST9', 'ST10'

sgo_df.columns:
'mkey', 'type'

To combine them, I do the following:

1. For each row in dbase1_df, find the row in sgo_df whose 'mkey' value matches the row's 'SGO10GEO' value, and obtain the 'type' from that row of sgo_df.

2. 'type' contains an integer ranging from 0 to 10. Create a column name by prepending 'ST' to it.

3. Find the value in merged_df where its 'GRID' value matches the 'GRID' value in dbase1_df and the column is the one obtained in step 2. Output this value into a CSV file.
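On toy data (hypothetical values, just to illustrate the lookup), the three steps above look like this for a single row:

```python
import pandas as pd

# Hypothetical miniature frames mirroring the column layout described above
dbase1_df = pd.DataFrame({'VALUE': [7], 'COUNT': [1], 'GRID': [100], 'SGO10GEO': [555]})
sgo_df = pd.DataFrame({'mkey': [555], 'type': [3]})
merged_df = pd.DataFrame({'GRID': [100], 'ST3': [42.0]})

row = dbase1_df.iloc[0]
# Step 1: match SGO10GEO against mkey to obtain the type
t = sgo_df.loc[sgo_df['mkey'] == row['SGO10GEO'], 'type'].iloc[0]
# Step 2: build the column name from the type
col = 'ST' + str(t)  # -> 'ST3'
# Step 3: look up that column in merged_df at the matching GRID
val = merged_df.loc[merged_df['GRID'] == row['GRID'], col].iloc[0]  # -> 42.0
```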

# Read dbase1 dbf into a DataFrame
dbase1_df = pandas.read_csv(dbase1_file, index_col=False)
merged_df = pandas.read_csv('merged.csv', index_col=False)

lup_out.writerow(["VALUE", "TYPE", EXTRACT_VAR.upper()])

# For each row in the dbase1 DataFrame:
for index, row in dbase1_df.iterrows():
    # 1. Find the soil type corresponding to the mkey
    tmp = sgo_df.type.values[sgo_df['mkey'] == int(row['SGO10GEO'])]
    if tmp.size > 0:
        s_type = 'ST' + str(tmp[0])
        val = int(row['VALUE'])

        # 2. Obtain the hmu value
        tmp_val = merged_df[s_type].values[merged_df['GRID'] == int(row['GRID'])]
        if tmp_val.size > 0:
            hmu_val = tmp_val[0]
            # 3. Output: VALUE, type, hmu value
            lup_out.writerow([val, s_type, hmu_val])
        else:
            err_out.writerow([merged_df['GRID'], s_type, row['GRID']])

Is there anything here that might be a speed bottleneck? Currently it takes around 20 minutes for ~500,000 rows in dbase1_df, ~1,000 rows in merged_df, and ~500,000 rows in sgo_df.

thanks!


Solution

You need to use the merge operation in pandas to get better performance. I'm not able to test the code below since I don't have the data, but at minimum it should help you get the idea:

import pandas as pd

dbase1_df = pd.read_csv('dbase1_file.csv', index_col=False)
sgo_df = pd.read_csv('sgo_df.csv', index_col=False)
merged_df = pd.read_csv('merged_df.csv', index_col=False)

# To merge in pandas, the common columns must share a name,
# so rename 'SGO10GEO' to 'mkey'
dbase1_df.columns = [u'VALUE', u'COUNT', u'GRID', u'mkey']

# Merge the two DataFrames on the shared 'mkey' column
Step1_Merge = pd.merge(dbase1_df, sgo_df)

# Add a new column concatenating 'ST' and type
Step1_Merge['type_2'] = Step1_Merge['type'].map(lambda x: 'ST' + str(x))

# Reshape merged_df so the STn columns become rows,
# which makes another merge possible
grid_id = merged_df.loc[:, ['GRID']]
a = pd.merge(merged_df.stack(0).reset_index(1), grid_id,
             left_index=True, right_index=True)

# Rename the automatically generated column to 'type_2'
# so the next merge can match on it
a.columns = [u'type_2', 0, u'GRID']

result = pd.merge(Step1_Merge, a, on=[u'type_2', u'GRID'])
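As a variant of the reshaping step, `melt` turns the STn columns into rows directly, without the stack/rename dance. A minimal sketch on hypothetical toy data (the values are made up, only the column layout matches the question):

```python
import pandas as pd

# Hypothetical toy frames; dbase1_df already renamed so 'SGO10GEO' is 'mkey'
dbase1_df = pd.DataFrame({'VALUE': [7, 8], 'COUNT': [1, 1],
                          'GRID': [100, 200], 'mkey': [555, 556]})
sgo_df = pd.DataFrame({'mkey': [555, 556], 'type': [3, 0]})
merged_df = pd.DataFrame({'GRID': [100, 200],
                          'ST0': [1.0, 2.0], 'ST3': [42.0, 43.0]})

# Step 1: attach the type to each dbase1 row, then build the column name
step1 = pd.merge(dbase1_df, sgo_df, on='mkey')
step1['type_2'] = 'ST' + step1['type'].astype(str)

# Step 2: melt merged_df from wide to long, one row per (GRID, STn) pair
long_df = merged_df.melt(id_vars='GRID', var_name='type_2', value_name='hmu')

# Step 3: a single merge replaces the per-row lookups
result = pd.merge(step1, long_df, on=['type_2', 'GRID'])
```

Here `result` carries one row per dbase1 row, with the looked-up value in the 'hmu' column, so writing the CSV becomes a single `result.to_csv(...)` call instead of a Python loop.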