Question

I am trying to merge >=2 files with the same schema.
The files will contain duplicate entries but rows won't be identical, for example:

file1:
store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111

file2:
store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282

Expected output:
9191,9827 Park st Apt82,999999999
8181,543 Hello st,1111111111
7171,912 John st,87282728282

If you noticed : 9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999 are similar based on store_id and phone but I picked it up from file2 since the address was more descriptive.

store_id+phone_number was my composite primary key to lookup a location and find duplicates (store_id is enough to find it in the above example but I need a key based on multiple column values)

Question:
- I need to merge multiple CSV files with same schema but with duplicate rows.
- Where the row level merge should have the logic to pick a specific value of a row based on its value. Like phone picked up from file1 and address pickedup from file2.
- A combination of 1 or many column values will define if rows are duplicate or not.

Can this be achieved using pandas?

Was it helpful?

Solution

One way to smash them together is to use merge (on store_id and number, if these are the index then this would be a join rather than a merge):

In [11]: res = df1.merge(df2, on=['store_id', 'phone'], how='outer')

In [12]: res
Out[12]:
   store_id     address_x        phone           address_y
0      9191  9827 Park st    999999999  9827 Park st Apt82
1      8181  543 Hello st   1111111111                 NaN
2      7171           NaN  87282728282         912 John st

You can then use where to select address_y if it exists, otherwise address_x:

In [13]: res['address'] = res.address_y.where(res.address_y, res.address_x)

In [14]: del res['address_x'], res['address_y']

In [15]: res
Out[15]: 
   store_id        phone             address
0      9191    999999999  9827 Park st Apt82
1      8181   1111111111        543 Hello st
2      7171  87282728282         912 John st

OTHER TIPS

How about use concat, groupby, agg, then you can write a agg function to choose the right value:

import pandas as pd
import io

t1 = """store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111"""

t2 = """store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282"""

df1 = pd.read_csv(io.BytesIO(t1))
df2 = pd.read_csv(io.BytesIO(t2))

df = pd.concat([df1, df2]).reset_index(drop=True)

def f(s):
    loc = s.str.len().idxmax()
    return s[loc]

df.groupby(["store_id", "phone"]).agg(f)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top