Question

I'm trying to drop rows of a dataframe based on whether they are duplicates, and always keep the more recent of the rows. This would be simple using df.drop_duplicates(), however I also need to apply a timedelta. The row is to be considered a duplicate if the EndDate column is less than 182 days earlier than that of another row with the same ID.

This table shows the rows that I need to drop in the Duplicate column.

   ID   EndDate             Duplicate
0  A    2008-07-31 00:00:00 True
1  A    2008-09-30 00:00:00 False
2  A    2009-07-31 00:00:00 False
3  A    2010-03-31 00:00:00 False
4  B    2008-07-31 00:00:00 False
5  B    2009-05-31 00:00:00 True
6  B    2009-07-31 00:00:00 False
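
For context, the table above corresponds to a DataFrame along these lines (the Duplicate column is the desired output, not part of the input):

import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'EndDate': pd.to_datetime(['2008-07-31', '2008-09-30', '2009-07-31',
                               '2010-03-31', '2008-07-31', '2009-05-31',
                               '2009-07-31']),
})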

The input data is not sorted, but it seems that the right approach is to sort by ID and by EndDate and then test each row against the next row. I think I can do this by looping through the rows, but the dataset is relatively large, so is there a more efficient way of doing this in pandas?

Solution

I've managed to get the following code to work, but I'm sure it could be improved.

import datetime

df = df.sort_values(['ID', 'EndDate'])
# A row is a duplicate if the next row has the same ID and an EndDate less than 182 days later,
# so the earlier of the two rows is the one that gets dropped.
df['Duplicate'] = (df['EndDate'].shift(-1) - df['EndDate']) < datetime.timedelta(days=182)
df['Duplicate'] = df['Duplicate'] & (df['ID'].shift(-1) == df['ID'])
df = df[~df['Duplicate']]
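
As a variant, here is a sketch (assuming EndDate is already a datetime64 column) that lets groupby handle the ID boundaries instead of the explicit shift on ID:

import pandas as pd

df = df.sort_values(['ID', 'EndDate'])
# Gap in days to the next EndDate within the same ID; NaT for the last row of each ID.
gap_to_next = df.groupby('ID')['EndDate'].diff(-1).abs()
# Keep a row unless the next row of the same ID is less than 182 days later.
df = df[~(gap_to_next < pd.Timedelta(days=182))]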