pandas reindex a dataframe with duplicate keys

https://stackoverflow.com/questions/14868329

09-03-2022

Question

Here is an example of the problem:

>>> from datetime import datetime, timedelta
>>> from pandas import DataFrame
>>> df = DataFrame({'a': [1, 2]}, index=[datetime.today(), datetime.today() + timedelta(days=1)])
>>> df
                            a
2013-02-15 09:36:14.665272  1
2013-02-16 09:36:14.857322  2
>>> dup_index = datetime.today()
>>> df2 = DataFrame({'a':[2,3]},index=[dup_index,dup_index])
>>> df2
                            a
2013-02-15 09:37:11.701271  2
2013-02-15 09:37:11.701271  3
>>>
>>> df2.reindex(df.index,method='ffill')
Traceback (most recent call last):
...
Exception: Reindexing only valid with uniquely valued Index objects

I wish to merge df2 with df. Because the index times do not match up, I want to match the df2 time with the closest preceding time in df, which is the first row. One artificial workaround I came up with is to add a fake microsecond value to the second time series so that its index becomes unique, but this is slow for big dataframes. Is there a particular reason why reindexing with duplicate index values is not allowed? It seems like a logical thing to do. Is there a better way to overcome this limitation?

Solution

I ran into a similar problem recently. I solved it by first removing the duplicates from df2. Doing it this way makes you think about which row to keep and which to discard. Unfortunately, pandas doesn't seem to have a great way to remove duplicates based on duplicate index entries, but this workaround (adding an 'index' column to df2) should do it:

>>> # Copy the index into a column so that drop_duplicates can compare it
>>> df2['index'] = df2.index
>>> df3 = df2.drop_duplicates(subset='index', keep='last').reindex(df.index, method='ffill')
>>> del df3['index']
>>> df3
                             a
2013-02-21 09:51:56.615338 NaN
2013-02-22 09:51:56.615357   3

Of course, you could set keep='first' instead to get a value of 2 for the a column.
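
The same deduplicate-then-reindex idea can probably be written without the temporary 'index' column in newer pandas versions, using Index.duplicated to build a boolean mask. The following is only a sketch under that assumption; the fixed timestamp t0 and the name deduped are illustrative and not part of the original answer.

```python
from datetime import datetime, timedelta

import pandas as pd

# Fixed timestamps (illustrative values) so the result is reproducible.
t0 = datetime(2013, 2, 21, 9, 51, 56)
df = pd.DataFrame({'a': [1, 2]}, index=[t0, t0 + timedelta(days=1)])

dup_index = t0 + timedelta(seconds=1)
df2 = pd.DataFrame({'a': [2, 3]}, index=[dup_index, dup_index])

# Keep only the last row for each duplicated index label.
deduped = df2[~df2.index.duplicated(keep='last')]

# The index is now unique, so reindexing with forward fill is allowed.
df3 = deduped.reindex(df.index, method='ffill')
print(df3)
```

As in the workaround above, keep='first' would retain the row with a equal to 2 instead.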

I noticed that you said "I wish to match the df2 time with the closest last time in df, which is the first row". I didn't quite understand this statement. The closest time in df to the time in df2 is the second row, not the first row. If I misunderstood your question, let me know and I'll update this answer.
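
If the intent was instead to look up, for each row of df2, the last row of df at or before its timestamp, then pandas.merge_asof might be a better fit than reindex, because it accepts duplicate keys on the left side as long as both indexes are sorted. This is a sketch of one possible reading of the question, not a confirmed answer to it; the timestamps and the suffixes '_df2' and '_df' are illustrative choices.

```python
from datetime import datetime, timedelta

import pandas as pd

# Fixed timestamps (illustrative values) so the result is reproducible.
t0 = datetime(2013, 2, 21, 9, 51, 56)
df = pd.DataFrame({'a': [1, 2]}, index=[t0, t0 + timedelta(days=1)])

dup_index = t0 + timedelta(seconds=1)
df2 = pd.DataFrame({'a': [2, 3]}, index=[dup_index, dup_index])

# For every df2 row, find the last df row at or before its timestamp.
# Both indexes must be sorted; duplicate keys in df2 are allowed.
matched = pd.merge_asof(
    df2, df,
    left_index=True, right_index=True,
    direction='backward',
    suffixes=('_df2', '_df'),
)
print(matched)
```

Here both df2 rows are paired with the first row of df, so no duplicate has to be discarded.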

For reference, here is my test data:

>>> df
                            a
2013-02-21 09:51:56.615338  1
2013-02-22 09:51:56.615357  2
>>> df2
                            a
2013-02-21 09:51:57.802331  2
2013-02-21 09:51:57.802331  3

Licensed under: CC-BY-SA with attribution
