Question

I have a data frame in which I want to identify all pairs of rows whose time values t differ by a fixed amount, say diff.

In [8]: df.t
Out[8]:
0    143.082739
1    316.285739
2    344.315561
3    272.258814
4    137.052583
5    258.279331
6    114.069608
7    159.294883
8    150.112371
9    181.537183
...

For example, if diff = 22.2423, then we would have a match between rows 4 and 7.

The obvious way to find all such matches is to iterate over each row and apply a filter to the data frame:

for t in df.t:
    # EPS is a small matching tolerance (e.g. 1e-6)
    matches = df[abs(df.t - (t + diff)) < EPS]
    # log matches

But as I have a lot of values (10,000+), this will be quite slow.

Further, I also want to check whether any rows differ by a multiple of diff; for instance, rows 4 and 9 in my example differ by 2 * diff. This multiplies the work, so my code takes a long time.

Does anyone have any suggestions on a more efficient technique for this?

Thanks in advance.


Edit: Thinking about it some more, the question boils down to finding an efficient way to locate the floating-point values that two lists/Series objects have in common, to within some tolerance.

If I can do this, then I can simply compare df.t, df.t - diff, df.t - 2 * diff, etc.
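
For concreteness, here is a minimal sketch of that reduced problem using NumPy broadcasting; the array names a and b, the sample values, and the tolerance EPS are illustrative only:

import numpy as np

# A rough sketch of the reduced problem (illustrative values and names).
a = np.array([137.052583, 143.082739, 150.112371])
b = np.array([159.294883, 181.537183]) - 22.2423   # e.g. df.t - diff
EPS = 1e-4

# Pairwise |a[i] - b[j]| via broadcasting; True where the values coincide
# to within EPS.
close = np.abs(a[:, None] - b[None, :]) < EPS
print(np.nonzero(close))   # -> (array([0]), array([0]))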


Solution

If you want to check many multiples, it might be best to take the modulo of df with respect to diff and compare the result to zero, within your tolerance.

Whether you use modulo or not, the efficient way to compare floats within some tolerance is NumPy's close-comparison helpers: numpy.isclose gives an elementwise boolean result, while numpy.allclose collapses it to a single True/False for the whole array. Note that numpy.isclose was only added in NumPy 1.7, so on older versions you would fall back to writing abs(a - b) < tol yourself.
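
As a sketch of how those two ideas combine (the use of the question's sample values, the variable names, and the tolerance EPS are my own assumptions, not part of the original answer):

import numpy as np

# Sample values from the question; diff and EPS are assumed for illustration.
t = np.array([143.082739, 316.285739, 344.315561, 272.258814, 137.052583,
              258.279331, 114.069608, 159.294883, 150.112371, 181.537183])
diff, EPS = 22.2423, 1e-4

# Pairwise absolute differences, reduced modulo diff.
residue = np.mod(np.abs(t[:, None] - t[None, :]), diff)

# A residue near 0 or near diff means the difference is (close to) a multiple of diff.
is_multiple = np.isclose(residue, 0, atol=EPS) | np.isclose(residue, diff, atol=EPS)

# Upper triangle only: report each pair once and skip the trivial i == i pairs.
i, j = np.nonzero(np.triu(is_multiple, k=1))
print(list(zip(i.tolist(), j.tolist())))   # [(4, 7), (4, 9), (7, 9)]

The catch is that the pairwise matrix is O(n^2) in both time and memory, which is why a spatial index such as the k-d tree below is attractive for 10,000+ rows.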

So far, what I've described still involves looping over rows, because you must compare each row against every other. A better, though slightly more involved, approach is to use scipy.spatial.cKDTree to query all pairs within a given distance (tolerance).
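
One way to apply it here (a sketch under my own assumptions; the shifted-query trick and the names below are illustrative, not necessarily what the answer had in mind) is to build the tree once over t and query it with t + k * diff for each multiple k of interest:

import numpy as np
from scipy.spatial import cKDTree

# Sample values from the question; diff and EPS are assumed for illustration.
t = np.array([143.082739, 316.285739, 344.315561, 272.258814, 137.052583,
              258.279331, 114.069608, 159.294883, 150.112371, 181.537183])
diff, EPS = 22.2423, 1e-4

tree = cKDTree(t[:, None])                 # k-d tree over the 1-D time values
for k in (1, 2, 3):                        # multiples of diff to check
    # For each row i, indices j with |t[j] - (t[i] + k * diff)| <= EPS.
    hits = tree.query_ball_point(t[:, None] + k * diff, r=EPS)
    for i, js in enumerate(hits):
        for j in js:
            print("rows {} and {} differ by ~{} * diff".format(i, j, k))

This avoids materialising the full n x n difference matrix, at the cost of one tree query per multiple. Depending on your SciPy version, some of these query methods may only be available on scipy.spatial.KDTree (the pure-Python variant) rather than cKDTree.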

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow