Question

I have a rather large pandas DataFrame holding a time series, with many different pieces of information for each timestamp (eye-tracking data).

Part of the data looks like this:

In [58]: df
Out[58]:
    time    event
49  44295   NaN
50  44311   NaN
51  44328   NaN
52  44345   2
53  44361   2
54  44378   2
55  44395   2
56  44411   2
57  44428   3
58  44445   3
59  44461   3
60  44478   3 
61  44495   NaN
62  44511   NaN
63  44528   NaN
64  44544   NaN  
65  44561   NaN
66  44578   NaN
67  44594   NaN
68  44611   4
69  44628   4
70  44644   4
71  44661   NaN
72  44678   NaN

I would like to calculate the duration of each event as max(time) - min(time) over the rows of that event, e.g. for event 2: 44411 - 44345 = 66.

I would like this duration in a new column, so that the data ends up like this:

In [60]: df
Out[60]:
    time    event    duration
49  44295   NaN      NaN
50  44311   NaN      NaN
51  44328   NaN      NaN
52  44345   2        66
53  44361   2        66
54  44378   2        66
55  44395   2        66
56  44411   2        66
57  44428   3        50
58  44445   3        50
59  44461   3        50
60  44478   3        50
61  44495   NaN      NaN
62  44511   NaN      NaN
63  44528   NaN      NaN
64  44544   NaN      NaN
65  44561   NaN      NaN
66  44578   NaN      NaN
67  44594   NaN      NaN
68  44611   4        33
69  44628   4        33
70  44644   4        33
71  44661   NaN      NaN
72  44678   NaN      NaN

How can I do that?


Solution

One way is to use groupby and transform. max - min is also called peak-to-peak, or ptp for short, so "ptp" here is essentially shorthand for lambda x: x.max() - x.min().

>>> import pandas as pd
>>> df = pd.read_csv("eye.csv", sep=r"\s+")
>>> df["duration"] = df.dropna().groupby("event")["time"].transform("ptp")
>>> df
     time  event  duration
49  44295    NaN       NaN
50  44311    NaN       NaN
51  44328    NaN       NaN
52  44345      2        66
53  44361      2        66
54  44378      2        66
55  44395      2        66
56  44411      2        66
57  44428      3        50
58  44445      3        50
59  44461      3        50
60  44478      3        50
61  44495    NaN       NaN
62  44511    NaN       NaN
63  44528    NaN       NaN
64  44544    NaN       NaN
65  44561    NaN       NaN
66  44578    NaN       NaN
67  44594    NaN       NaN
68  44611      4        33
69  44628      4        33
70  44644      4        33
71  44661    NaN       NaN
72  44678    NaN       NaN

The dropna was to prevent each NaN value in the event column from being considered its own event. (There's also something weird going on in how ptp works when the key is NaN too, but that's a separate issue.)
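
As a self-contained variant of the above, here is a minimal sketch that builds a shortened copy of the question's sample inline instead of reading eye.csv, and spells the peak-to-peak step out as an explicit function. That may be the safer choice on pandas versions that do not accept the "ptp" string. The helper name peak_to_peak is only illustrative, not part of pandas.

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample data from the question (NaN = no event).
df = pd.DataFrame(
    {
        "time": [44328, 44345, 44361, 44378, 44395, 44411,
                 44428, 44445, 44461, 44478, 44495,
                 44611, 44628, 44644, 44661],
        "event": [np.nan, 2, 2, 2, 2, 2,
                  3, 3, 3, 3, np.nan,
                  4, 4, 4, np.nan],
    }
)


def peak_to_peak(times):
    """Return max - min of one group's time values (hypothetical helper)."""
    return times.max() - times.min()


# Drop rows without an event, compute one duration per event, and let
# transform broadcast it back to every row of that event. Assigning the
# result aligns on the index, so the dropped rows end up as NaN.
df["duration"] = (
    df.dropna(subset=["event"])
      .groupby("event")["time"]
      .transform(peak_to_peak)
)

print(df)
```

Rows belonging to event 2 should get 66, event 3 gets 50 and event 4 gets 33, matching the output in the question.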

Other tips

Iterate over the records using groupby from itertools, grouping on the event number. Because your data is already ordered (the rows belonging to one event are not interrupted by rows of another), there is no need to sort by event code first.

groupby will iteratively yield (key, group) tuples, where key is the event code and group is an iterator over all the records for that event.

From those records, pick the minimum and maximum time and calculate the duration.

Then add the durations as a new field to your records.

There may be more efficient methods using pandas that I am not aware of. The solution described here does not require pandas.
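
The steps described in this answer could be sketched roughly as follows. This is only an illustration using the standard library: the records are assumed to be (time, event) tuples, None stands in for a missing event, and the names records and with_duration are made up for the example.

```python
from itertools import groupby

# (time, event) records in their original order; None marks "no event".
records = [
    (44328, None),
    (44345, 2), (44361, 2), (44378, 2), (44395, 2), (44411, 2),
    (44428, 3), (44445, 3), (44461, 3), (44478, 3),
    (44495, None),
    (44611, 4), (44628, 4), (44644, 4),
    (44661, None),
]

with_duration = []
# groupby only merges *consecutive* records with the same key, which is
# why the data must already be ordered by event.
for event, group in groupby(records, key=lambda rec: rec[1]):
    rows = list(group)  # the group is an iterator, so materialise it
    if event is None:
        duration = None  # rows without an event get no duration
    else:
        times = [t for t, _ in rows]
        duration = max(times) - min(times)
    # Attach the group's duration to each of its records.
    with_duration.extend((t, ev, duration) for t, ev in rows)

for row in with_duration:
    print(row)
```

Each output tuple is (time, event, duration), so the result can be turned back into a table or a DataFrame column if needed.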

I ended up using the following workaround, based on the answer posted by @DSM:

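import numpy as np

# df is the frame being processed (datalist[i][j] in my own code)
df["dur"] = df.groupby("event")["time"].transform("ptp")
dur = []
for idx in df.index:
    if np.isnan(df["event"][idx]):
        # no event on this row, so no duration either
        dur.append(np.nan)
    else:
        dur.append(df["dur"][idx])
df["Duration"] = dur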

This at least works for me.
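
If the explicit loop turns out to be slow on a large frame, one possible alternative is to compute a single duration per event and map it back onto the event column. This is a sketch under the assumption that the frame has the time and event columns shown in the question; the names per_event and duration_by_event are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real frame (NaN = no event).
df = pd.DataFrame(
    {
        "time": [44328, 44345, 44361, 44411, 44428, 44478, 44495, 44611, 44644],
        "event": [np.nan, 2, 2, 2, 3, 3, np.nan, 4, 4],
    }
)

# One row per event with its first and last timestamp; rows whose event
# is NaN are left out of the groups by default.
per_event = df.groupby("event")["time"].agg(["min", "max"])
duration_by_event = per_event["max"] - per_event["min"]

# Look up each row's event in that table; NaN events have no entry and
# therefore stay NaN.
df["Duration"] = df["event"].map(duration_by_event)

print(df)
```

This avoids both the per-row loop and the intermediate dur column, at the cost of building a small lookup table first.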

License: CC-BY-SA with attribution
Not affiliated with StackOverflow