Extracting values from rows which meet a condition in R
-
26-12-2019 - |
質問
The data set
I have a big data frame with millions of rows and more than 20 columns. Let me first describe what the data is to make question more clear. The original data frame consists of locations, velocities and accelerations of 2169 vehicles during a 15 minute period. Each vehicle has a unique Vehicle.ID
, an ID of the time frame in which it was observed i.e. Frame.ID
, the velocity of vehicle in that frame i.e. svel
, the acceleration of vehicle in that frame i.e. sacc
and the class of that vehicle, vehicle.class
, i.e. 1= motorcycle, 2= car, 3 = truck. These variables were recorded after every 0.1 seconds i.e. each frame is 0.1 seconds. Here are the first 6 rows:
> dput(head(df))
structure(list(Vehicle.ID = c(2L, 2L, 2L, 2L, 2L, 2L), Frame.ID = 133:138,
Vehicle.class = c(2L, 2L, 2L, 2L, 2L, 2L), Lane = c(2L, 2L,
2L, 2L, 2L, 2L), svel = c(37.29, 37.11, 36.96, 36.83, 36.73,
36.64), sacc = c(0.07, 0.11, 0.15, 0.19, 0.22, 0.25)), .Names = c("Vehicle.ID",
"Frame.ID", "Vehicle.class", "Lane", "svel", "sacc"), row.names = 7750:7755, class = "data.frame")
There are some instances in vehicles' journey during the 15 minute recording period that they completely stop i.e. svel==0
. This continues for some frames and then vehicles gain speed again. For the purpose of reproduciblity I am creating an example data set as follows:
x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
svel = rep(c(1,0,0,0,3),5),
sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5))
What do I want to find?
As described above some vehicles stop and have zero velocity for some time but later accelerate to get up to speed. I want to find the acceleration, sacc
they apply after having zero velocity for some time (moving from standstill position). This means that I should be able to look at the FIRST row AFTER the last frame in which svel==0
. In the example data this means that the car (vehicle.class==2
) having a Vehicle.ID==10
had a velocity, svel
equal to 1 as seen in the first row. Later, it stopped for 3 frames (3 consecutive rows) and then accelerated to velocity, svel
, equal to 3. I want the acceleration sacc
it applied in those 2 frames (rows 4 and 5 for vehicle 10, which come out to be 0.002 and 0.500). This means that for example data, following should be the output by vehicle.class
:
output <- data.frame(Vehicle.ID = c(10,10,20,20,30,30,40,40,50, 50),
vehicle.class = c(2,2,2,2,3,3,3,3,1,1),
xf = rep(c('l','f'),10),
sacc = rep(c(0.002,0.500),5))
xf
identifies the last row l
in which svel==0
and f
is the first one after that. I have tried using plyr
and for loop
to split by vehicle.class
but am not sure how to extract the sacc
.
Note
xf
should be a part of output. It is not in given data.- The original data frame
df
has 2169 vehicles, some stopped and some did not so not all vehicles hadsvel==0
. - The vehicles which did stop didn't do it at the same time. Also, the
number of rows in which
svel==0
is different vehicle to vehicle.
解決
There may be a more elegant way to do this, but this works:
require(data.table)
x <- data.table(x) ## much easier as data.table
x[, xf:='n'] ## create vector with 'n', neither first nor last
# create diff(svel) shifted upwards,
# padding last observation with 0 to avoid cycling
x[, dsvel:=c(diff(svel, lag=1), 0), by=Vehicle.ID]
# svel is zero and dsvel positive at the last 0 value
x[svel==0 & dsvel > 0, xf:='l']
# there may be a better way to do this part
# get index of observation next to 'l'
# there is no risk of spilling to next Vehicle.ID,
# because 'l' can only be second to last
i <- which(x$xf=='l') + 1
x[i, xf:='f']
That should give you the xf
vector you want.
Edit from Arun: +1 @ilir, a very nice answer. Here's another way you could do it with the use of data.table
's inbuilt variables .I
and .N
:
idx = x[, {
ix = tail(.I[svel==0L], 1);
iy = (ix+1L)*((ix+1L) <= .I[.N] | NA)
list(idx = c(ix, iy))
}, by = list(Vehicle.ID, vehicle.class)]$idx
You can now subset with idx
add l
and f
with :=
as follows:
ans <- x[idx][, xf := c("l", "f")]
Vehicle.ID vehicle.class svel sacc xf
1: 10 2 0 0.002 l
2: 10 2 3 0.500 f
3: 20 2 0 0.002 l
4: 20 2 3 0.500 f
5: 30 3 0 0.002 l
6: 30 3 3 0.500 f
7: 40 3 0 0.002 l
8: 40 3 3 0.500 f
9: 50 1 0 0.002 l
10: 50 1 3 0.500 f
.I
contains the row numbers of x
for each group. .N
contains the number of observations for each group. Please read ?data.table
for more.
ix
gets the last occurrence of the 0. We subset the row number corresponding to the last 0, for each group, using tail
.
iy
normally should be the next entry = ix+1L
. But since the 0 may be the last entry for some group, we check if it is so by comparing (ix+1L) <= .I[.N]
. If it's FALSE that means ix
is the last entry and so we've to output NA, else we've to output (ix+1L)
.
HTH.
他のヒント
I think I've come up with a reasonably elegant way of representing the problem with dplyr. For each car, we're interested in the rows where it's not stopped in this row, but was stopped in the previous row:
library(dplyr)
df <- tbl_df(data.frame(
id = c(rep(10, 5), rep(20, 5), rep(30, 5), rep(40, 5), rep(50, 5)),
class = c(rep(2, 10), rep(3, 10), rep(1, 5)),
svel = rep(c(1, 0, 0, 0, 3), 5),
sacc = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5)
))
df %.% group_by(id) %.%
mutate(stopped = svel == 0) %.%
filter(lag(stopped) == TRUE, stopped == FALSE)
#> Source: local data frame [5 x 5]
#> Groups: id
#>
#> id class svel sacc stopped
#> 1 10 2 3 0.5 FALSE
#> 2 20 2 3 0.5 FALSE
#> 3 30 3 3 0.5 FALSE
#> 4 40 3 3 0.5 FALSE
#> 5 50 1 3 0.5 FALSE
You could write this a little more compactly as
df %.% group_by(id) %.%
mutate(stopped = svel == 0) %.%
filter(lag(stopped), !stopped)
#> Source: local data frame [5 x 5]
#> Groups: id
#>
#> id class svel sacc stopped
#> 1 10 2 3 0.5 FALSE
#> 2 20 2 3 0.5 FALSE
#> 3 30 3 3 0.5 FALSE
#> 4 40 3 3 0.5 FALSE
#> 5 50 1 3 0.5 FALSE
Not sure I totally understand the question, but I think this is what you are after:
x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
svel = rep(c(1,0,0,0,3),5),
sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5)
)
# find "l" rows, the last row for a given Vehicle.ID where svel==0
l <- FALSE
l[x$svel==0] <- !duplicated(x$Vehicle.ID[x$svel==0], fromLast=TRUE)
# extract all rows following an l row.
x[which(l) + 1, c(1, 2, 4)]
library(data.table)
x = data.table(x)
output = x[xf == "f",sacc.after.zero := sacc, by = vehicle.class]
output[!is.na(sacc.after.zero),]