Extracting values from rows which meet a condition in R

https://stackoverflow.com//questions/23064471

26-12-2019
|

質問

The data set

I have a big data frame with millions of rows and more than 20 columns. Let me first describe what the data is to make question more clear. The original data frame consists of locations, velocities and accelerations of 2169 vehicles during a 15 minute period. Each vehicle has a unique Vehicle.ID, an ID of the time frame in which it was observed i.e. Frame.ID, the velocity of vehicle in that frame i.e. svel, the acceleration of vehicle in that frame i.e. sacc and the class of that vehicle, vehicle.class, i.e. 1= motorcycle, 2= car, 3 = truck. These variables were recorded after every 0.1 seconds i.e. each frame is 0.1 seconds. Here are the first 6 rows:

> dput(head(df))
structure(list(Vehicle.ID = c(2L, 2L, 2L, 2L, 2L, 2L), Frame.ID = 133:138, 
    Vehicle.class = c(2L, 2L, 2L, 2L, 2L, 2L), Lane = c(2L, 2L, 
    2L, 2L, 2L, 2L), svel = c(37.29, 37.11, 36.96, 36.83, 36.73, 
    36.64), sacc = c(0.07, 0.11, 0.15, 0.19, 0.22, 0.25)), .Names = c("Vehicle.ID", 
"Frame.ID", "Vehicle.class", "Lane", "svel", "sacc"), row.names = 7750:7755, class = "data.frame")

There are some instances in vehicles' journey during the 15 minute recording period that they completely stop i.e. svel==0. This continues for some frames and then vehicles gain speed again. For the purpose of reproduciblity I am creating an example data set as follows:

x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
                    vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
                    svel = rep(c(1,0,0,0,3),5),
                    sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5))

What do I want to find?

As described above some vehicles stop and have zero velocity for some time but later accelerate to get up to speed. I want to find the acceleration, sacc they apply after having zero velocity for some time (moving from standstill position). This means that I should be able to look at the FIRST row AFTER the last frame in which svel==0. In the example data this means that the car (vehicle.class==2) having a Vehicle.ID==10 had a velocity, svel equal to 1 as seen in the first row. Later, it stopped for 3 frames (3 consecutive rows) and then accelerated to velocity, svel, equal to 3. I want the acceleration sacc it applied in those 2 frames (rows 4 and 5 for vehicle 10, which come out to be 0.002 and 0.500). This means that for example data, following should be the output by vehicle.class:

output <- data.frame(Vehicle.ID = c(10,10,20,20,30,30,40,40,50, 50),
                     vehicle.class = c(2,2,2,2,3,3,3,3,1,1),
                     xf = rep(c('l','f'),10),
                     sacc = rep(c(0.002,0.500),5))

xf identifies the last row l in which svel==0 and f is the first one after that. I have tried using plyr and for loop to split by vehicle.class but am not sure how to extract the sacc.

Note

xf should be a part of output. It is not in given data.
The original data frame df has 2169 vehicles, some stopped and some did not so not all vehicles had svel==0.
The vehicles which did stop didn't do it at the same time. Also, the number of rows in which svel==0 is different vehicle to vehicle.

解決

There may be a more elegant way to do this, but this works:

require(data.table)
x <- data.table(x)  ## much easier as data.table
x[, xf:='n']        ## create vector with 'n', neither first nor last

# create diff(svel) shifted upwards, 
# padding last observation with 0 to avoid cycling
x[, dsvel:=c(diff(svel, lag=1), 0), by=Vehicle.ID]

# svel is zero and dsvel positive at the last 0 value
x[svel==0 & dsvel > 0, xf:='l']

# there may be a better way to do this part
# get index of observation next to 'l'
# there is no risk of spilling to next Vehicle.ID,  
# because 'l' can only be second to last
i <- which(x$xf=='l') + 1
x[i, xf:='f']

That should give you the xf vector you want.

Edit from Arun: +1 @ilir, a very nice answer. Here's another way you could do it with the use of data.table's inbuilt variables .I and .N:

idx = x[, {
            ix = tail(.I[svel==0L], 1);
            iy = (ix+1L)*((ix+1L) <= .I[.N] | NA) 
            list(idx = c(ix, iy))
          }, by = list(Vehicle.ID, vehicle.class)]$idx

You can now subset with idx add l and f with := as follows:

ans <- x[idx][, xf := c("l", "f")]
    Vehicle.ID vehicle.class svel  sacc xf
 1:         10             2    0 0.002  l
 2:         10             2    3 0.500  f
 3:         20             2    0 0.002  l
 4:         20             2    3 0.500  f
 5:         30             3    0 0.002  l
 6:         30             3    3 0.500  f
 7:         40             3    0 0.002  l
 8:         40             3    3 0.500  f
 9:         50             1    0 0.002  l
10:         50             1    3 0.500  f

.I contains the row numbers of x for each group. .N contains the number of observations for each group. Please read ?data.table for more.

ix gets the last occurrence of the 0. We subset the row number corresponding to the last 0, for each group, using tail.

iy normally should be the next entry = ix+1L. But since the 0 may be the last entry for some group, we check if it is so by comparing (ix+1L) <= .I[.N]. If it's FALSE that means ix is the last entry and so we've to output NA, else we've to output (ix+1L).

HTH.

他のヒント

I think I've come up with a reasonably elegant way of representing the problem with dplyr. For each car, we're interested in the rows where it's not stopped in this row, but was stopped in the previous row:

library(dplyr)
df <- tbl_df(data.frame(
  id = c(rep(10, 5), rep(20, 5), rep(30, 5), rep(40, 5), rep(50, 5)), 
  class = c(rep(2, 10), rep(3, 10), rep(1, 5)), 
  svel = rep(c(1, 0, 0, 0, 3), 5), 
  sacc = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5)
))

df %.% group_by(id) %.% 
  mutate(stopped = svel == 0) %.%
  filter(lag(stopped) == TRUE, stopped == FALSE)

#> Source: local data frame [5 x 5]
#> Groups: id
#> 
#>   id class svel sacc stopped
#> 1 10     2    3  0.5   FALSE
#> 2 20     2    3  0.5   FALSE
#> 3 30     3    3  0.5   FALSE
#> 4 40     3    3  0.5   FALSE
#> 5 50     1    3  0.5   FALSE

You could write this a little more compactly as

df %.% group_by(id) %.% 
  mutate(stopped = svel == 0) %.%
  filter(lag(stopped), !stopped)

#> Source: local data frame [5 x 5]
#> Groups: id
#> 
#>   id class svel sacc stopped
#> 1 10     2    3  0.5   FALSE
#> 2 20     2    3  0.5   FALSE
#> 3 30     3    3  0.5   FALSE
#> 4 40     3    3  0.5   FALSE
#> 5 50     1    3  0.5   FALSE

Not sure I totally understand the question, but I think this is what you are after:

x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
                vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
                svel = rep(c(1,0,0,0,3),5),
                sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5)
)

# find "l" rows, the last row for a given Vehicle.ID where svel==0
l <- FALSE
l[x$svel==0] <- !duplicated(x$Vehicle.ID[x$svel==0], fromLast=TRUE)
# extract all rows following an l row.
x[which(l) + 1, c(1, 2, 4)]

library(data.table)
x = data.table(x)
output = x[xf == "f",sacc.after.zero := sacc, by = vehicle.class]
output[!is.na(sacc.after.zero),]

ライセンス： CC-BY-SA と帰属

所属していません StackOverflow