Question

I have a question about creating lag variables depending on a time factor.

Basically, I am working with a baseball dataset in which each player appears in several rows, one per season, between 2002 and 2012. I only want lag variables computed within the same player, so that I can build a career arc to predict the current stat. For example, I want to use the lag 1 average (2004) and the lag 2 average (2003) to predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is row n-1), checks whether the name is the same, and if so grabs the value from the previous row.

Here is my loop:

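TS$runvalueL1 <- NA_real_  # create the column first; row 1 has no previous row
for (i in 2:6264) {        # start at 2, because i - 1 must be a valid row
  if (TS$name[i] == TS$name[i - 1]) {
    # same player as the previous row: copy the previous season's value
    TS$runvalueL1[i] <- TS$Run_Value[i - 1]
  } else {
    # first row of a new player: set only this row to NA, not the whole column
    TS$runvalueL1[i] <- NA
  }
}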

Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!

Sample Data won't help a bunch but here is some:

Edit: The sample data wasn't producing usable results, so I just attached the first 10 rows of my dataset. Thanks!

TS[(6:10),c('name','Season','Run_Value')]
                   name Season     ARuns
321           Abad Andy   2003     -1.05
3158 Abercrombie Reggie   2006     27.42
1312 Abercrombie Reggie   2007      7.65
1069 Abercrombie Reggie   2008      5.34
4614    Abernathy Brent   2002     46.71
707     Abernathy Brent   2003     -2.29
1297    Abernathy Brent   2005      5.59
6024        Abreu Bobby   2002    102.89
6087        Abreu Bobby   2003    113.23
6177        Abreu Bobby   2004    128.60

Thank you!


Solution

Something along these lines should do it:

names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)

library(data.table)
dt = data.table(names, years, Run_value)

# within each name, drop the last value (.N is the group size) and prepend NA
dt[, lag1 := c(NA, Run_value[-.N]), by = names]
#     names years Run_value lag1
#1:   Adams  2002        10   NA
#2:   Adams  2003        15   10
#3:   Adams  2004        15   15
#4:   Adams  2005        20   15
#5:   Bobby  2004        10   NA
#6:   Bobby  2005         5   10
#7: Charlie  2010         5   NA
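
Since the question asks for both a lag 1 and a lag 2 value, here is a minimal sketch using data.table's shift() function, which lags by position within each group and pads with NA. It assumes a reasonably recent data.table version that provides shift(), and that rows are already ordered by year within each name. The column names lag1 and lag2 are my own choice.

```r
library(data.table)

# Toy data from the answer above; rows are ordered by year within each name.
dt <- data.table(
  names     = c("Adams", "Adams", "Adams", "Adams", "Bobby", "Bobby", "Charlie"),
  years     = c(2002, 2003, 2004, 2005, 2004, 2005, 2010),
  Run_value = c(10, 15, 15, 20, 10, 5, 5)
)

# Passing n = 1:2 makes shift() return a list of two vectors,
# one for each new column, computed separately for every name.
dt[, c("lag1", "lag2") := shift(Run_value, n = 1:2), by = names]
print(dt)
```

Players with fewer than three seasons simply get NA in lag2, so no special handling is needed for short careers.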

Other tips

An alternative would be to split the data by name, use lapply with the lag function of your choice, and then combine the split data again:

TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))

or

TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))

There is probably also a neat way to do this with plyr, but since you did not provide a reproducible example, this is all I can offer for a start.

Better:

TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
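
For illustration, one possible your_lag_function is sketched below. The helper name lag_one and the toy data are hypothetical. Note that split() orders its groups by name, so the unlist() version above only lines up with the rows if TS is already sorted by name. This sketch instead splits just the value column and uses unsplit() to put the results back in the original row order.

```r
# Hypothetical toy data standing in for TS (deliberately not sorted by name).
TS <- data.frame(
  name      = c("Bobby", "Adams", "Adams", "Bobby", "Adams"),
  Season    = c(2004, 2002, 2003, 2005, 2004),
  Run_Value = c(10, 10, 15, 5, 15)
)

# Hypothetical helper: shift a vector down by one position, padding with NA.
lag_one <- function(x) c(NA, head(x, -1))

# Split the values by player, lag each piece, then restore the original row order.
pieces <- lapply(split(TS$Run_Value, TS$name), lag_one)
TS$runvalueL1 <- unsplit(pieces, TS$name)
print(TS)
```

The seasons still have to be in ascending order within each player, because the lag is purely positional.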

This is obviously not a problem where you want to create a matrix with cbind, so this is a better data structure:

full=data.frame(names, years, Run_value)

The ave function is quite useful for constructing new columns within categories of other columns:

full$Lag1 <- ave(full$Run_value, full$names,
                 FUN = function(x) c(NA, x[-length(x)]))
full
    names years Run_value Lag1
1   Adams  2002        10   NA
2   Adams  2003        15   10
3   Adams  2004        15   15
4   Adams  2005        20   15
5   Bobby  2004        10   NA
6   Bobby  2005         5   10
7 Charlie  2010         5   NA

I think it's safer to construct the lag with NA, since that will help prevent errors in logic that using 0 for the missing prior year in a player's first season would not alert you to.
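
One caveat: a positional lag takes the previous row, which is not always the previous season. In the question's sample, Abernathy Brent has rows for 2003 and 2005 but none for 2004. If a lag should mean "the season exactly one year earlier", which is an assumption about what is wanted, a minimal base R sketch is to look up each row's (name, year - 1) pair. The column name Lag1_season and the shortened player names are made up for this example.

```r
# Toy data adapted from the question's sample: Brent has no 2004 season.
full <- data.frame(
  names     = c("Brent", "Brent", "Brent", "Bobby", "Bobby"),
  years     = c(2002, 2003, 2005, 2002, 2003),
  Run_value = c(46.71, -2.29, 5.59, 102.89, 113.23)
)

# Build a key for each row and a key for the season one year earlier.
key_now  <- paste(full$names, full$years)
key_prev <- paste(full$names, full$years - 1)

# match() gives the row index of the previous season, or NA if it is absent;
# indexing with NA yields NA, so gaps and first seasons stay missing.
full$Lag1_season <- full$Run_value[match(key_prev, key_now)]
print(full)
```

Here the 2005 row gets NA rather than the 2003 value, and the result does not depend on how the rows are sorted.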

Licensed under: CC-BY-SA with attribution