Applying an lm function to different ranges of data and separate groups using data.table

https://stackoverflow.com/questions/21793593

12-10-2022

Question

How do I perform a linear regression using different intervals for data in different groups in a data.table? I am currently doing this using plyr but with large data sets it gets very slow. Any help to speed up the process is greatly appreciated.

I have a data table which contains 10 counts of CO2 measurements over 10 days, for 10 plots and 3 fences. Different days fall into different time periods, as described below.

I would like to perform a linear regression to determine the rate of change of CO2 for each fence, plot and day combination using a different interval of counts during each period. Period 1 should regress CO2 during counts 1-5, period 2 using 1-7 and period 3 using 1-9.
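One way to express the period-to-interval rule above is a small lookup table joined onto the data. This is only a sketch: the names `windows`, `max_count`, `toy` and `kept` are made up for illustration, and it assumes the intervals are inclusive (1-5, 1-7, 1-9) as written in the text.

```r
library(data.table)

# Hypothetical lookup table: largest count to keep in each period,
# following the intervals described in the text (1-5, 1-7, 1-9).
windows <- data.table(period = 1:3, max_count = c(5L, 7L, 9L))

# Toy data: one row per period/count combination (not the real flux data).
toy <- CJ(period = 1:3, count = 1:10)

# Attach each row's limit with an update join, then keep rows inside the window.
toy[windows, max_count := i.max_count, on = "period"]
kept <- toy[count <= max_count]

# Number of counts retained per period.
counts_per_period <- kept[, .N, by = period]
print(counts_per_period)
```

Changing an interval then only means editing one row of `windows`, rather than a nested `ifelse`.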

library(data.table)

CO2   <- rep(runif(10, 350, 359), 300)        # 10 days, 10 plots, 3 fences
count <- rep(1:10, 300)                       # 10 days, 10 plots, 3 fences
DOY   <- rep(rep(152:161, each = 10), 30)     # 10 measurements/day, 10 plots, 3 fences
fence <- rep(1:3, each = 1000)                # 10 days, 10 measurements, 10 plots
plot  <- rep(rep(1:10, each = 100), 3)        # 10 days, 10 measurements, 3 fences
flux  <- as.data.frame(cbind(CO2, count, DOY, fence, plot))
# Days up to 155 are period 1, days 156-157 are period 2, later days are period 3
flux$period <- ifelse(flux$DOY <= 155, 1, ifelse(flux$DOY > 155 & flux$DOY < 158, 2, 3))
flux <- as.data.table(flux)

I expect an output which gives me the R2 fit and slope of the line for each plot, fence and DOY.

The data I have provided is a small subsample; my real data has 1*10^6 rows. The following works, but is slow:

library(plyr)
# Fit CO2 ~ count; the strict inequalities keep counts 2-4, 2-6 and 2-8.
model <- function(df) {
  lm(CO2 ~ count, data = subset(df, ifelse(period == 1, count > 1 & count < 5,
    ifelse(period == 2, count > 1 & count < 7, count > 1 & count < 9))))
}

model_flux <- dlply(flux, .(fence, plot, DOY), model)

rsq <- function(x) summary(x)$r.squared
coefs_flux <- ldply(model_flux, function(x) c(coef(x), rsquare = rsq(x)))
names(coefs_flux)[1:5] <- c("fence", "plot", "DOY", "intercept", "slope")

Solution

Here is a "data.table" way to do this:

library(data.table)
flux <- as.data.table(flux)
setkey(flux,count)
flux[,include:=(period==1 & count %in% 2:4) | 
                (period==2 & count %in% 2:6) | 
                (period==3 & count %in% 2:8)]
flux.subset <- flux[(include),]
setkey(flux.subset,fence,plot,DOY)

model <- function(df) {
  fit <- lm(CO2 ~ count, data = df)
  return(list(intercept=coef(fit)[1], 
              slope=coef(fit)[2],
              rsquare=summary(fit)$r.squared))
}
coefs_flux <- flux.subset[,model(.SD),by="fence,plot,DOY"]

Unless I'm missing something, the subsetting you do in each call to model(...) is unnecessary. You can segment the counts by period in one step at the beginning. This code yields the same results as yours, except that ldply(...) returns a data frame and this code produces a data table. It isn't much faster on this test dataset.
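Since the speed-up is small, much of the remaining cost may be the overhead of calling lm() once per group. For a regression with a single predictor and an intercept, the slope, intercept and R2 have closed forms, so a rough alternative is to compute them directly. The sketch below is not part of the original answer: `fast_fit` and the toy table are hypothetical names, it assumes no missing values and at least two distinct counts per group, and it is worth benchmarking on real data before relying on it.

```r
library(data.table)

# Hypothetical helper: closed-form simple linear regression of y on x.
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x);
# R^2 equals the squared Pearson correlation when there is one predictor.
fast_fit <- function(x, y) {
  slope <- cov(x, y) / var(x)
  list(intercept = mean(y) - slope * mean(x),
       slope     = slope,
       rsquare   = cor(x, y)^2)
}

# Small made-up data set: 2 groups with 5 counts each.
set.seed(1)
toy <- data.table(grp   = rep(1:2, each = 5),
                  count = rep(1:5, 2),
                  CO2   = runif(10, 350, 359))

# One row of coefficients per group, without building lm objects.
coefs_fast <- toy[, fast_fit(count, CO2), by = grp]

# Cross-check the first group against lm().
fit1 <- lm(CO2 ~ count, data = toy[grp == 1])
print(all.equal(coefs_fast$slope[1], unname(coef(fit1)[2])))
```

With the table from the answer, the equivalent call would be flux.subset[, fast_fit(count, CO2), by = .(fence, plot, DOY)].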

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow