R: Split-Apply-Combine... Apply Functions via Aggregate to Row-Bound Data Frames Subset by Class

https://stackoverflow.com/questions/17891491

04-06-2022
|

Question

Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa

I'm designing a R function to calculate statistics across a data set comprised of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to values for the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.

The data frames are iteratively read into the workspace from file based on a class variable, using the 'by' function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a concatenation of user-provided statistical functions to each metric within a class that corresponds to a given value for the year, month, and day (i.e., the mean [function] low temperature [class] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]. I want the end result to be new data frames containing values for every date within a region and a year range for each metric and statistical function applied. I am very close to having this result using the aggregate() function, but I am having trouble getting reasonable results out of the aggregate function, which is currently outputting NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:

# Example parameters
w <- c("mean","sd","scale")             # Statistical functions to apply
x <- "C:/Data/"                         # Folder location of CSV files
y <- c("MaxTemp","AvgTemp","MinTemp")   # Metrics to subset the data
z <- c(1970:2000)                       # Year range to subset the data

 CSVstnClass  <- data.frame(CSVstations,CSVclasses)

  by(CSVstnClass, CSVstnClass[,2], function(a){                        # Station list by class
  suppressWarnings(assign(paste(a[,2]),paste(a[,1]),envir=.GlobalEnv))
    apply(a, 1, function(b){                                           # Data frame list, row-wise
      classData   <- data.frame()
      sapply(y, function(d){                                           # Element list
        CSV_DF    <- read.csv(paste(x,b[2],"/",b[1],".csv",sep=""))    # Read in CSV files as data frames
        CSV_DF1   <- CSV_DF[!is.na("Value")]
        CSV_DF2   <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d),]
        assign(paste(b[2],"_",d,sep=""),CSV_DF2,envir=.GlobalEnv)

        if(nrow(CSV_DF2) > 0){                                         # Remove empty data frames
          classData <<- rbind(classData,CSV_DF2)                       # Bind all data frames by row for a class and element
          assign(paste(b[2],"_",d,"_bound",sep=""),classData,envir=.GlobalEnv)

          sapply(w, function(g){                                       # Function list
                                                                       # Aggregate results of bound data frame for each unique date
            dataFunc <- aggregate(Value~Year+Month+Day+Element,data=classData,FUN=g,na.action=na.pass)
            assign(paste(b[2],"_",d,"_",g,sep=""),dataFunc,envir=.GlobalEnv)
            })
        }
        })
      })
    })

I think I am pretty close, but I am not sure if rbind() is performing properly, nor why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.

Cheers,

Adam

Solution

You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.

paths <- dir("C:/Data/", pattern = "\\.csv$")

# Read in CSV files as data frames
raw <- lapply(paths, read.csv, str)

# Extract needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
filtered <- lapply(raw, subset, 
  !is.na(Value) & Year %in% filter_years & Element %in% filter_metrics)

# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]

# Compute aggregates
my_aggregate <- function(df, fun) {
  aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun, 
    na.action = na.pass)
}    
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow