Question

I've been at this for a few hours now, and can't seem to find a solution. I have a very large data frame (upwards of 1.5 million rows), in which I want to do a fairly specific operation. First of all, my data looks like this:

STATION       DATE      Precip
COOP 310     -7788        .24
COOP 310     -7788        .15
COOP 310     -6654        .59
COOP 310     -6654        .10
COOP 499     -7122        .64
COOP 499     -7122        .36
COOP 499     -7122        .14
COOP 499     -2350        .11
COOP 499     -2350        .85

I have a weather station ID (STATION), a date in UNIX epoch form (DATE), and precipitation values (Precip, recorded at 15-minute intervals whenever it rains). What I've been trying to do is determine the daily rainfall sum for each day that it rained at each station. The desired output would look something like this:

STATION       DATE        24-hour_PRECIP
COOP 310     -7788        0.39
COOP 310     -6654        0.69
COOP 499     -7122        1.14
COOP 499     -2350        0.96

I thought this essentially meant doing a split operation twice: once to split all the data on identical STATION values, and then again on identical DATE values. In theory, that output would then be run through sapply, applying sum to the data in each unique date/station set. My approach (although wrong):

Data frame name is "dfhour":

sp1<-split(dfhour$Precip,dfhour$STATION)

I can do an sapply function fine on this data, but I want to split it even further before using sapply. I know that doing something like

sapply(split(split(dfhour$Precip, dfhour$STATION),dfhour$DATE),FUN=sum)

won't work because the output of a split function is a list, and the next split function would not be able to accept a list as an argument. Does anybody have any guidance on this issue? What other functions could help me get where I need to go?


Solution

I think you're just looking for aggregate. If your data.frame is named "mydf":

> aggregate(Precip ~ ., mydf, sum)
   STATION  DATE Precip
1 COOP 310 -7788   0.39
2 COOP 499 -7122   1.14
3 COOP 310 -6654   0.69
4 COOP 499 -2350   0.96
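One thing worth keeping in mind: the `.` in `Precip ~ .` expands to every other column in the data frame. If the real data has more columns than the three shown in the question, it may be safer to name the grouping columns explicitly. The sketch below illustrates this with a made-up extra column (`QFLAG` is hypothetical and not part of the original data):

```r
# Stand-in for the question's data. QFLAG is a hypothetical extra column,
# added only to show how `.` in the formula would pick it up.
mydf <- data.frame(
  STATION = c("COOP 310", "COOP 310", "COOP 310", "COOP 310",
              "COOP 499", "COOP 499", "COOP 499", "COOP 499", "COOP 499"),
  DATE    = c(-7788L, -7788L, -6654L, -6654L,
              -7122L, -7122L, -7122L, -2350L, -2350L),
  Precip  = c(0.24, 0.15, 0.59, 0.10, 0.64, 0.36, 0.14, 0.11, 0.85),
  stringsAsFactors = FALSE
)
mydf$QFLAG <- seq_len(nrow(mydf))  # unique per row

# `Precip ~ .` groups by STATION, DATE *and* QFLAG: one row per input row.
too_fine <- aggregate(Precip ~ ., data = mydf, FUN = sum)

# Naming the grouping columns gives one row per station/day combination.
daily <- aggregate(Precip ~ STATION + DATE, data = mydf, FUN = sum)
daily
```

With only the three original columns present, the two formulas are equivalent.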

Judging by the size of your data, though, you might want to use data.table instead:

> library(data.table)
data.table 1.8.8  For help type: help("data.table")
> DT <- data.table(mydf, key = "STATION,DATE")
> DT[, list(Precip = sum(Precip)), by = key(DT)]
    STATION  DATE Precip
1: COOP 310 -7788   0.39
2: COOP 310 -6654   0.69
3: COOP 499 -7122   1.14
4: COOP 499 -2350   0.96

Update, as per discussion in comments

Imagine your data were as follows (note the duplicated dates, but at different stations):

mydf <- structure(list(STATION = c("COOP 310", "COOP 310", "COOP 310",                 
     "COOP 310", "COOP 499", "COOP 499", "COOP 499", "COOP 499", "COOP 499",            
     "COOP 499", "COOP 499"), DATE = c(-7788L, -7788L, -6654L, -6654L,                  
     -7122L, -7122L, -7122L, -2350L, -2350L, -7788L, -7788L), Precip = c(0.24,          
     0.15, 0.59, 0.1, 0.64, 0.36, 0.14, 0.11, 0.85, 0.35, 0.17)), .Names = c("STATION", 
     "DATE", "Precip"), row.names = c(NA, 11L), class = "data.frame")
mydf
#     STATION  DATE Precip
# 1  COOP 310 -7788   0.24
# 2  COOP 310 -7788   0.15
# 3  COOP 310 -6654   0.59
# 4  COOP 310 -6654   0.10
# 5  COOP 499 -7122   0.64
# 6  COOP 499 -7122   0.36
# 7  COOP 499 -7122   0.14
# 8  COOP 499 -2350   0.11
# 9  COOP 499 -2350   0.85
# 10 COOP 499 -7788   0.35
# 11 COOP 499 -7788   0.17

Both alternatives presented will generate sums for the combinations of "STATION" and "DATE". Here's the data.table process and result:

DT <- data.table(mydf, key = "STATION,DATE")
DT[, list(Precip = sum(Precip)), by = key(DT)]
#     STATION  DATE Precip
# 1: COOP 310 -7788   0.39
# 2: COOP 310 -6654   0.69
# 3: COOP 499 -7788   0.52
# 4: COOP 499 -7122   1.14
# 5: COOP 499 -2350   0.96
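For comparison, here is a sketch of the aggregate route on the same extended data. The data frame is rebuilt with `data.frame()` so the snippet runs on its own, and the rows are reordered by station and date so they are easier to compare against the keyed data.table result (the reordering is my addition, not something aggregate does by itself):

```r
# Same 11 rows as the extended example, rebuilt so this snippet is self-contained.
mydf <- data.frame(
  STATION = rep(c("COOP 310", "COOP 499"), times = c(4, 7)),
  DATE    = c(-7788L, -7788L, -6654L, -6654L,
              -7122L, -7122L, -7122L, -2350L, -2350L, -7788L, -7788L),
  Precip  = c(0.24, 0.15, 0.59, 0.10, 0.64, 0.36, 0.14, 0.11, 0.85, 0.35, 0.17),
  stringsAsFactors = FALSE
)

# One sum per STATION/DATE combination; the shared date -7788 stays
# separate for the two stations.
agg <- aggregate(Precip ~ STATION + DATE, data = mydf, FUN = sum)

# Sort by station, then date, for easier side-by-side comparison.
agg <- agg[order(agg$STATION, agg$DATE), ]
agg
```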

Other answers

"Upwards of 1.5 million rows" combined with a simple split-apply-combine suggests data.table is the perfect tool for your problem.

I think you'd want something like:

DT[,sum(Precip),by="STATION,DATE"]

Where DT is the data.table form of your data.frame.
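As written, that one-liner leaves the summed column with data.table's default name, `V1`. A small sketch of a variant that names the result column, assuming the data frame is called `dfhour` as in the question (the sample rows below are just the ones from the question):

```r
library(data.table)

# Stand-in for the question's data frame.
dfhour <- data.frame(
  STATION = c("COOP 310", "COOP 310", "COOP 310", "COOP 310",
              "COOP 499", "COOP 499", "COOP 499", "COOP 499", "COOP 499"),
  DATE    = c(-7788L, -7788L, -6654L, -6654L,
              -7122L, -7122L, -7122L, -2350L, -2350L),
  Precip  = c(0.24, 0.15, 0.59, 0.10, 0.64, 0.36, 0.14, 0.11, 0.85),
  stringsAsFactors = FALSE
)

DT <- as.data.table(dfhour)

# Unnamed expression in j: the result column is called V1.
unnamed <- DT[, sum(Precip), by = "STATION,DATE"]

# .() is data.table's alias for list(); naming the element names the column.
daily <- DT[, .(Precip = sum(Precip)), by = .(STATION, DATE)]
daily
```

Without a key, the groups come back in order of first appearance rather than sorted.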

You do not need the nested splits. You just need to provide a single "split" argument that captures the crossed levels, perhaps using the interaction function.

tapply( statfrm$Precip, interaction(statfrm$STATION, statfrm$DATE) , sum) 
#----------------
COOP-310.-7788 COOP-499.-7788 COOP-310.-7122 COOP-499.-7122 COOP-310.-6654 
          0.39             NA             NA           1.14           0.69 
COOP-499.-6654 COOP-310.-2350 COOP-499.-2350 
            NA             NA           0.96 

You can also use a split-sapply strategy to get a similar answer, and in your case the zero values may be more appropriate than the NAs you get with tapply:

 sapply(split(statfrm$Precip, interaction(statfrm$STATION, statfrm$DATE) ), sum) 
#-------
COOP-310.-7788 COOP-499.-7788 COOP-310.-7122 COOP-499.-7122 COOP-310.-6654 
          0.39           0.00           0.00           1.14           0.69 
COOP-499.-6654 COOP-310.-2350 COOP-499.-2350 
          0.00           0.00           0.96 

As for displaying this vector, I sometimes wrap as.matrix around it to get it to display "downward":

as.matrix(sapply(split(statfrm$Precip, interaction(statfrm$STATION, statfrm$DATE) ), sum))
#_________________
               [,1]
COOP-310.-7788 0.39
COOP-499.-7788 0.00
COOP-310.-7122 0.00
COOP-499.-7122 1.14
COOP-310.-6654 0.69
COOP-499.-6654 0.00
COOP-310.-2350 0.00
COOP-499.-2350 0.96
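If the station/date combinations that never occur are not wanted at all (neither as NA nor as 0), one option is to let split drop them. The sketch below passes both grouping columns as a list, which split combines via interaction internally, and sets `drop = TRUE`. The `statfrm` data frame here is a stand-in reconstructed from the names in the output above, so treat it as an assumption:

```r
# Stand-in for statfrm, with hyphenated station IDs as in the printed names.
statfrm <- data.frame(
  STATION = c("COOP-310", "COOP-310", "COOP-310", "COOP-310",
              "COOP-499", "COOP-499", "COOP-499", "COOP-499", "COOP-499"),
  DATE    = c(-7788L, -7788L, -6654L, -6654L,
              -7122L, -7122L, -7122L, -2350L, -2350L),
  Precip  = c(0.24, 0.15, 0.59, 0.10, 0.64, 0.36, 0.14, 0.11, 0.85),
  stringsAsFactors = FALSE
)

# A list of grouping vectors is crossed just like interaction() would do;
# drop = TRUE discards the combinations that have no rows.
groups <- split(statfrm$Precip,
                list(statfrm$STATION, statfrm$DATE),
                drop = TRUE)

sums <- sapply(groups, sum)
as.matrix(sums)
```

Only the four station/date pairs that actually appear in the data remain in the result.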
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow