質問

I have a data.table with a row for each day over a 30 year period with a number of different variable columns. The reason for using data.table is that the .csv file I'm using is huge (approx 1.2 million rows) as there are 30 years worth of data for a number of groups charactertised by a column called 'key'.

An example dataset is shown below:

Key   Date          Runoff
A     1980-01-01    2
A     1980-01-02    1
A     1981-01-01    0.1
A     1981-01-02    3
A     1982-01-01    2
A     1982-01-02    5
B     1980-01-01    1.5
B     1980-01-02    0.5
B     1981-01-01    0.3
B     1981-01-02    2
B     1982-01-01    1.5
B     1982-01-02    4

The above is a sample of two 'keys', with some data for January over three years to show what I mean. The actual dataset has hundreds of 'keys' and 30 years worth of data for each 'key'.

What I want to do is produce an output that has the total average for each month for each key as is shown below:

Key   January  February  March.... etc
A     4.36     ...       ...
B     3.26     ...       ...

i.e. the total average for January for Key A = (2 + 1) + (0.1 + 3) + (2 + 5) / 3

When I have done this analysis on one thirty year dataset (i.e. just one key) I have used the following code successfully to achieve this:

runoff_tot_average <- rowsum(DF$Runoff, format(DF$Date, '%m')) / 30

Where DF is the dataframe for one 30 year dataset.

So could I please have suggestions on how to modify my code above to work with the larger dataset with many 'keys' or offer a completely new solution!

EDIT

The below code produces the above data example:

Key <- c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
Date <- as.Date(c("1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02", "1980-01-01", "1980-01-02", "1981-01-01", "1981-01-02", "1982-01-01", "1982-01-02"))
Runoff <- c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
DT <- data.table(Key, Date, Runoff)
役に立ちましたか?

解決

They only way I could think of doing it was in two steps. Probably not the best way, but here goes

DT[, c("YM", "Month") := list(substr(Date, 1, 7), substr(Date, 6, 7))]
DT[, Runoff2 := sum(Runoff), by = c("Key", "YM")]
DT[, mean(Runoff2), by = c("Key", "Month")]

##   Key Month       V1
## 1:   A    01 4.366667
## 2:   B    01 3.266667

Just to show another (very similar) way:

DT[, c("year", "month") := list(year(Date), month(Date))]
DT[, Runoff2 := sum(Runoff), by=list(Key, year, month)]
DT[, mean(Runoff2), by=list(Key, month)]

Note that you don't have to create new columns, as by supports expressions as well. That is, you can directly use them in by as follows:

DT[, Runoff2 := sum(Runoff), by=list(Key, year = year(Date), month = month(Date))]

But since you require to aggregate more than once, it's better (for speed) to store them as additional columns, as @David has shown here.

他のヒント

If you're not looking for complicated functions and just want the mean, then the following should suffice:

DT[, sum(Runoff) / length(unique(year(Date))), list(Key, month(Date))]
#   Key month       V1
#1:   A     1 4.366667
#2:   B     1 3.266667

Since you said in your question that you would be open to a completely new solution, you could try the following with dplyr:

df$Date <- as.Date(df$Date, format="%Y-%m-%d")
df$Year.Month <- format(df$Date, '%Y-%m')
df$Month <- format(df$Date, '%m')

require(dplyr)

df %>%
  group_by(Key, Year.Month, Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  ungroup() %>%
  group_by(Key, Month) %>%
  summarize(mean(Runoff))

EDIT #1 after comment by @Henrik: The same can be done by:

df %>%
  group_by(Key, Month, Year.Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  summarize(mean(Runoff))

EDIT #2 to round things up: This is another way of doing it (the second grouping is more explicit this way) thanks to @Henrik for his comments

df %>%
  group_by(Key, Month, Year.Month) %>%
  summarize(Runoff = sum(Runoff)) %>%
  group_by(Key, Month, add = FALSE) %>%    #now grouping by Key and Month, but not Year.Month
  summarize(mean(Runoff))

It produces the following result:

#Source: local data frame [2 x 3]
#Groups: Key
#
#  Key Month mean(Runoff)
#1   A    01     4.366667
#2   B    01     3.266667

You can then reshape the output to match your desired output using e.g. reshape2. Suppose you stored the output of the above operation in a data.frame df2, then you could do:

require(reshape2)

df2 <- dcast(df2, Key  ~ Month, sum, value.var = "mean(Runoff)")
ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top