Question

I have a very large data set (> 1 million rows) and need to calculate percentiles across all rows that share the same calendar day (e.g., all Jan 1, all Jan 2, ..., all Dec 31). There are many rows with the same year, month and day but different data. Below is an example of the data:

    Year  Month  Day  A  B  C  D
    2007  Jan    1    1  2  3  4
    2007  Jan    1    5  6  7  8
    2007  Feb    1    1  2  3  4
    2007  Feb    1    5  6  7  8
    .
    .
    2010  Dec    30   1  2  3  4
    2010  Dec    30   5  6  7  8
    2010  Dec    31   1  2  3  4
    2010  Dec    31   5  6  7  8

So the 95th percentile for Jan 1 needs to include every Jan 1 row from all years (e.g., 2007-2010) and all columns (A, B, C and D). The same is then done for Jan 2, Jan 3, ..., Dec 30 and Dec 31. With small data sets this is easy in Excel using nested IF statements in an array formula; e.g., ={PERCENTILE(IF(Month($B$2:$B$1000000)="Jan",IF(Day($C$2:$C$1000000)="1",$D$2:$G$1000000)),95%)}

The percentiles could then be added to a new data table containing only the month and day:

    Month  Day  P95  P05
    Jan    1
    Jan    2
    Jan    3
    .
    .
    Dec    30
    Dec    31

Then, using the percentiles, I need to evaluate whether each value in columns A, B, C and D is larger than P95 or smaller than P05 for its date (e.g., Jan 1). New columns containing 1 or 0 could then be added to the first data table: the 05 columns are 1 if the value is smaller than P05, the 95 columns are 1 if the value is larger than P95, and both are 0 otherwise:

    Year  Month  Day  A  B  C  D  A05  B05  C05  D05  A95  B95  C95  D95
    2007  Jan    1    1  2  3  4  1    0    0    0    0    0    0    0
    2007  Jan    1    5  6  7  8  0    0    0    0    0    0    1    1
    .
    .
    2010  Dec    31   5  6  7  8  0    0    0    0    0    0    0    1

Solution

I've called your data dat:

library(plyr)
library(reshape2)

# melt values so all values are in 1 column
dat_melt <- melt(dat, id.vars=c("Year", "Month", "Day"), variable.name="letter", value.name="value")

# get quantiles, split by day
dat_quantiles <- ddply(dat_melt, .(Month, Day), summarise, 
                   P05=quantile(value, 0.05), P95=quantile(value, 0.95))

# merge original data with quantiles
all_dat <- merge(dat_melt, dat_quantiles)

# See if in bounds
all_dat <- transform(all_dat, less05=ifelse(value < P05, 1, 0), greater95=ifelse(value > P95, 1, 0))


   Month Day Year letter value  P05  P95 less05 greater95
1    Dec  30 2010      A     1 1.35 7.65      1         0
2    Dec  30 2010      A     5 1.35 7.65      0         0
3    Dec  30 2010      B     2 1.35 7.65      0         0
4    Dec  30 2010      B     6 1.35 7.65      0         0
5    Dec  30 2010      C     3 1.35 7.65      0         0
6    Dec  30 2010      C     7 1.35 7.65      0         0
7    Dec  30 2010      D     4 1.35 7.65      0         0
8    Dec  30 2010      D     8 1.35 7.65      0         1
9    Dec  31 2010      A     1 1.35 7.65      1         0
10   Dec  31 2010      A     5 1.35 7.65      0         0
11   Dec  31 2010      B     2 1.35 7.65      0         0
12   Dec  31 2010      B     6 1.35 7.65      0         0
13   Dec  31 2010      C     3 1.35 7.65      0         0
14   Dec  31 2010      C     7 1.35 7.65      0         0
15   Dec  31 2010      D     4 1.35 7.65      0         0
16   Dec  31 2010      D     8 1.35 7.65      0         1
17   Feb   1 2007      A     1 1.35 7.65      1         0
18   Feb   1 2007      A     5 1.35 7.65      0         0
19   Feb   1 2007      B     2 1.35 7.65      0         0
20   Feb   1 2007      B     6 1.35 7.65      0         0
21   Feb   1 2007      C     3 1.35 7.65      0         0
22   Feb   1 2007      C     7 1.35 7.65      0         0
23   Feb   1 2007      D     4 1.35 7.65      0         0
24   Feb   1 2007      D     8 1.35 7.65      0         1
25   Jan   1 2007      A     1 1.35 7.65      1         0
26   Jan   1 2007      A     5 1.35 7.65      0         0
27   Jan   1 2007      B     2 1.35 7.65      0         0
28   Jan   1 2007      B     6 1.35 7.65      0         0
29   Jan   1 2007      C     3 1.35 7.65      0         0
30   Jan   1 2007      C     7 1.35 7.65      0         0
31   Jan   1 2007      D     4 1.35 7.65      0         0
32   Jan   1 2007      D     8 1.35 7.65      0         1
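
If you need the wide layout from the question (A05 ... D95 next to the original columns) rather than the long one above, here is a minimal base R sketch of one way to get there. It is an alternative to the plyr/reshape2 code, not a continuation of it. `dat` is rebuilt from the sample rows in the question, and `value_cols`, `key`, `pooled` and `wide` are names made up for this illustration.

```r
# Stand-in for the asker's data, using the sample rows from the question
dat <- data.frame(
  Year  = c(2007, 2007, 2007, 2007, 2010, 2010, 2010, 2010),
  Month = c("Jan", "Jan", "Feb", "Feb", "Dec", "Dec", "Dec", "Dec"),
  Day   = c(1, 1, 1, 1, 30, 30, 31, 31),
  A = c(1, 5, 1, 5, 1, 5, 1, 5),
  B = c(2, 6, 2, 6, 2, 6, 2, 6),
  C = c(3, 7, 3, 7, 3, 7, 3, 7),
  D = c(4, 8, 4, 8, 4, 8, 4, 8)
)
value_cols <- c("A", "B", "C", "D")  # assumed names of the value columns

# One group per calendar day, pooled over years
key <- interaction(dat$Month, dat$Day, drop = TRUE)

# Pool the values of all four columns within each calendar day.
# unlist() stacks the columns one after another, so the key is repeated once per column.
all_values <- unlist(dat[value_cols], use.names = FALSE)
pooled <- split(all_values, rep(key, times = length(value_cols)))

# Pooled 5% and 95% quantiles, named by group ("Jan.1", "Dec.31", ...)
p05 <- sapply(pooled, quantile, probs = 0.05, names = FALSE)
p95 <- sapply(pooled, quantile, probs = 0.95, names = FALSE)

# Look up each row's thresholds, then compare every value column against them
row_p05 <- p05[as.character(key)]
row_p95 <- p95[as.character(key)]
vals <- as.matrix(dat[value_cols])
low  <- (vals < row_p05) * 1L   # 1 if below the 5th percentile, else 0
high <- (vals > row_p95) * 1L   # 1 if above the 95th percentile, else 0
colnames(low)  <- paste0(value_cols, "05")
colnames(high) <- paste0(value_cols, "95")

# Original columns plus the eight indicator columns
wide <- cbind(dat, low, high)
```

Because the thresholds are looked up per row, `dat` keeps its original row order and no merge is needed. I have not timed this on a million rows, so treat it as a starting point.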

Other tips

The result of something along these lines can be merged with the original data frame:

aggregate(dfrm[, c("A", "B", "C", "D")],
          list(Month = dfrm$Month, Day = dfrm$Day),
          FUN = quantile, probs = c(0.05, 0.95))

Note that I suggest merge() for attaching the result. Your description implied, without saying so explicitly, that you want all years' worth of Jan 1 values pooled together, which is what grouping by month and day does. I think this is a lot "easier" than the expression you are using in Excel. It computes both the 0.05 and 0.95 quantiles for each of the four columns.
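
To make the merge step concrete, here is a rough sketch under a few assumptions: `dfrm` is rebuilt from the sample rows in the question, the grouping columns are spelled `Month` and `Day` as in that sample, and the `_P05`/`_P95` column names and the `q`, `q_flat` and `merged` objects are invented for this example. When FUN returns two values, aggregate() stores each of A to D as a two-column matrix, so the sketch flattens those before merging.

```r
# Stand-in for the asker's data, using the sample rows from the question
dfrm <- data.frame(
  Year  = c(2007, 2007, 2007, 2007, 2010, 2010, 2010, 2010),
  Month = c("Jan", "Jan", "Feb", "Feb", "Dec", "Dec", "Dec", "Dec"),
  Day   = c(1, 1, 1, 1, 30, 30, 31, 31),
  A = c(1, 5, 1, 5, 1, 5, 1, 5),
  B = c(2, 6, 2, 6, 2, 6, 2, 6),
  C = c(3, 7, 3, 7, 3, 7, 3, 7),
  D = c(4, 8, 4, 8, 4, 8, 4, 8)
)
value_cols <- c("A", "B", "C", "D")

# Per-column quantiles for each calendar day, pooled over years
q <- aggregate(dfrm[, value_cols],
               list(Month = dfrm$Month, Day = dfrm$Day),
               FUN = quantile, probs = c(0.05, 0.95))

# Each of A..D in q is a matrix with a 5% and a 95% column; flatten to plain columns
q_flat <- do.call(data.frame, q)
names(q_flat) <- c("Month", "Day",
                   paste0(rep(value_cols, each = 2), c("_P05", "_P95")))

# Attach the thresholds to every original row
merged <- merge(dfrm, q_flat, by = c("Month", "Day"))

# Example indicator columns for A; the other columns follow the same pattern
merged$A05 <- as.integer(merged$A < merged$A_P05)
merged$A95 <- as.integer(merged$A > merged$A_P95)
```

Here each column gets its own thresholds (for example, A is compared with the quantiles of A only). If the percentile should pool A, B, C and D together, the values need to be stacked into a single column first.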

Licensed under: CC-BY-SA with attribution