Finding max, second max, and third max aggregated by two parameters in R or with R's sqldf

Question 1

The following code creates a list of three data frames (dat is your original data frame):

lapply(seq(3), function(x)
  aggregate(Use ~ Year + ID, dat, function(y)
    y[order(-y)][x]))

The result:

[[1]]
  Year  ID Use
1 2009 101 375
2 2010 101 381
3 2009 102 275
4 2010 102 469

[[2]]
  Year  ID Use
1 2009 101 360
2 2010 101 275
3 2009 102 263
4 2010 102 436

[[3]]
  Year  ID Use
1 2009 101 209
2 2010 101 260
3 2009 102 234
4 2010 102 354

How it works:

The function lapply applies another function once for each element of its first argument. The command seq(3) generates the vector 1, 2, 3, and the parameter x takes each of these numbers in turn. The function aggregate applies another function to the Use values grouped by Year and ID. The parameter y holds the Use values of one group. The command y[order(-y)] sorts these values in descending order. Finally, [x] extracts the x-th element of the sorted vector, i.e. the largest, second-largest, or third-largest value.
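To see the inner step in isolation, here is a minimal sketch using the Use values of a single group (Year 2009, ID 101, taken from the test data in Question 2). The names sorted and top3 are only for illustration.

```r
# Use values for Year 2009, ID 101
y <- c(103, 209, 375, 360)

# order(-y) gives the positions of the values from largest to smallest
sorted <- y[order(-y)]
sorted
# [1] 375 360 209 103

# Pick the x-th largest value for x = 1, 2, 3
top3 <- sapply(seq(3), function(x) sorted[x])
top3
# [1] 375 360 209
```

If a group has fewer than x values, indexing past the end of the vector yields NA for that group.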

Question 2

First set up the test data in easily reproducible form:

# set up test data

Lines <- "Year ID Month Use
2009 101 1 103
2009 101 2 209
2009 101 3 375
2009 101 4 360
2010 101 1 170
2010 101 2 381
2010 101 3 275
2010 101 4 260
2009 102 1 263
2009 102 2 234
2009 102 3 45
2009 102 4 275
2010 102 1 469
2010 102 2 107
2010 102 3 354
2010 102 4 436
"
DF <- read.table(text = Lines, header = TRUE)

Now that we have the input data, here are some approaches:

1) sqldf/SQLite The following three SQL statements calculate these quantities and create the three data frames. The statements are identical except for their from clauses: each one removes the rows already found by the earlier statements before taking the maximum. If they run too slowly, you could try adding an index on Year and ID.

library(sqldf)

max1 <- sqldf("select Year, ID, max(Use) Use 
   from DF 
   group by Year, ID") 

max2 <- sqldf("select Year, ID, max(Use) Use 
   from (select Year, ID, Use from DF 
         except select * from max1) 
   group by Year, ID")

max3 <- sqldf("select Year, ID, max(Use) Use 
   from (select Year, ID, Use from DF 
         except select * from max1 
         except select * from max2) 
   group by Year, ID")
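For readers without sqldf at hand, the same remove-then-maximize logic can be sketched in base R. This is an illustration of the idea, not a translation of what SQLite does internally; row_key, rows, rest1 and rest2 are hypothetical names introduced here.

```r
# Same test data as above
DF <- read.table(header = TRUE, text = "Year ID Month Use
2009 101 1 103
2009 101 2 209
2009 101 3 375
2009 101 4 360
2010 101 1 170
2010 101 2 381
2010 101 3 275
2010 101 4 260
2009 102 1 263
2009 102 2 234
2009 102 3 45
2009 102 4 275
2010 102 1 469
2010 102 2 107
2010 102 3 354
2010 102 4 436
")

# Hypothetical helper: one string key per (Year, ID, Use) row
row_key <- function(d) paste(d$Year, d$ID, d$Use)

# except works on distinct rows, so start from the unique (Year, ID, Use) rows
rows <- unique(DF[c("Year", "ID", "Use")])
max1 <- aggregate(Use ~ Year + ID, rows, max)

# Drop the rows already in max1, then take the maximum of what is left
rest1 <- rows[!row_key(rows) %in% row_key(max1), ]
max2 <- aggregate(Use ~ Year + ID, rest1, max)

# Drop the rows in max2 as well to get the third largest
rest2 <- rest1[!row_key(rest1) %in% row_key(max2), ]
max3 <- aggregate(Use ~ Year + ID, rest2, max)
```

Because except compares distinct rows, a group whose largest Use value occurs twice gets the next smaller distinct value as its second maximum, whereas the sorting approach from Question 1 would return the tied value twice.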

2) sqldf/PostgreSQL The above uses sqldf with SQLite, but it is even easier with sqldf and PostgreSQL because we can then use PostgreSQL's rank() window function. (The sqldf documentation has more information on using PostgreSQL with sqldf.)

library(RPostgreSQL)
library(sqldf)

DF2 <- sqldf('select *, rank() over (partition by "Year", "ID" order by "Use" desc) 
              from "DF"')
split(DF2[1:4], DF2$rank)[1:3]

The last line could alternatively be replaced with this:

lapply(1:3, function(r) subset(DF2, rank == r)[1:4])

If we wanted a pure SQL solution then:

max1 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 1')
max2 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 2')
max3 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 3')

or to produce a list of data frames:

lapply(1:3, function(r) 
   fn$sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = $r'))

3) ave It's not so hard to do this in plain R. Here the rows with Rank 1 hold the largest value in each group, the rows with Rank 2 the second largest, and so on, so we just split on Rank as in the prior solution and take the first three components:

Rank <- with(DF, ave(-Use, Year, ID, FUN = rank))
split(DF, Rank)[1:3]

This would also work in place of the last line:

lapply(1:3, function(r) subset(DF, Rank == r))

which returns a list whose components are the three data frames.
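One caveat worth checking on your own data is ties. The sketch below uses a small made-up data frame (DFt, not part of the original question) to show how rank() treats tied values by default and how its ties.method argument changes that.

```r
# Hypothetical data with a tie for the largest Use within one group
DFt <- data.frame(Year = 2009, ID = 101, Month = 1:4,
                  Use = c(375, 375, 209, 103))

# Default rank() averages tied ranks, so no row gets rank 1 or 2 here
RankAvg <- with(DFt, ave(-Use, Year, ID, FUN = rank))
RankAvg
# [1] 1.5 1.5 3.0 4.0

# ties.method = "first" breaks ties by position, giving ranks 1, 2, 3, ...
RankFirst <- with(DFt, ave(-Use, Year, ID,
                           FUN = function(v) rank(v, ties.method = "first")))
RankFirst
# [1] 1 2 3 4
```

With the default method, split(DF, Rank)[1:3] would pick up components named after fractional ranks such as 1.5, so the first three components no longer correspond to the first, second, and third largest values. Whether "first", "min", or another method is appropriate depends on how you want ties reported.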

UPDATE: Wrote out the second solution too.