First set up the test data in easily reproducible form:
# set up test data
Lines <- "Year ID Month Use
2009 101 1 103
2009 101 2 209
2009 101 3 375
2009 101 4 360
2010 101 1 170
2010 101 2 381
2010 101 3 275
2010 101 4 260
2009 102 1 263
2009 102 2 234
2009 102 3 45
2009 102 4 275
2010 102 1 469
2010 102 2 107
2010 102 3 354
2010 102 4 436
"
DF <- read.table(text = Lines, header = TRUE)
Now that we have the input data, here are some approaches:
1) sqldf/SQLite The following three SQL statements calculate the largest, second largest and third largest Use within each Year/ID group. If they run too slowly, you could try adding an index on Year, ID. Note that the three SQL statements are identical except for their from
clauses. They create the three data frames:
library(sqldf)
max1 <- sqldf("select Year, ID, max(Use) Use
from DF
group by Year, ID")
max2 <- sqldf("select Year, ID, max(Use) Use
from (select Year, ID, Use from DF
except select * from max1)
group by Year, ID")
max3 <- sqldf("select Year, ID, max(Use) Use
from (select Year, ID, Use from DF
except select * from max1
except select * from max2)
group by Year, ID")
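As a rough sketch of the index suggestion above: sqldf accepts a character vector of statements and returns the result of the last one, so an index can be created before the query runs. The index name idx_year_id is made up for illustration, and the main. prefix is used so that the query reads the already uploaded, indexed copy of DF rather than uploading it again. Whether the index actually helps on your data is something to measure, not assume.

```r
library(sqldf)

# Same test data as above, rebuilt so that this block runs on its own
DF <- data.frame(
  Year  = rep(c(2009, 2010, 2009, 2010), each = 4),
  ID    = rep(c(101, 102), each = 8),
  Month = rep(1:4, times = 4),
  Use   = c(103, 209, 375, 360, 170, 381, 275, 260,
            263, 234, 45, 275, 469, 107, 354, 436)
)

# First statement builds the index (idx_year_id is a hypothetical name);
# the second is the max1 query, reading the indexed table via main.DF
max1 <- sqldf(c("create index idx_year_id on DF(Year, ID)",
                "select Year, ID, max(Use) Use
                 from main.DF
                 group by Year, ID"))
max1
```

The same pattern could be applied to the max2 and max3 queries.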
2) sqldf/PostgreSQL The above uses sqldf with SQLite, but it is even easier with sqldf and PostgreSQL because we can then use PostgreSQL's rank()
window function. (The sqldf documentation has more information on using PostgreSQL with sqldf.)
library(RPostgreSQL)
library(sqldf)
DF2 <- sqldf('select *, rank() over (partition by "Year", "ID" order by "Use" desc)
from "DF"')
split(DF2[1:4], DF2$rank)[1:3]
The last line could alternatively be replaced with this:
lapply(1:3, function(r) subset(DF2, rank == r)[1:4])
If we wanted a pure SQL solution, we could use:
max1 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 1')
max2 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 2')
max3 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 3')
or to produce a list of data frames:
lapply(1:3, function(r)
fn$sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = $r'))
3) ave It's not so hard to do this in plain R. Here the Rank 1 rows hold the largest Use in each Year/ID group, the Rank 2 rows the second largest, etc., so we just split on Rank
as in the prior solution and take the first three components:
Rank <- with(DF, ave(-Use, Year, ID, FUN = rank))
split(DF, Rank)[1:3]
This would also work in place of the last line:
lapply(1:3, function(r) subset(DF, Rank == r))
which returns a list whose components are the three data frames.
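One caveat worth sketching: rank() defaults to ties.method = "average", so if two rows in a group had the same Use they would get a fractional rank such as 1.5 and Rank == r would miss them. The test data has no ties within a group, so this does not arise here. The variant below assumes ties should be broken by order of appearance (ties.method = "first"); rank_first and top3 are names introduced only for this illustration.

```r
# Same test data as above, rebuilt so that this block runs on its own
DF <- data.frame(
  Year  = rep(c(2009, 2010, 2009, 2010), each = 4),
  ID    = rep(c(101, 102), each = 8),
  Month = rep(1:4, times = 4),
  Use   = c(103, 209, 375, 360, 170, 381, 275, 260,
            263, 234, 45, 275, 469, 107, 354, 436)
)

# Rank within each Year/ID group, largest Use first.
# ties.method = "first" (an assumption about how ties should be broken)
# guarantees whole-number ranks 1, 2, 3, ... within every group.
rank_first <- function(x) rank(x, ties.method = "first")
Rank <- with(DF, ave(-Use, Year, ID, FUN = rank_first))

# List of three data frames: largest, second largest, third largest
top3 <- split(DF, Rank)[1:3]
top3[[1]]   # one row per Year/ID group, holding that group's largest Use
```

If tied rows should all be returned instead, ties.method = "min" would be the option to look at.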
UPDATE: Wrote out the second solution too.