Finding max, second max, and third max aggregated by two parameters in R or with R's sqldf

StackOverflow https://stackoverflow.com/questions/20082822

  •  02-08-2022

Question

I am trying to find the max, second max, and third max water use per customer ID per year in a dataset. I'm using R and the sqldf library, but I am open to any R solution. Here is a bit of sample data:

Year | ID  | Month | Use
-----|-----|-------|----
2009 | 101 | 1     | 103
2009 | 101 | 2     | 209
2009 | 101 | 3     | 375
2009 | 101 | 4     | 360
2010 | 101 | 1     | 170
2010 | 101 | 2     | 381
2010 | 101 | 3     | 275
2010 | 101 | 4     | 260
2009 | 102 | 1     | 263
2009 | 102 | 2     | 234
2009 | 102 | 3     | 45
2009 | 102 | 4     | 275
2010 | 102 | 1     | 469
2010 | 102 | 2     | 107
2010 | 102 | 3     | 354
2010 | 102 | 4     | 436

Ideally I would want to return three matrices, max1, max2, and max3, each with columns ID, Year, and Max (or second max or third max, respectively). So max1 = [101, 2009, 375; 101, 2010, 381; 102, 2009, 275; 102, 2010, 469], and so on.

My initial approach was to make a nested for loop with listofIDs and listofYears as the domains of ID and Year, like:

for (i in seq_along(listofIDs)) {
  for (y in seq_along(listofYears)) {
    # not working: sqldf treats listofIDs[i] as literal SQL text, not as an R value
    monthlylist <- sqldf("select Month, Use from Dataframe where ID=listofIDs[i] and Year=listofYears[y]")
  }
}

and then sort monthlylist and pull out the maxima. But sqldf won't read R variables inside the query string like that, so I would have to write where ID = 101, where ID = 102, and so on explicitly each time.

Any ideas on how to get sqldf to recognize my variables, or on a better way to find the max, second max, and third max aggregated by year and ID? I am working with big datasets, so ideally the solution should not take forever.


Solution

The following code creates a list of three data frames (dat is your original data frame):

lapply(seq(3), function(x)
  aggregate(Use ~ Year + ID, dat, function(y)
    y[order(-y)][x]))

The result:

[[1]]
  Year  ID Use
1 2009 101 375
2 2010 101 381
3 2009 102 275
4 2010 102 469

[[2]]
  Year  ID Use
1 2009 101 360
2 2010 101 275
3 2009 102 263
4 2010 102 436

[[3]]
  Year  ID Use
1 2009 101 209
2 2010 101 260
3 2009 102 234
4 2010 102 354

How it works:

The function lapply is used to apply another function multiple times. The command seq(3) generates a vector of numbers from 1 to 3. The parameter x represents one of these numbers. The function aggregate is used to apply another function to the Use values grouped by Year and ID. The parameter y represents the Use values in one group. The command y[order(-y)] sorts the Use values in descending order. Afterwards, [x] is used to extract the first, second, and third element, respectively, of this ordered vector.
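To see the sort-and-index step in isolation, here is a small sketch on a single group, using the Use values for ID 101 in 2009 from the question. The helper name nth_largest is made up for illustration and is not part of the answer.

```r
# Use values for ID 101 in 2009, taken from the question's sample data
y <- c(103, 209, 375, 360)

# order(-y) returns the positions that sort y from largest to smallest
sorted <- y[order(-y)]
print(sorted)   # 375 360 209 103

# nth_largest is a hypothetical helper wrapping the same expression
nth_largest <- function(y, x) y[order(-y)][x]

print(nth_largest(y, 1))  # 375
print(nth_largest(y, 3))  # 209

# Indexing past the end gives NA, e.g. the third largest of only two values
print(nth_largest(c(5, 9), 3))  # NA
```

The last call suggests that a group with fewer than three observations would yield NA rather than an error, which may matter for customers with incomplete years.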

Other tips

First set up the test data in easily reproducible form:

# set up test data

Lines <- "Year ID Month Use
2009 101 1 103
2009 101 2 209
2009 101 3 375
2009 101 4 360
2010 101 1 170
2010 101 2 381
2010 101 3 275
2010 101 4 260
2009 102 1 263
2009 102 2 234
2009 102 3 45
2009 102 4 275
2010 102 1 469
2010 102 2 107
2010 102 3 354
2010 102 4 436
"
DF <- read.table(text = Lines, header = TRUE)

Now that we have the input data, here are some approaches:

1) sqldf/SQLite The following three SQL statements should calculate these quantities and create the three data frames. The statements are the same except for their from clauses. If they run too slowly, you could try adding an index on Year, ID.

library(sqldf)

max1 <- sqldf("select Year, ID, max(Use) Use 
   from DF 
   group by Year, ID") 

max2 <- sqldf("select Year, ID, max(Use) Use 
   from (select Year, ID, Use from DF 
         except select * from max1) 
   group by Year, ID")

max3 <- sqldf("select Year, ID, max(Use) Use 
   from (select Year, ID, Use from DF 
         except select * from max1 
         except select * from max2) 
   group by Year, ID")
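One behaviour worth knowing: except works on distinct rows, so if a customer has the same Use value in two months of a year, both rows are removed together, and max2 becomes the second-largest distinct value. The order-based answer above would instead report the tied value twice. Here is a base R sketch of the difference. It is an analogue of the queries, not the SQL itself, and the tied values are invented for illustration.

```r
# Invented Use values for one Year/ID group, with a tie at the top
use <- c(375, 375, 209, 103)

# Analogue of the except-based SQL: ties collapse to one distinct value
distinct_desc <- sort(unique(use), decreasing = TRUE)
print(distinct_desc[2])  # 209

# Position-based approach: the tied value appears again in second place
position_desc <- sort(use, decreasing = TRUE)
print(position_desc[2])  # 375
```

Which of the two results counts as the "second max" depends on what the analysis needs, so it is worth deciding before choosing an approach.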

2) sqldf/PostgreSQL The above is for sqldf with SQLite, but it is even easier with sqldf and PostgreSQL because then we can use PostgreSQL's rank() window function. (The original answer linked to more information on using PostgreSQL with sqldf.)

library(RPostgreSQL)
library(sqldf)

DF2 <- sqldf('select *, rank() over (partition by "Year", "ID" order by "Use" desc) 
              from "DF"')
split(DF2[1:4], DF2$rank)[1:3]

The last line could alternately be replaced with this:

lapply(1:3, function(r) subset(DF2, rank == r)[1:4])

If we wanted a pure SQL solution then:

max1 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 1')
max2 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 2')
max3 <- sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = 3') 

or to produce a list of data frames:

lapply(1:3, function(r) 
   fn$sqldf('select "Year", "ID", "Month", "Use" from "DF2" where "rank" = $r'))
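The fn$sqldf call above fills in $r from the R variable r before the query is sent to the database, which also addresses the original question about getting sqldf to see R variables. A backend-independent way to get the same effect is to build the query text in R first. The following is a sketch with sprintf. The data frame name Dataframe and the vectors listofIDs and listofYears come from the question (with values assumed from the sample data), while build_query is a hypothetical helper.

```r
# Hypothetical helper: build the SQL text for one ID/Year pair.
# as.integer() keeps non-numeric input from ending up in the query text.
build_query <- function(id, year) {
  sprintf("select Month, Use from Dataframe where ID = %d and Year = %d",
          as.integer(id), as.integer(year))
}

# Values assumed from the question's sample data
listofIDs <- c(101, 102)
listofYears <- c(2009, 2010)

query <- build_query(listofIDs[1], listofYears[2])
print(query)
# The resulting string could then be passed to sqldf(query)
```

Running one query per ID and year is likely to be slower on large data than the grouped solutions shown here, so this is mainly useful when a query really has to be parameterized.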

3) ave It's not so hard to do this in straight R. Here the Rank 1 rows are the largest, the Rank 2 rows the second largest, and so on, so we just split on Rank as in the prior solution and take the first three components:

Rank <- with(DF, ave(-Use, Year, ID, FUN = rank))
split(DF, Rank)[1:3]

This would also work in place of the last line:

lapply(1:3, function(r) subset(DF, Rank == r))

which returns a list whose components are the three data frames.
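Note that rank() assigns averaged ranks to ties by default, so two tied maxima both get rank 1.5 and no row lands in the "1" or "2" group. If ties are possible, passing ties.method = "first" is one way to force whole-number ranks. A sketch with invented tied data (DFtie is not part of the original answer):

```r
# Invented data: one Year/ID group where the top two Use values tie
DFtie <- data.frame(Year = 2009, ID = 101, Month = 1:4,
                    Use = c(375, 375, 209, 103))

# Default ties.method = "average": the tied rows share rank 1.5
rank_avg <- with(DFtie, ave(-Use, Year, ID, FUN = rank))
print(rank_avg)    # 1.5 1.5 3.0 4.0

# ties.method = "first": ties are broken by order of appearance
rank_first <- with(DFtie, ave(-Use, Year, ID,
                              FUN = function(u) rank(u, ties.method = "first")))
print(rank_first)  # 1 2 3 4

# With whole-number ranks, splitting gives one data frame per rank again
top3 <- split(DFtie, rank_first)[1:3]
```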

UPDATE: Wrote out the second solution too.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow