Select a value for based on a highest value in another column

https://stackoverflow.com//questions/10705290

r
reshape2

13-12-2019
|

Question

I don't understand why I can't find a solution for this, since I feel that this is a pretty basic question. Need to ask for help, then. I want to rearrange airquality dataset by month with maximum temp value for each month. In addition I want to find the corresponding day for each monthly maximum temperature. What is the laziest (code-wise) way to do this?

I have tried following without a success:

require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"))

dcast(mm, month + day ~ variable, max)
aggregate(formula = temp ~ month + day, data = airquality, FUN = max)

I am after something like this:

month day temp
5     7    89
...

Solution

There was quite a discussion a while back about whether being lazy is good or not. Anwyay, this is short and natural to write and read (and is fast for large data so you don't need to change or optimize it later) :

require(data.table)
DT=as.data.table(airquality)

DT[,.SD[which.max(Temp)],by=Month]

     Month Ozone Solar.R Wind Temp Day
[1,]     5    45     252 14.9   81  29
[2,]     6    NA     259 10.9   93  11
[3,]     7    97     267  6.3   92   8
[4,]     8    76     203  9.7   97  28
[5,]     9    73     183  2.8   93   3

.SD is the subset of the data for each group, and you just want the row from it with the largest Temp, iiuc. If you need the row number then that can be added.

Or to get all the rows where the max is tied :

DT[,.SD[Temp==max(Temp)],by=Month]

     Month Ozone Solar.R Wind Temp Day
[1,]     5    45     252 14.9   81  29
[2,]     6    NA     259 10.9   93  11
[3,]     7    97     267  6.3   92   8
[4,]     7    97     272  5.7   92   9
[5,]     8    76     203  9.7   97  28
[6,]     9    73     183  2.8   93   3
[7,]     9    91     189  4.6   93   4

OTHER TIPS

Another approach with plyr

require(reshape2)
names(airquality) <- tolower(names(airquality))
mm <- melt(airquality, id.vars = c("month", "day"), meas = c("temp"), value.name = 'temp')

library(plyr)

ddply(mm, .(month), subset, subset = temp == max(temp), select = -variable)

Gives

  month day temp
1     5  29   81
2     6  11   93
3     7   8   92
4     7   9   92
5     8  28   97
6     9   3   93
7     9   4   93

Or, even simpler

require(reshape2)
require(plyr)
names(airquality) <- tolower(names(airquality))
ddply(airquality, .(month), subset, 
  subset = temp == max(temp), select = c(month, day, temp) )

how about with plyr?

max.func <- function(df) {
   max.temp <- max(df$temp)

   return(data.frame(day = df$Day[df$Temp==max.temp],
                     temp = max.temp))
}

ddply(airquality, .(Month), max.func)

As you can see, the max temperature for the month happens on more than one day. If you want different behavior, the function is easy enough to adjust.

Or if you want to use the data.table package (for instance, if speed is an issue and the data set is large or if you prefer the syntax):

library(data.table)
DT <- data.table(airquality)
DT[, list(maxTemp=max(Temp), dayMaxTemp=.SD[max(Temp)==Temp, Day]), by="Month"]

If you want to know what the .SD stands for, have a look here: SO

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow