How to emulate SQL's rank functions in R?
20-06-2021
Question
What is the R equivalent of rank functions like the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)?
I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion, but my main concern is performance. It would be good to avoid joins or index-based access, for the sake of memory and speed.
Solution
The data.table package, especially starting with version 1.8.1, offers much of the functionality of partition in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER should be obvious by the end.
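Before turning to data.table, the tie-handling differences between the three SQL functions can be sketched on a plain vector in base R. This is only a minimal illustration: the vector x and the variable names row_number, sql_rank and dense_rank are made up for this example and are not part of any package.

```r
# Hypothetical toy vector with one tie (the two 20s)
x <- c(10, 20, 20, 30)

# ROW_NUMBER(): ties are broken by position, so every row gets a distinct number
row_number <- rank(x, ties.method = "first")

# RANK(): tied values share the lowest rank and a gap follows them
sql_rank <- rank(x, ties.method = "min")

# DENSE_RANK(): tied values share a rank and no gap follows (the factor trick)
dense_rank <- as.integer(factor(x))

print(row_number)  # 1 2 3 4
print(sql_rank)    # 1 2 2 4
print(dense_rank)  # 1 2 2 3
```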
Here's an example. Load the latest version of data.table from R-Forge:
install.packages("data.table",
repos= c("http://R-Forge.R-project.org", getOption("repos")))
library(data.table)
Create some example data:
set.seed(10)
DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
> DT
ID group value info
1: 1 1 0.01874617 a
2: 2 1 -0.18425254 b
3: 3 1 -1.37133055 b
4: 4 2 -0.59916772 a
5: 5 2 0.29454513 b
6: 6 2 0.38979430 a
7: 7 3 -1.20807618 b
8: 8 3 -0.36367602 a
9: 9 3 -1.62667268 c
10: 10 4 -0.25647839 d
11: 11 4 1.10177950 c
12: 12 4 0.75578151 d
Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):
> DT[,valRank:=rank(-value),by="group"]
ID group value info valRank
1: 1 1 0.01874617 a 1
2: 2 1 -0.18425254 b 2
3: 3 1 -1.37133055 b 3
4: 4 2 -0.59916772 a 3
5: 5 2 0.29454513 b 2
6: 6 2 0.38979430 a 1
7: 7 3 -1.20807618 b 2
8: 8 3 -0.36367602 a 1
9: 9 3 -1.62667268 c 3
10: 10 4 -0.25647839 d 3
11: 11 4 1.10177950 c 1
12: 12 4 0.75578151 d 2
For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):
DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]
R> DT
ID group value info valRank infoRank infoRankDense
1: 1 1 0.01874617 a 1 1 1
2: 2 1 -0.18425254 b 2 2 2
3: 3 1 -1.37133055 b 3 2 2
4: 4 2 -0.59916772 a 3 1 1
5: 5 2 0.29454513 b 2 3 2
6: 6 2 0.38979430 a 1 1 1
7: 7 3 -1.20807618 b 2 2 2
8: 8 3 -0.36367602 a 1 1 1
9: 9 3 -1.62667268 c 3 3 3
10: 10 4 -0.25647839 d 3 2 2
11: 11 4 1.10177950 c 1 1 1
12: 12 4 0.75578151 d 2 2 2
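As for ROW_NUMBER(), one way to read the hint above is that it is the same grouped rank() call with ties broken by position, i.e. ties.method = "first". With data.table that would presumably follow the same pattern as the infoRank line. The sketch below shows the idea in base R with ave(), so it runs without extra packages; the data frame df and the column rowNum are hypothetical and are not the DT shown above.

```r
# Hypothetical toy data, loosely modelled on the group/info columns above
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  info  = c("a", "b", "b", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# ROW_NUMBER() OVER (PARTITION BY group ORDER BY info):
# ave() applies FUN to the row indices of each group; ties in info are
# numbered in order of appearance because of ties.method = "first"
df$rowNum <- ave(
  seq_along(df$info), df$group,
  FUN = function(i) rank(df$info[i], ties.method = "first")
)

print(df)
```

Within each group the numbers run from 1 to the group size with no repeats, which is what distinguishes this from infoRank and infoRankDense.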
p.s. Hi Matthew Dowle.
LEAD and LAG
For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:
set.seed(10)
DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)
> DT
ID group value info idRank
1: 4 1 -0.36367602 b 1
2: 5 1 -1.62667268 b 2
3: 7 1 -1.20807618 b 3
4: 1 2 1.10177950 a 1
5: 2 2 0.75578151 a 2
6: 12 2 -0.25647839 b 3
7: 3 3 0.74139013 c 1
8: 6 3 0.98744470 b 2
9: 9 3 -0.23823356 a 3
10: 8 4 -0.19515038 c 1
11: 10 4 0.08934727 c 2
12: 11 4 -0.95494386 c 3
Then, to get the value from the previous record, use the group and idRank variables, subtract 1 from idRank, and use the mult = 'last' argument. To get the value from the record two entries above, subtract 2.
DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]
ID group value info idRank prev prev2
1: 4 1 -0.36367602 b 1 NA NA
2: 5 1 -1.62667268 b 2 -0.36367602 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760
4: 1 2 1.10177950 a 1 NA NA
5: 2 2 0.75578151 a 2 1.10177950 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795
7: 3 3 0.74139013 c 1 NA NA
8: 6 3 0.98744470 b 2 0.74139013 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901
10: 8 4 -0.19515038 c 1 NA NA
11: 10 4 0.08934727 c 2 -0.19515038 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504
For LEAD, add the appropriate offset to the idRank variable and switch to mult = 'first':
DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]
ID group value info idRank prev prev2 nex nex2
1: 4 1 -0.36367602 b 1 NA NA -1.62667268 -1.2080762
2: 5 1 -1.62667268 b 2 -0.36367602 NA -1.20807618 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760 NA NA
4: 1 2 1.10177950 a 1 NA NA 0.75578151 -0.2564784
5: 2 2 0.75578151 a 2 1.10177950 NA -0.25647839 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795 NA NA
7: 3 3 0.74139013 c 1 NA NA 0.98744470 -0.2382336
8: 6 3 0.98744470 b 2 0.74139013 NA -0.23823356 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901 NA NA
10: 8 4 -0.19515038 c 1 NA NA 0.08934727 -0.9549439
11: 10 4 0.08934727 c 2 -0.19515038 NA -0.95494386 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504 NA NA
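The same LAG/LEAD idea can also be sketched without a keyed join, by shifting the values within each group. The version below is a base-R illustration only: lag_n() and lead_n() are hypothetical helpers written for this example, the data frame df is made up, and the rows are assumed to be already sorted in the desired order within each group. (More recent data.table releases also ship a shift() function with type = "lag" and type = "lead", which may be the more convenient route there.)

```r
# Hypothetical helpers: shift a vector by n positions, padding with NA
lag_n  <- function(x, n = 1) c(rep(NA, n), x)[seq_along(x)]
lead_n <- function(x, n = 1) c(x, rep(NA, n))[n + seq_along(x)]

# Toy data, assumed to be sorted within each group already
df <- data.frame(group = c(1, 1, 1, 2, 2), value = c(5, 7, 9, 2, 4))

# LAG(value, 1) and LEAD(value, 1) OVER (PARTITION BY group)
df$prev <- ave(df$value, df$group, FUN = function(v) lag_n(v, 1))
df$nex  <- ave(df$value, df$group, FUN = function(v) lead_n(v, 1))

print(df)
```

The first row of each group gets NA for prev and the last row gets NA for nex, mirroring the NA pattern in the tables above.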
OTHER TIPS
As of data.table v1.9.5+, the function frank() (for fast rank) has been implemented. frank() is useful in interactive scenarios, whereas frankv() is easier to program with.
It implements every operation available in base::rank. In addition, the advantages are:
frank() operates on lists, data.frames and data.tables in addition to atomic vectors.
We can specify, for each column, whether the rank should be computed in increasing or decreasing order.
It also implements the rank type dense in addition to the other types in base.
You can use - on a character column as well to rank in decreasing order.
Here's an illustration of all the above points using the same data.table DT
from @BenBarnes' (excellent) post.
data:
require(data.table)
set.seed(10)
sample_n <- function(x, n) sample(x, n, replace=TRUE)
DT <- data.table(
ID = seq_len(4*3),
group = rep(1:4,each=3),
value = rnorm(4*3),
info = c(sample_n(letters[1:2], 8), sample_n(letters[3:4], 4)))
On single columns:
Compute dense rank:
DT[, rank := frank(value, ties.method="dense"), by=group]
You can also use the other methods min, max, random, average and first.
In decreasing order:
DT[, rank := frank(-value, ties.method="dense"), by=group]
Using frankv, similar to frank:
# increasing order
frankv(DT, "value", ties.method="dense")
# decreasing order
frankv(DT, "value", order=-1L, ties.method="dense")
On multiple columns:
You can use .SD, which stands for Subset of Data and contains the data corresponding to that group. See the Introduction to data.table HTML vignette for more on .SD.
Rank by info, value columns while grouping by group:
DT[, rank := frank(.SD, info, value, ties.method="dense"), by=group]
Use - to specify decreasing order:
DT[, rank := frank(.SD, info, -value, ties.method="dense"), by=group]
You can also use - directly on character columns:
DT[, rank := frank(.SD, -info, -value, ties.method="dense"), by=group]
You can use frankv similarly, providing the columns to the cols argument and the order in which the columns should be ranked using the order argument.
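A possible frankv() call along those lines is sketched here. It assumes a data.table version that provides frankv() with cols and order arguments as described above; the toy table and the column name rnk are hypothetical, chosen only for this illustration.

```r
library(data.table)

# Hypothetical toy table with ties in both info and value
DT2 <- data.table(
  group = c(1L, 1L, 1L, 2L, 2L, 2L),
  info  = c("a", "b", "a", "c", "c", "d"),
  value = c(3, 1, 2, 5, 5, 4)
)

# Dense rank within group: info ascending (1L), then value descending (-1L).
# cols names the ranking columns inside .SD; order gives one direction per column.
DT2[, rnk := frankv(.SD, cols = c("info", "value"),
                    order = c(1L, -1L), ties.method = "dense"),
    by = group]

print(DT2)
```

Passing the column names as a character vector is what makes this form easier to use inside functions, where the ranking columns are only known at run time.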
Small benchmark to compare with base::rank:
set.seed(45L)
x = sample(1e4, 1e7, TRUE)
system.time(ans1 <- base::rank(x, ties.method="first"))
# user system elapsed
# 22.200 0.255 22.536
system.time(ans2 <- frank(x, ties.method="first"))
# user system elapsed
# 0.745 0.014 0.762
identical(ans1, ans2) # [1] TRUE
I like data.table as much as the next guy, but it isn't always necessary. data.table will generally be faster, but even for moderately large data sets, if the number of groups is fairly small, plyr will still perform adequately.
What BenBarnes did using data.tables can be done just as compactly (but, as I noted before, probably slower in many cases) using plyr:
library(plyr)
ddply(DT,.(group),transform,valRank = rank(-value))
ddply(DT,.(group),transform,valRank = rank(info,ties.method = "min"),
valRankDense = as.integer(factor(info)))
and even without loading a single extra package at all:
do.call(rbind,by(DT,DT$group,transform,valRank = rank(-value)))
do.call(rbind,by(DT,DT$group,transform,valRank = rank(info,ties.method = "min"),
valRankDense = as.integer(factor(info))))
although you do lose some of the syntactic niceties in that last case.
dplyr now has window functions, including row_number and dense_rank (see https://dplyr.tidyverse.org/reference/ranking.html):
df <- tibble::tribble(
~subjects, ~date, ~visits,
1L, "21/09/1999", 2L,
1L, "29/04/1999", 4L,
2L, "18/02/1999", 15L,
3L, "10/07/1999", 13L,
4L, "27/08/1999", 7L,
7L, "27/10/1999", 14L,
10L, "18/04/1999", 8L,
13L, "27/09/1999", 14L,
14L, "15/09/1999", 6L,
16L, "27/11/1999", 14L,
20L, "06/02/1999", 4L,
22L, "07/09/1999", 12L,
23L, "24/03/1999", 14L,
24L, "19/01/1999", 7L,
)
Note that, unlike in the SQL ROW_NUMBER() code, ORDER BY does not need to be stipulated separately: the column to order by is passed directly to row_number().
library(dplyr)
df_partition <- df %>%
  group_by(subjects) %>% # group_by plays the role of PARTITION BY in the SQL ROW_NUMBER() window
  mutate(rn = row_number(visits),
         rn_reversed = row_number(desc(visits))) %>%
  ungroup() %>% # grouping by subjects remains on data unless removed like this
  mutate(dense_rank = dense_rank(visits))
I don't think there's a direct equivalent to Oracle's Analytic functions. Plyr will likely be able to achieve some of the analytic functions, but not all directly. I'm sure R can replicate each function separately but I don't think there's a single package that will do it all.
If there's a specific operation you need to achieve in R, then do some googling, and if you come up empty, ask a specific question here on StackOverflow.