Question

I have a data set that contains the following fields: an rscore, gvkey, sic2, year, and cdom. What I am looking to do is calculate percentile ranks based on summed rscores across all temporal spans (~1500) for a given gvkey, and then calculate percentile ranks within a given temporal span and sic2, again by gvkey.

Calculating the percentiles across all temporal spans is fairly quick, but once I add the sic2 percentile ranks the process becomes quite slow, and we are likely looking at roughly 65,000 subsets in total. I'm wondering if there is any way to speed this up.

The data for one temporal span looks like the following:

gvkey  sic2  cdom   rscoreSum    pct
 1187    10   USA    8.00E-02    0.942268617
 1265    10   USA   -1.98E-01    0.142334654
 1266    10   USA    4.97E-02    0.88565478
 1464    10   USA   -1.56E-02    0.445748247
 1484    10   USA    1.40E-01    0.979807985
 1856    10   USA   -2.23E-02    0.398252565
 1867    10   USA    4.69E-02    0.8791019
 2047    10   USA   -5.00E-02    0.286701209
 2099    10   USA   -1.78E-02    0.430915371
 2127    10   USA   -4.24E-02    0.309255308
 2187    10   USA    5.07E-02    0.893020421
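
For context, the pct column above is the market-wide percentile rank for the span. A minimal sketch of that step (assuming dfSum holds the summed rscores for one temporal span; the actual code may differ) is simply:

# market-wide percentile ranks: rank each gvkey's summed rscore across all industries
dfSum$pct <- rank(dfSum$rscoreSum) / nrow(dfSum)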

The code to calculate the industry percentile ranks is below and is fairly straightforward.

# generate percentile ranks within each 2-digit SIC industry
library(plyr)

dout <- ddply(dfSum, .(sic2), function(x){
  # rank each gvkey's summed rscore within its industry group
  data.frame(gvkey = x$gvkey,
             indPct = rank(x$rscoreSum) / nrow(x))
})

# merge the 2-digit industry SIC percentile ranks with the market percentile ranks
dfSum <- merge(dfSum, dout, by = "gvkey")
names(dfSum)[2] <- 'sic2'  # the merged sic2 column picks up a suffix; rename column 2 back

Any suggestions to speed the process would be appreciated!

Solution

You might try the data.table package for fast operations across relatively large datasets like yours. For example, my machine has no problem working through this:

library(data.table)

# Create a dataset like yours, but bigger
n.rows <- 2e6
n.sic2 <- 1e4
dfSum <- data.frame(gvkey=seq_len(n.rows),
                    sic2=sample.int(n.sic2, n.rows, replace=TRUE),
                    cdom="USA",
                    rscoreSum=rnorm(n.rows))

# Now make your dataset into a data.table
dfSum <- data.table(dfSum)

# Calculate the percentiles
# Note that there is no need to re-assign the result
dfSum[, indPct:=rank(rscoreSum)/length(rscoreSum), by="sic2"]

whereas the plyr equivalent takes a while.
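
If you want to see the difference on your own machine, a rough timing comparison on the simulated data above might look like the following (numbers will vary; the ddply call below is just an assumed equivalent of the original approach):

library(plyr)

# time the plyr version on a plain data.frame copy
system.time(
  dout <- ddply(as.data.frame(dfSum), .(sic2), transform,
                indPct = rank(rscoreSum)/length(rscoreSum))
)

# time the grouped data.table update (modifies dfSum by reference)
system.time(
  dfSum[, indPct:=rank(rscoreSum)/length(rscoreSum), by="sic2"]
)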

If you like the plyr syntax (I do), you may also be interested in the dplyr package, which is billed as "the next generation of plyr", with support for faster data stores in the backend.
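
For example, a minimal dplyr sketch of the same grouped calculation (assuming a data frame with sic2 and rscoreSum columns) could look like this:

library(dplyr)

# group by 2-digit SIC and compute within-industry percentile ranks
dfSum <- dfSum %>%
  group_by(sic2) %>%
  mutate(indPct = rank(rscoreSum) / n()) %>%
  ungroup()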
