One solution I can think of involves a self-join with the year shifted by one. I will be using data.table for the simplicity of both the join and the grouping required. I'll also be changing some names and the year format for convenience. I have saved your data in a data.frame called dd:
names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))
dd.prev <- dd
dd.prev$year <- dd.prev$year + 1 ## shifting year upwards so it matches the next year
library(data.table) ## library() stops with an error if the package is missing
dd <- data.table(dd)
setkey(dd, group, year)
dd.prev <- data.table(dd.prev)
setkey(dd.prev, group, year)
setnames(dd.prev, 'id', 'id.prev') ## changing variable name so it is distinct
extra.year <- max(dd$year) + 1 ## the shift generates an extra year
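dd.prev <- dd.prev[year != extra.year] ## drop extra year as retention not defined
## join data from previous year to current data; the join is many-to-many,
## which recent data.table versions refuse unless allow.cartesian=TRUE
dd <- dd[dd.prev, allow.cartesian=TRUE]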
## retention is computed per group and year, on the joined table dd
dd[, retention := length(intersect(id, id.prev)) /
                  length(unique(id.prev)),
   by = list(group, year)]
That last bit computes the retention rate as you defined it: the number of students who still remain from last year, length(intersect(id, id.prev)), divided by the total number of students last year, length(unique(id.prev)). With this data it only generates the retention for 2010, but with a longer series it would generate it for all years except the first.
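To make the arithmetic concrete, here is a minimal sketch of that ratio for a single group and a single pair of years. The ids are made up for illustration and are not taken from your data:

```r
## Hypothetical ids, for illustration only
id.prev <- c(1, 2, 3, 4)   # students enrolled last year
id      <- c(2, 3, 5)      # students enrolled this year

retained  <- intersect(id, id.prev)   # students present in both years
retention <- length(retained) / length(unique(id.prev))
```

Two of last year's four students are still enrolled, so retention comes out as 0.5. Student 5 is new this year and does not affect the rate.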
UPDATE 1: Using plyr (starting again from the original data.frame dd)
names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))
dd.prev <- dd
dd.prev$year <- dd.prev$year + 1 ## shifting year upwards so it matches the next year
names(dd.prev)[1] <- 'id.prev' ## changing variable name so it is distinct
extra.year <- max(dd$year) + 1 ## the shift generates an extra year
dd.prev <- dd.prev[dd.prev$year!=extra.year,] ## drop extra year
dd <- merge(dd, dd.prev, all.y=TRUE) ## join data from previous year to current data
library(plyr)
## one row per group and year, holding the retention rate
dd <- ddply(dd, .(group, year), summarize,
            retention = length(intersect(id, id.prev)) /
                        length(unique(id.prev)))
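As a quick sanity check of the shift-and-merge idea, here is a minimal sketch that runs the same steps in base R only, on a small made-up data set. The toy data and the names joined, pieces and res are my own assumptions, not part of your data:

```r
## Toy data (hypothetical): one group, two years
dd <- data.frame(id    = c(1, 2, 3, 4, 2, 3, 5),
                 year  = c(2009, 2009, 2009, 2009, 2010, 2010, 2010),
                 group = "A")

dd.prev <- dd
dd.prev$year <- dd.prev$year + 1          # shift so 2009 lines up with 2010
names(dd.prev)[1] <- "id.prev"            # keep previous-year ids distinct
dd.prev <- dd.prev[dd.prev$year != max(dd$year) + 1, ]  # drop the extra year

## merge() joins on the shared columns, here year and group
joined <- merge(dd, dd.prev, all.y = TRUE)

## one retention value per group/year combination
pieces <- split(joined, list(joined$group, joined$year), drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(p)
  data.frame(group = p$group[1],
             year  = p$year[1],
             retention = length(intersect(p$id, p$id.prev)) /
                         length(unique(p$id.prev)))))
```

With this toy data res has a single row for group A in 2010, with a retention of 0.5, since students 2 and 3 remain out of the four enrolled in 2009.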
I hope that helps.