Question

I have a dataset containing information on students enrolled in an after-school program in the following format:

student_id  year     group_number
1           2009-10  1
2           2009-10  1
3           2009-10  2
4           2009-10  3
5           2009-10  3
1           2010-11  1
2           2010-11  2
3           2010-11  3
4           2010-11  2
5           2010-11  2

I want to measure retention for each group on a per-year basis. I need some kind of loop that looks back at the previous year, counts the IDs in each group that also appear in the current year, and divides that count by the total number in the group. I have sketched out code (which is probably inefficient and missing some steps) as follows:

for (i in levels(data$year)) {
  if (i == "2009-10") {
    # no previous year to look for
    next
  } else {
    for (g in levels(data$group)) {
      ## perhaps a plyr summarize function?

      # look for id in previous year for that group
      # compute count of identical ids
      # return value/length(group)
    }
  }
}

Edit: after reading some suggestions, perhaps it would be simpler to use ddply with transform. Is there a way to create an associative relationship between the year and the group number? The code would look something like this:

tracking <- ddply(data, "student_id", transform, enroll.year1=1, enroll.year2=ifelse(criteria goes here,1,0), enroll.year3=ifelse(criteria goes here,1,0))

Some sample output might look like this:

Year     Group  retention rate
2010-11  1      0.88
2011-12  1      0.8
2010-11  2      0.5
2011-12  2      0.6
2010-11  3      0.5
2011-12  3      0.5

Has anyone solved a similar retention problem before? I'm having difficulty conceptualizing the steps, let alone implementing them in R. Any help would be greatly appreciated.


Solution

One solution I can think of involves a self-join shifting one year. I will be using data.table for the simplicity of both joining and the grouping required. I'll also be changing some names and the year format for convenience. I have saved your data in a data.frame called dd:

names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))

dd.prev <- dd
dd.prev$year <- dd.prev$year + 1   ## shifting year upwards so it matches the next year

require(data.table)
dd <- data.table(dd)
setkey(dd, group, year)

dd.prev <- data.table(dd.prev)
setkey(dd.prev, group, year)
setnames(dd.prev, 'id', 'id.prev')  ## changing variable name so it is distinct

extra.year <- max(dd$year) + 1  ## the shift generates an extra year
dd.prev <- dd.prev[year != extra.year]  ## drop extra year as retention not defined

## join data from previous year to current data; each group/year matches
## several rows on both sides, hence allow.cartesian
dd <- dd[dd.prev, allow.cartesian=TRUE]

## retention per group and year
dd[, retention := length(intersect(id, id.prev)) / length(unique(id.prev)),
   by=list(group, year)]

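That last bit computes the retention rate as you defined it, separately for each group and year: the number of students in the group who were also in it last year, length(intersect(id, id.prev)), divided by the total number of students in that group last year, length(unique(id.prev)). A student who moves to a different group therefore counts as not retained in the original group. With this data it only generates the retention for 2010, but with a longer series it would generate it for all years except the first.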
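As a cross-check, the same definition can be sketched in base R without any join. This is only an illustration under a few assumptions: the sample data from the question is retyped by hand, years are consecutive so that "previous year" means year - 1, and retention_for is a hypothetical helper name, not part of any package.

```r
# Sample data from the question, with the column names used in this answer.
dd <- data.frame(
  id    = rep(1:5, times = 2),
  year  = rep(c("2009-10", "2010-11"), each = 5),
  group = c(1, 1, 2, 3, 3, 1, 2, 3, 2, 2),
  stringsAsFactors = FALSE
)
dd$year <- as.integer(substr(dd$year, 1, 4))  # "2009-10" -> 2009

# Hypothetical helper: retention for one group in one year.
# Assumes consecutive years, so the previous year is y - 1.
retention_for <- function(g, y, data) {
  current  <- data$id[data$group == g & data$year == y]
  previous <- data$id[data$group == g & data$year == y - 1]
  if (length(previous) == 0) return(NA_real_)  # nobody to retain
  length(intersect(current, previous)) / length(unique(previous))
}

# Every group/year combination except the first year.
result <- expand.grid(group = sort(unique(dd$group)),
                      year  = sort(unique(dd$year))[-1])
result$retention <- mapply(retention_for, result$group, result$year,
                           MoreArgs = list(data = dd))
print(result)
```

On the sample data this should give 0.5 for group 1 and 0 for groups 2 and 3 in 2010: student 2 left group 1, and none of the 2009 members of groups 2 and 3 stayed in the same group.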

UPDATE 1: Using plyr

names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))

dd.prev <- dd
dd.prev$year <- dd.prev$year + 1   ## shifting year upwards so it matches the next year
names(dd.prev)[1] <- 'id.prev'  ## changing variable name so it is distinct

extra.year <- max(dd$year) + 1  ## the shift generates an extra year
dd.prev <- dd.prev[dd.prev$year!=extra.year,]  ## drop extra year

dd <- merge(dd, dd.prev, all.y=TRUE)   ## join data from previous year to current data



require(plyr)
dd <- ddply(dd, .(group, year), summarize, 
             retention=length(intersect(id, id.prev)) 
                        /length(unique(id.prev)))
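If you want the result in the layout from your question (Year, Group, retention rate, with labels such as "2010-11"), something like the following might do. It is a sketch only: year_label is a hypothetical helper, and res is a hand-typed stand-in for the summarised result, using the values the sample data gives under the definition above.

```r
# Hypothetical helper: turn a start year such as 2010 into the label "2010-11".
year_label <- function(start_year) {
  sprintf("%d-%02d", start_year, (start_year + 1L) %% 100L)
}

# Stand-in for the summarised result: one row per group and year.
res <- data.frame(group     = c(1, 2, 3),
                  year      = c(2010L, 2010L, 2010L),
                  retention = c(0.5, 0, 0))

# Rename and reorder the columns, then sort by group and year.
out <- data.frame(Year           = year_label(res$year),
                  Group          = res$group,
                  retention_rate = res$retention,
                  stringsAsFactors = FALSE)
out <- out[order(out$Group, out$Year), ]
print(out)
```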

I hope that helps.

Licensed under: CC-BY-SA with attribution