Question

I'm trying something that I thought would be rather simple in R, but it is giving me more trouble than I bargained for. I'd like to use R to define spells based on multiple criteria, while ignoring missing data. The goal is then to compute mean wages across spells using the aggregate command. I suspect that the tools provided in the TraMineR package could be used to accomplish this, but I'm having a hard time figuring out how.

For example, given the following data:

Caseid     Year        Unemployed  EmployerID  occID   indID  Wage           
1          1999         0          1           1       1      5.00       
1          2000         NA         NA          NA      NA     NA       
1          2001         NA         NA          NA      NA     NA       
1          2002         0          1           1       2      6.00       
2          1999         0          1           1       1      4.00
2          2000         0          1           1       1      5.00
2          2001         0          1           1       1      6.00
2          2002         1          1           1       1      6.00
3          1999         0          1           1       1      4.00
3          2000         0          3           1       1      5.00
3          2001         0          1           4       1      5.00
3          2002         NA         NA          NA      NA     NA
4          1999         0          1           1       1      5.00
4          2000         0          1           1       1      5.00
4          2001         0          1           1       1      7.00
4          2002         0          1           1       1      7.00

I'd like to write code that starts a new spell whenever employment status, employer, occupation, or industry changes. In addition, I'd like to ignore missing values, so that a row of NAs stays in the preceding spell. Given that, the correct code should return the following vector for "Spell":

Caseid     Year        Unemployed  EmployerID  occID   indID  Wage   Spell         
1          1999         0          1           1       1      5.00   1     
1          2000         NA         NA          NA      NA     NA     1  
1          2001         NA         NA          NA      NA     NA     1  
1          2002         0          1           1       2      6.00   2    
2          1999         0          1           1       1      4.00   1
2          2000         0          1           1       1      5.00   1
2          2001         0          1           1       1      6.00   1
2          2002         1          1           1       1      6.00   2
3          1999         0          1           1       1      4.00   1
3          2000         0          3           1       1      5.00   2
3          2001         0          1           4       1      5.00   3
3          2002         NA         NA          NA      NA     NA     3
4          1999         0          1           1       1      5.00   1
4          2000         0          1           1       1      5.00   1
4          2001         0          1           1       1      7.00   1
4          2002         0          1           1       1      7.00   1

Ultimately I'd like to use the spell vector to average wages within each person's spells, returning the following:

Caseid     Year        Unemployed  EmployerID  occID   indID  Wage   Spell  avgWage         
1          1999         0          1           1       1      5.00   1      5.00
1          2000         NA         NA          NA      NA     NA     1      5.00
1          2001         NA         NA          NA      NA     NA     1      5.00
1          2002         0          1           1       2      6.00   2      6.00
2          1999         0          1           1       1      4.00   1      5.00
2          2000         0          1           1       1      5.00   1      5.00
2          2001         0          1           1       1      6.00   1      5.00
2          2002         1          1           1       1      6.00   2      6.00
3          1999         0          1           1       1      4.00   1      4.00
3          2000         0          3           1       1      5.00   2      5.00
3          2001         0          1           4       1      5.00   3      5.00
3          2002         NA         NA          NA      NA     NA     3      5.00
4          1999         0          1           1       1      5.00   1      6.00
4          2000         0          1           1       1      5.00   1      6.00
4          2001         0          1           1       1      7.00   1      6.00
4          2002         0          1           1       1      7.00   1      6.00

Here is the data for debugging. Note that the newemp (new employer) variable differs from EmployerID in the example above: it is a flag, and it should start a new spell only when its value is 1. A run of four years with newemp = 1 should therefore be four different spells, not one:

    df <- as.data.frame(structure(list(caseid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 
    4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
    4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), year = c(1979L, 
    1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 
    1989L, 1990L, 1991L, 1992L, 1993L, 1994L, 1996L, 1998L, 2000L, 
    2002L, 2004L, 2006L, 2008L, 2010L, 1979L, 1980L, 1981L, 1982L, 
    1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 
    1992L, 1993L, 1994L, 1996L, 1998L, 2000L, 2002L, 2004L, 2006L, 
    2008L, 2010L, 1979L, 1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 
    1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L, 1993L, 1994L, 
    1996L, 1998L, 2000L, 2002L, 2004L, 2006L, 2008L, 2010L, 1979L, 
    1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 
    1989L, 1990L, 1991L, 1992L, 1993L, 1994L, 1996L, 1998L, 2000L, 
    2002L, 2004L, 2006L, 2008L, 2010L, 1979L, 1980L, 1981L, 1982L, 
    1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 
    1992L, 1993L, 1994L, 1996L, 1998L, 2000L, 2002L, 2004L, 2006L, 
    2008L, 2010L), unemp = c(0, NA, 0, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 
    0, NA, 0, NA, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 
    0, NA, NA, NA, NA, 1, NA, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, NA, 0, 
    0, NA, 1, 0, NA, NA, NA, NA, NA, NA, NA, 0, 0, 1, 0, 0, NA, 0, 
    NA, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
    newemp = c(NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, NA, NA, NA, NA, NA, 0, 
    1, 1, 1, NA, 1, NA, 0, NA, NA, NA, NA, NA, 1, 0, 0, 1, 0, 
    1, NA, NA, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, 1, 0, 0, 1, NA, 0, 0, NA, 1, 0, NA, NA, NA, NA, 
    NA, NA, NA, 0, 0, 1, 1, 1, NA, 1, NA, 1, 0, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA), stocc = c(335, NA, 337, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, 337, 337, 337, 337, 337, 23, 386, 
    23, 23, 337, 389, 23, 337, 23, 337, NA, NA, NA, NA, NA, 276, 
    276, 276, NA, 383, 376, NA, 383, NA, NA, 383, NA, 447, 468, 
    155, 468, 373, 188, 243, NA, NA, 243, 22, 277, NA, 22, 469, 
    NA, NA, NA, NA, 274, NA, NA, NA, 313, 313, 313, 313, 313, 
    NA, 313, 313, NA, 313, 178, NA, NA, NA, NA, NA, NA, 329, 
    329, 329, 355, 223, 223, NA, 178, NA, 178, 178, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), stind = c(711, NA, 
    711, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, 172, 172, 172, 172, 172, 
    172, 172, 172, 172, 172, 172, 172, 172, 172, 172, NA, NA, 
    NA, NA, NA, 641, 641, 641, NA, 700, 700, NA, 700, NA, NA, 
    700, NA, 840, 770, 842, 862, 172, 623, 682, NA, NA, 682, 
    172, 671, NA, 791, 791, NA, NA, NA, NA, 591, NA, NA, NA, 
    841, 841, 841, 841, 841, NA, 712, 841, NA, 841, 841, NA, 
    NA, NA, NA, NA, NA, 850, 850, 850, 932, 850, 841, NA, 841, 
    NA, 841, 841, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA), lwage = c(2.14335489273071, NA, 2.0160756111145, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, 2.30358457565308, 2.34806489944458, 
    2.36942100524902, 2.41516351699829, 2.38407301902771, 2.23588967323303, 
    2.60783195495605, 2.58224511146545, 2.68043231964111, 2.70430994033813, 
    2.76339650154114, 2.76763892173767, 2.72537922859192, 2.83617949485779, 
    2.88961029052734, NA, NA, NA, NA, NA, 2.28949975967407, 2.15297079086304, 
    NA, NA, 2.25023865699768, 2.20731782913208, NA, 2.15908432006836, 
    NA, NA, 2.17475175857544, NA, 0.0605304837226868, 0.940007209777832, 
    2.2104697227478, 2.22159194946289, 0.130852773785591, 0.725372314453125, 
    2.02960777282715, NA, NA, 2.09433007240295, 2.38683438301086, 
    NA, NA, NA, NA, NA, NA, NA, NA, 1.89671993255615, NA, NA, 
    NA, 2.63665437698364, 2.7040421962738, 2.79728126525879, 
    2.72129535675049, 3.03042364120483, NA, 3.02664947509766, 
    2.7957558631897, NA, 2.86539578437805, 2.20382499694824, 
    NA, NA, NA, NA, NA, NA, 2.08691358566284, 2.03152418136597, 
    2.10608339309692, 2.17702174186707, 2.16355276107788, 3.65519332885742, 
    NA, 3.80884671211243, NA, 3.37032580375671, 3.52329707145691, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))))

Solution

Much less elegant than @BrodieG's data.table solution (which convinces me I really must familiarise myself with data.table!), but since I've coded it I may as well provide it here.

d <- read.table(text='Caseid     Year        Unemployed  EmployerID  occID   indID  Wage           
1          1999         0          1           1       1      5.00       
1          2000         NA         NA          NA      NA     NA       
1          2001         NA         NA          NA      NA     NA       
1          2002         0          1           1       2      6.00       
2          1999         0          1           1       1      4.00
2          2000         0          1           1       1      5.00
2          2001         0          1           1       1      6.00
2          2002         1          1           1       1      6.00
3          1999         0          1           1       1      4.00
3          2000         0          3           1       1      5.00
3          2001         0          1           4       1      5.00
3          2002         NA         NA          NA      NA     NA
4          1999         0          1           1       1      5.00
4          2000         0          1           1       1      5.00
4          2001         0          1           1       1      7.00
4          2002         0          1           1       1      7.00', header=TRUE)


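# Number the spells on the complete rows only, person by person, then merge
# the result back onto the full data so the NA rows are kept.
# Caveat: !duplicated() flags the first occurrence of each combination of
# columns 3:6, so returning to an earlier combination does not start a new spell.
d <- merge(unsplit(
  lapply(split(na.omit(d), na.omit(d)$Caseid), function(x) {
    cbind(x, Spell=cumsum(!duplicated(x[, 3:6])))
  }), 
  na.omit(d)$Caseid), d, all=TRUE)

# Mean wage per person and spell, merged back by Caseid and Spell.
d <- merge(d, aggregate(list(avgWage=d$Wage), 
                        list(Caseid=d$Caseid, Spell=d$Spell), 
                        mean, na.rm=TRUE), 
           all.x=TRUE)

d[order(d$Caseid, d$Year), ]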

Note, though, that this leaves Spell and avgWage as NA on rows that contain NA, because those rows are dropped by na.omit before the spells are numbered.
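If the NA rows should inherit the preceding spell, as in the question's expected output, one possible base R variant is sketched below. It is only a sketch under some assumptions: `spell_id` is a hypothetical helper name, the data frame is a hand-typed copy of the question's example, and a row counts as "missing" if any of the four key columns is NA. It compares each complete row with the previous complete row, so a return to an earlier combination also starts a new spell.

```r
# Toy data typed in from the question's example table.
d <- data.frame(
  Caseid     = rep(1:4, each = 4),
  Year       = rep(1999:2002, times = 4),
  Unemployed = c(0, NA, NA, 0,  0, 0, 0, 1,  0, 0, 0, NA,  0, 0, 0, 0),
  EmployerID = c(1, NA, NA, 1,  1, 1, 1, 1,  1, 3, 1, NA,  1, 1, 1, 1),
  occID      = c(1, NA, NA, 1,  1, 1, 1, 1,  1, 1, 4, NA,  1, 1, 1, 1),
  indID      = c(1, NA, NA, 2,  1, 1, 1, 1,  1, 1, 1, NA,  1, 1, 1, 1),
  Wage       = c(5, NA, NA, 6,  4, 5, 6, 6,  4, 5, 5, NA,  5, 5, 7, 7)
)
d <- d[order(d$Caseid, d$Year), ]
keys <- c("Unemployed", "EmployerID", "occID", "indID")

# Hypothetical helper: spell numbers for one person's key columns.
spell_id <- function(x) {
  obs <- which(complete.cases(x))            # rows with every key observed
  if (length(obs) == 0L) return(rep(NA_integer_, nrow(x)))
  # Collapse each complete row into one string and compare with the previous one.
  state  <- do.call(paste, c(x[obs, , drop = FALSE], sep = "\r"))
  change <- c(FALSE, state[-1L] != state[-length(state)])
  spell  <- cumsum(change) + 1L
  # For every row, find the last complete row at or before it (0 if none)
  # and reuse that row's spell; leading NA rows stay NA.
  last_obs <- findInterval(seq_len(nrow(x)), obs)
  c(NA_integer_, spell)[last_obs + 1L]
}

d$Spell   <- unsplit(lapply(split(d[keys], d$Caseid), spell_id), d$Caseid)
d$avgWage <- ave(d$Wage, d$Caseid, d$Spell,
                 FUN = function(w) mean(w, na.rm = TRUE))
d
```

On the example data this should reproduce the Spell and avgWage columns shown in the question, including the filled-in values on the NA rows.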

OTHER TIPS

Here is a data.table solution:

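library(data.table)
dt <- data.table(df)
# Within each person, flag observed rows where any of columns 3:6
# (Unemployed, EmployerID, occID, indID) differs from the previous observed row.
dt[
  !is.na(Unemployed), 
  change:=
    as.numeric(
      apply(
        # one logical column per variable; the first row is padded with 0 (no change)
        vapply(.SD, function(x) as.logical(c(0, diff(x))), logical(.N)),
        1,
        any
    ) ),
  by=Caseid, 
  .SDcols=3:6
]
# Rows skipped above keep change = NA; count them as "no change" so they stay in the current spell.
dt[, spell:=cumsum(ifelse(is.na(change), 0, change)) + 1, by=Caseid]
dt[, avgWage:=mean(Wage, na.rm=TRUE), by=list(Caseid, spell)]
dt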
#     Caseid Year Unemployed EmployerID occID indID Wage change spell avgWage
#  1:      1 1999          0          1     1     1    5      0     1       5
#  2:      1 2000         NA         NA    NA    NA   NA     NA     1       5
#  3:      1 2001         NA         NA    NA    NA   NA     NA     1       5
#  4:      1 2002          0          1     1     2    6      1     2       6
#  5:      2 1999          0          1     1     1    4      0     1       5
#  6:      2 2000          0          1     1     1    5      0     1       5
#  7:      2 2001          0          1     1     1    6      0     1       5
#  8:      2 2002          1          1     1     1    6      1     2       6
#  9:      3 1999          0          1     1     1    4      0     1       4
# 10:      3 2000          0          3     1     1    5      1     2       5
# 11:      3 2001          0          1     4     1    5      1     3       5
# 12:      3 2002         NA         NA    NA    NA   NA     NA     3       5
# 13:      4 1999          0          1     1     1    5      0     1       6
# 14:      4 2000          0          1     1     1    5      0     1       6
# 15:      4 2001          0          1     1     1    7      0     1       6
# 16:      4 2002          0          1     1     1    7      0     1       6    
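
To see what the vapply/apply step does, it may help to run it outside data.table on one person's columns. The snippet below is a minimal illustration, assuming the three observed rows of Caseid 3 typed in by hand; `change2` is just an alternative I'm adding for comparison, not part of the solution above.

```r
# Key columns for one person (observed rows of Caseid 3 in the question).
x <- data.frame(
  Unemployed = c(0, 0, 0),
  EmployerID = c(1, 3, 1),
  occID      = c(1, 1, 4),
  indID      = c(1, 1, 1)
)

# For each column: does the value differ from the previous row?
# The first row has no predecessor, so it is padded with 0 (no change).
changed <- vapply(x, function(col) as.logical(c(0, diff(col))),
                  logical(nrow(x)))

# A row starts a new spell if any column changed.
change <- as.numeric(apply(changed, 1, any))   # 0 1 1

# Spell number = 1 + running count of changes.
spell <- cumsum(change) + 1                    # 1 2 3

# Alternative without apply(): OR the per-column flags together.
change2 <- Reduce(`|`, lapply(x, function(col) c(FALSE, diff(col) != 0)))
```

One thing to watch: when a group has a single observed row, vapply returns a plain vector rather than a matrix, and apply(..., 1, any) will then fail. The Reduce form does not have that problem.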

Data, for debugging:

df <- structure(list(Caseid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L), Year = c(1999L, 2000L, 2001L, 2002L, 
1999L, 2000L, 2001L, 2002L, 1999L, 2000L, 2001L, 2002L, 1999L, 
2000L, 2001L, 2002L), Unemployed = c(0L, NA, NA, 0L, 0L, 0L, 
0L, 1L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L), EmployerID = c(1L, NA, 
NA, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, NA, 1L, 1L, 1L, 1L), occID = c(1L, 
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, NA, 1L, 1L, 1L, 1L), 
    indID = c(1L, NA, NA, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 
    1L, 1L, 1L, 1L), Wage = c(5, NA, NA, 6, 4, 5, 6, 6, 4, 5, 
    5, NA, 5, 5, 7, 7)), .Names = c("Caseid", "Year", "Unemployed", 
"EmployerID", "occID", "indID", "Wage"), class = "data.frame", row.names = c(NA, 
-16L))    

EDIT: updated to run with new data:

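library(data.table)
dt <- data.table(df)
# Turn the newemp flag into a running count per person, so that every
# newemp == 1 row differs from the row before it and registers as a change.
dt[!is.na(newemp), newemp:=cumsum(newemp), by=caseid]
# Flag rows where any of columns 3:6 (unemp, newemp, stocc, stind) changed.
dt[
  !is.na(unemp), 
  change:=
    as.numeric(
      apply(
        vapply(.SD, function(x) as.logical(c(0, diff(x))), logical(.N)),
        1,
        any
    ) ),
  by=caseid, 
  .SDcols=3:6
]
# NA in change counts as "no change" when numbering the spells.
dt[, spell:=cumsum(ifelse(is.na(change), 0, change)) + 1, by=caseid]
dt[, avgWage:=mean(lwage, na.rm=TRUE), by=list(caseid, spell)]
dt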

Note the new data has some additional issues that aren't fully dealt with (i.e. some rows are partially NA, instead of fully NA as in the original). You'll have to tinker with the logic to get it to do exactly what you want.
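
As a starting point for that tinkering, here is a rough base R sketch of one possible rule for partially missing rows. Everything in it is an assumption on my part rather than something the question specifies: a new spell starts when newemp is 1, or when unemp or stocc differs from that person's most recent non-missing value of the same column, and an NA is never treated as a change. The helpers `locf`, `changed_from_last` and `new_spell` are hypothetical names, and the `toy` data is made up for illustration (it is not the posted data).

```r
# Made-up toy data with partially missing rows (not the data from the question).
toy <- data.frame(
  caseid = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  year   = c(1979:1984, 1979:1981),
  unemp  = c(0, NA, 0, 0, 1, 0, 0, 0, 0),
  newemp = c(NA, NA, 0, 1, NA, 1, NA, 1, 1),
  stocc  = c(335, NA, NA, 337, 337, 337, 23, 23, 23),
  lwage  = c(2.0, NA, 2.2, 2.4, NA, 2.6, 1.0, 1.2, 1.4)
)

# Hypothetical helper: last observation carried forward (leading NAs stay NA).
locf <- function(v) {
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  c(NA, v)[idx + 1L]
}

# TRUE where v is observed and differs from the previous observed value.
changed_from_last <- function(v) {
  prev <- c(NA, locf(v)[-length(v)])
  !is.na(v) & !is.na(prev) & v != prev
}

# Assumed rule: newemp == 1 always starts a spell; otherwise a change
# in unemp or stocc does. Extend with stind in the same way if needed.
new_spell <- function(g) {
  flag <- !is.na(g$newemp) & g$newemp == 1
  flag | changed_from_last(g$unemp) | changed_from_last(g$stocc)
}

starts      <- unsplit(lapply(split(toy, toy$caseid), new_spell), toy$caseid)
toy$spell   <- ave(as.integer(starts), toy$caseid, FUN = cumsum) + 1L
toy$avgWage <- ave(toy$lwage, toy$caseid, toy$spell,
                   FUN = function(w) mean(w, na.rm = TRUE))
toy
# spell: 1 1 1 2 3 4 for caseid 1, and 1 2 3 for caseid 2
```

Person 2 shows the behaviour asked for in the question: consecutive years with newemp = 1 each get their own spell. A spell whose wages are all missing ends up with avgWage NaN, which you may want to recode to NA.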

Licensed under: CC-BY-SA with attribution