Transforming complex wide data to long in R

Question 1

You can do that with the plyr package:

# reading the data
df <- read.table(text = "id name gender job1 sjob1 ejob1 job2 sjob2 ejob2 job3 sjob3 ejob3
1  Jane F      100  1990  1992  103  1993  1995  104  1994  1997
2  Tom  M      200  1978  1980  400  1981  1985  NA   NA    NA", header = TRUE, strip.white = TRUE)

# required package
library(plyr)

# transforming the data
df2 <- rbind(
  ddply(df, .(id, name, gender), mutate, year = sjob1, job = job1),
  ddply(df, .(id, name, gender), mutate, year = ejob1, job = job1),
  ddply(df, .(id, name, gender), mutate, year = sjob2, job = job2),
  ddply(df, .(id, name, gender), mutate, year = ejob2, job = job2),
  ddply(df, .(id, name, gender), mutate, year = sjob3, job = job3),
  ddply(df, .(id, name, gender), mutate, year = ejob3, job = job3)
)

# getting rid of NAs & ordering the data frame by id
df2 <- na.omit(df2[order(df2$id), c("id", "name", "gender", "year", "job")])

Question 2

Here's a sketch using reshape2 and plyr:

Step 1: Reshape to a "long" format, which is somewhat different from what you're looking for:

library(reshape2)
df.m <- melt(df, id.vars=c("id", "name", "gender"))

This gives one row per person and original column, so the start year, end year and classification of every job all end up in the value column.

Step 2: Isolate the job ID:

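# trailing digits are the job ID, e.g. "sjob2" -> 2
df.m$job.id <- as.integer(sub("^.*job([0-9]+)$", "\\1", df.m$variable))
# strip all trailing digits so that multi-digit IDs such as "job10" also become "job"
df.m$variable <- sub("[0-9]+$", "", df.m$variable)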

Step 3: You can compute a table of job classifications (along with name and gender) for each person ID and job ID:

library(plyr)
df.jc <- rename(subset(df.m, variable == "job", select = c("id", "name", "gender", "job.id", "value")), c(value = "job"))

Step 4: To get a complete result, you'll need to dcast the data to get a "wide" format with two columns sjob and ejob and one observation per person ID per job ID. Then, you can adply to generate a sequence of years, and merge this back to df.jc.
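
For readers who want to see where Step 4 leads, here is one possible sketch. It makes a few assumptions that are not in the steps above: it uses base R's Map and rep instead of adply, it keeps job as a column in the dcast so that no merge with df.jc is needed, it treats each job as covering every year from sjob to ejob inclusive, and the names df.w and df.years are made up for illustration.

```r
library(reshape2)

# Same example data as in the first answer
df <- read.table(text = "id name gender job1 sjob1 ejob1 job2 sjob2 ejob2 job3 sjob3 ejob3
1 Jane F 100 1990 1992 103 1993 1995 104 1994 1997
2 Tom M 200 1978 1980 400 1981 1985 NA NA NA", header = TRUE, strip.white = TRUE)

# Steps 1 and 2: melt, then split e.g. "sjob2" into "sjob" and 2
df.m <- melt(df, id.vars = c("id", "name", "gender"))
df.m$job.id <- as.integer(sub("^[a-z]+", "", df.m$variable))
df.m$variable <- sub("[0-9]+$", "", df.m$variable)

# Step 4a: one row per person and job, with columns ejob, job and sjob
df.w <- dcast(df.m, id + name + gender + job.id ~ variable, value.var = "value")
df.w <- na.omit(df.w)  # drops Tom's empty third job

# Step 4b: repeat each row once per year between sjob and ejob (inclusive)
years <- Map(seq, df.w$sjob, df.w$ejob)
df.years <- df.w[rep(seq_len(nrow(df.w)), lengths(years)),
                 c("id", "name", "gender", "job")]
df.years$year <- unlist(years)
rownames(df.years) <- NULL
df.years
```

If only the start and end years are needed rather than every year in between, Step 4b can be skipped and sjob and ejob stacked directly.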

I haven't tested the code, and I couldn't work the last step through against your example data, because I cannot easily read it in. It would have helped greatly if you had shared your data with dput(). Please ask a separate, more detailed question for further problems, and provide some code and data.

Question 3

First, reshape the data from wide to long.

We use the reshape_toLong function from the onetree package for this.

devtools::install_github("yikeshu0611/onetree") # install onetree from GitHub
library(onetree)
df.long = reshape_toLong(data=df,
                        id="id",
                        j="new",
                        value.var.prefix = c("job","sjob","ejob"))
df.long
   id name gender new job sjob ejob
1  1 Jane      F   1 100 1990 1992
2  1 Jane      F   2 103 1993 1995
3  1 Jane      F   3 104 1994 1997
4  2  Tom      M   1 200 1978 1980
5  2  Tom      M   2 400 1981 1985
6  2  Tom      M   3  NA   NA   NA
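
If installing a package from GitHub is not an option, roughly the same long layout can be sketched with base R's reshape(). This is a swapped-in alternative, not how onetree does it; the object name df.long2 is hypothetical, and the column name new simply mirrors the j argument used above.

```r
# Same example data as in the first answer
df <- read.table(text = "id name gender job1 sjob1 ejob1 job2 sjob2 ejob2 job3 sjob3 ejob3
1 Jane F 100 1990 1992 103 1993 1995 104 1994 1997
2 Tom M 200 1978 1980 400 1981 1985 NA NA NA", header = TRUE, strip.white = TRUE)

# One entry in `varying` per output column; `times` numbers the jobs 1 to 3
df.long2 <- reshape(df,
                    direction = "long",
                    varying = list(paste0("job", 1:3),
                                   paste0("sjob", 1:3),
                                   paste0("ejob", 1:3)),
                    v.names = c("job", "sjob", "ejob"),
                    timevar = "new",
                    times = 1:3,
                    idvar = "id")

# reshape() stacks job 1 for everyone, then job 2, and so on; sort by person instead
df.long2 <- df.long2[order(df.long2$id, df.long2$new), ]
rownames(df.long2) <- NULL
df.long2
```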

Next, build two data frames, one for sjob and one for ejob.

In df.long the year variables sjob and ejob are in separate columns, so we create one data frame for each and then row-bind them.

df.sjob=df.long[,-7] #data with sjob
colnames(df.sjob)[6]="year" #change sjob to year

df.sjob
  id name gender new job year
1  1 Jane      F   1 100 1990
2  1 Jane      F   2 103 1993
3  1 Jane      F   3 104 1994
4  2  Tom      M   1 200 1978
5  2  Tom      M   2 400 1981
6  2  Tom      M   3  NA   NA

df.ejob=df.long[,-6] #data with ejob
colnames(df.ejob)[6]="year" #change ejob to year

df.ejob
  id name gender new job year
1  1 Jane      F   1 100 1992
2  1 Jane      F   2 103 1995
3  1 Jane      F   3 104 1997
4  2  Tom      M   1 200 1980
5  2  Tom      M   2 400 1985
6  2  Tom      M   3  NA   NA

Last step: rbind

Row-bind df.sjob and df.ejob:

rbind(df.sjob,df.ejob)

   id name gender new job year
1   1 Jane      F   1 100 1990
2   1 Jane      F   2 103 1993
3   1 Jane      F   3 104 1994
4   2  Tom      M   1 200 1978
5   2  Tom      M   2 400 1981
6   2  Tom      M   3  NA   NA
7   1 Jane      F   1 100 1992
8   1 Jane      F   2 103 1995
9   1 Jane      F   3 104 1997
10  2  Tom      M   1 200 1980
11  2  Tom      M   2 400 1985
12  2  Tom      M   3  NA   NA
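
The combined result still contains the all-NA rows for Tom's missing third job, and the start and end years of each job are far apart. A minimal sketch of a clean-up step follows, assuming you want the NA rows dropped and the rows sorted by person, job number and year. df.long is rebuilt by hand here so that the example runs without onetree, and the name df.all is made up for illustration.

```r
# df.long as printed above, rebuilt by hand so the sketch runs without onetree
df.long <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2),
  name   = c("Jane", "Jane", "Jane", "Tom", "Tom", "Tom"),
  gender = c("F", "F", "F", "M", "M", "M"),
  new    = c(1, 2, 3, 1, 2, 3),
  job    = c(100, 103, 104, 200, 400, NA),
  sjob   = c(1990, 1993, 1994, 1978, 1981, NA),
  ejob   = c(1992, 1995, 1997, 1980, 1985, NA)
)

# Stack start and end years, selecting columns by name instead of by position
df.sjob <- df.long[, names(df.long) != "ejob"]
names(df.sjob)[names(df.sjob) == "sjob"] <- "year"
df.ejob <- df.long[, names(df.long) != "sjob"]
names(df.ejob)[names(df.ejob) == "ejob"] <- "year"
df.all <- rbind(df.sjob, df.ejob)

# Drop rows with missing values, then sort by person, job number and year
df.all <- na.omit(df.all)
df.all <- df.all[order(df.all$id, df.all$new, df.all$year), ]
rownames(df.all) <- NULL
df.all
```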