Question

I thought I had transformed my wide data to long format, and I've been working with it for a while, but I recently found that something went wrong. Evidently my code was wrong, and I can't seem to fix it.

The wide data is complex because it includes information on when a person started his/her first job, second job, and so on. I want to turn this into panel data.

Thus the original data df looks like the following:

id name gender job1 sjob1 ejob1 job2 sjob2 ejob2 job3 sjob3 ejob3
1  Jane F      100  1990  1992  103  1993  1995  104  1994  1997
2  Tom  M      200  1978  1980  400  1981  1985  NA   NA    NA

The job numbers are job codes that indicate the type of job, e.g. managerial, sales, etc.

The above is a very short version of the full data I have. The desired output is:

id name gender year job
1  Jane F      1990 100
1  Jane F      1991 100
1  Jane F      1992 100
1  Jane F      1993 103
1  Jane F      1994 104
1  Jane F      1995 104
1  Jane F      1996 104
1  Jane F      1997 104
2  Tom  M      1978 200
2  Tom  M      1979 200
2  Tom  M      1980 200
2  Tom  M      1981 400
2  Tom  M      1982 400
2  Tom  M      1983 400
2  Tom  M      1984 400
2  Tom  M      1985 400

In total I have approximately 1600 observations (1600 people) in the wide version. I tried the following, but it did not work:

df_long <- reshape(df,
                   varying = c("job1", "job2", "job3"),
                   v.names = "job",
                   timevar = "year",
                   times = c("sjob1", "sjob2", "sjob3"),
                   direction = "long")

This did save the job codes in the order of sjob1, sjob2, sjob3 (the start year of each job), but the year column contains the literal strings sjob1, sjob2, sjob3 instead of the years stored in those columns:

id name gender year job
1  Jane F      sjob1 100
1  Jane F      sjob2 103
1  Jane F      sjob3 104
2  Tom  M      sjob1 200
2  Tom  M      sjob2 400
2  Tom  M      sjob3 NA

The above is only an example of the original data, so I would like to post my original data as well: https://www.dropbox.com/s/ygbkd91ataqkwz5/origin_wide.RData


Solution

You can do that with the plyr package. The code below returns one row for the start year and one row for the end year of each job:

# reading the data
df <- read.table(text = "id name gender job1 sjob1 ejob1 job2 sjob2 ejob2 job3 sjob3 ejob3
1  Jane F      100  1990  1992  103  1993  1995  104  1994  1997
2  Tom  M      200  1978  1980  400  1981  1985  NA   NA    NA", header = TRUE, strip.white = TRUE)

# needed package
library(plyr)

# transforming the data
df2 <- rbind(
  ddply(df, .(id, name, gender), mutate, year = sjob1, job = job1),
  ddply(df, .(id, name, gender), mutate, year = ejob1, job = job1),
  ddply(df, .(id, name, gender), mutate, year = sjob2, job = job2),
  ddply(df, .(id, name, gender), mutate, year = ejob2, job = job2),
  ddply(df, .(id, name, gender), mutate, year = sjob3, job = job3),
  ddply(df, .(id, name, gender), mutate, year = ejob3, job = job3)
)

# getting rid of NAs & ordering the data frame by id
# (columns 1:3 are id, name, gender; columns 13 and 14 are year and job)
df2 <- na.omit(df2[order(df2$id), c(1:3, 13, 14)])

Other tips

Here's a sketch using reshape2 and plyr:

Step 1: Reshape to a "long" format which is somewhat different to what you're looking for:

library(reshape2)
df.m <- melt(df, id.vars=c("id", "name", "gender"))

This will give you start and end times, and classification, for the jobs.

Step 2: Isolate the job ID:

df.m$job.id <- as.integer(gsub("^(.*job)([0-9]+)$", "\\2", df.m$variable))
# strip the trailing digits (this also works for job numbers of 10 and above)
df.m$variable <- sub("[0-9]+$", "", df.m$variable)

Step 3: You can compute a table of job classifications (along with name and gender) for each person ID and job ID:

library(plyr)
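# keep job.id so the table is keyed by person ID and job ID; plyr::rename() takes c(old = "new")
df.jc <- rename(subset(df.m, variable == "job", select = c("id", "name", "gender", "job.id", "value")), c(value = "job"))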

Step 4: To get a complete result, you'll need to dcast the data to get a "wide" format with two columns sjob and ejob and one observation per person ID per job ID. Then, you can adply to generate a sequence of years, and merge this back to df.jc.

I haven't tested the code, nor am I able to give you something for the last step, because I cannot easily read in your example data. It would have helped greatly if you had posted the dput() output of your data. Please ask a separate, more detailed question for further problems, and provide some code and data.
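Step 4 above could be sketched roughly as follows. This is a base R stand-in rather than the dcast/adply route described in the answer. It assumes a spell-level table with one row per person and job, typed in here by hand from the question's example; the names spells and panel are made up for illustration. It also assumes that rows with a missing start or end year have already been dropped, and that where two jobs overlap the later job should win, which is what the wanted output shows for Jane in 1994 and 1995.

```r
# Spell-level data: one row per person and job (values from the question's example)
spells <- data.frame(
  id     = c(1, 1, 1, 2, 2),
  name   = c("Jane", "Jane", "Jane", "Tom", "Tom"),
  gender = c("F", "F", "F", "M", "M"),
  job.id = c(1, 2, 3, 1, 2),
  job    = c(100, 103, 104, 200, 400),
  sjob   = c(1990, 1993, 1994, 1978, 1981),
  ejob   = c(1992, 1995, 1997, 1980, 1985)
)

# One vector of calendar years per spell, from start year to end year
years <- Map(seq, spells$sjob, spells$ejob)

# Repeat each spell row once per year it covers and attach the years
panel <- spells[rep(seq_len(nrow(spells)), lengths(years)),
                c("id", "name", "gender", "job.id", "job")]
panel$year <- unlist(years)

# Where spells overlap, keep the job with the highest job.id for that year
panel <- panel[order(panel$id, panel$year, panel$job.id), ]
panel <- panel[!duplicated(panel[c("id", "year")], fromLast = TRUE), ]

# Final column layout as in the wanted output
panel <- panel[, c("id", "name", "gender", "year", "job")]
rownames(panel) <- NULL
panel
```

Whether the later job should really override the earlier one in overlapping years is a modelling decision; if both should be kept, the duplicated() line can simply be left out.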

  1. First we reshape the data from wide to long

We use the reshape_toLong function from the onetree package to reshape the data from wide to long.

devtools::install_github("yikeshu0611/onetree") # install onetree from GitHub
library(onetree)
df.long = reshape_toLong(data=df,
                        id="id",
                        j="new",
                        value.var.prefix = c("job","sjob","ejob"))
df.long
   id name gender new job sjob ejob
1  1 Jane      F   1 100 1990 1992
2  1 Jane      F   2 103 1993 1995
3  1 Jane      F   3 104 1994 1997
4  2  Tom      M   1 200 1978 1980
5  2  Tom      M   2 400 1981 1985
6  2  Tom      M   3  NA   NA   NA
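
If installing onetree is not an option, a table of the same shape can probably be produced with base R's reshape(). The sketch below is not part of the original answer. It passes varying as a list with one element per set of columns, so that job, sjob and ejob each become one long column, and it reuses the column name new from above for the job index. The object name long.base is made up for illustration.

```r
# Example data copied from the question
df <- read.table(text = "id name gender job1 sjob1 ejob1 job2 sjob2 ejob2 job3 sjob3 ejob3
1 Jane F 100 1990 1992 103 1993 1995 104 1994 1997
2 Tom M 200 1978 1980 400 1981 1985 NA NA NA", header = TRUE)

# Each list element of 'varying' is stacked into the matching name in 'v.names'
long.base <- reshape(df,
                     direction = "long",
                     idvar     = "id",
                     varying   = list(c("job1", "job2", "job3"),
                                      c("sjob1", "sjob2", "sjob3"),
                                      c("ejob1", "ejob2", "ejob3")),
                     v.names   = c("job", "sjob", "ejob"),
                     timevar   = "new",   # job index: 1, 2, 3
                     times     = 1:3)

# reshape() stacks by job index; sort by person and job index instead
long.base <- long.base[order(long.base$id, long.base$new), ]
rownames(long.base) <- NULL
long.base
```

Empty job slots (Tom's third job here) are kept as NA rows, as in the onetree output above.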
  2. Build two data frames, one for sjob and one for ejob

In df.long the year variables sjob and ejob are in separate columns, so we build two data frames and then row-bind them.

df.sjob=df.long[,-7] #data with sjob
colnames(df.sjob)[6]="year" #change sjob to year

df.sjob
  id name gender new job year
1  1 Jane      F   1 100 1990
2  1 Jane      F   2 103 1993
3  1 Jane      F   3 104 1994
4  2  Tom      M   1 200 1978
5  2  Tom      M   2 400 1981
6  2  Tom      M   3  NA   NA

df.ejob=df.long[,-6] #data with ejob
colnames(df.ejob)[6]="year" #change ejob to year

df.ejob
  id name gender new job year
1  1 Jane      F   1 100 1992
2  1 Jane      F   2 103 1995
3  1 Jane      F   3 104 1997
4  2  Tom      M   1 200 1980
5  2  Tom      M   2 400 1985
6  2  Tom      M   3  NA   NA
  3. Last step: rbind

Row-bind df.sjob and df.ejob:

rbind(df.sjob,df.ejob)

   id name gender new job year
1   1 Jane      F   1 100 1990
2   1 Jane      F   2 103 1993
3   1 Jane      F   3 104 1994
4   2  Tom      M   1 200 1978
5   2  Tom      M   2 400 1981
6   2  Tom      M   3  NA   NA
7   1 Jane      F   1 100 1992
8   1 Jane      F   2 103 1995
9   1 Jane      F   3 104 1997
10  2  Tom      M   1 200 1980
11  2  Tom      M   2 400 1985
12  2  Tom      M   3  NA   NA
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow