Question

I have a small nagging question that I hope I might be able to get some help on...

My data frame has a personID and a houseID (as well as a distance between the two), though one person may be matched to more than one house. I want to reshape my data so that there is only one observation per person and multiple columns for houseID. I have read about melt and cast (or dcast) and am familiar enough with how to use them, but I am not sure how to create an indicator that differentiates between the first house associated with a given person and the second.

This is what my dataset currently looks like:

personID              schoolID distance
10007347    87-Intl Pre-School      171
10051332      1-Masaryk Towers      153
10066650 74-East Midtown Plaze      193
10066650 75-East Midtown Plaze      106
10066650 76-East Midtown Plaze      195
10078124    87-Intl Pre-School      158

This is what I want my dataset to look like before I melt:

personID              schoolID distance time
10007347    87-Intl Pre-School      171    1
10051332      1-Masaryk Towers      153    1
10066650 74-East Midtown Plaze      193    1
10066650 75-East Midtown Plaze      106    2
10066650 76-East Midtown Plaze      195    3
10078124    87-Intl Pre-School      158    1

In other words, I want to rank at the personID level. I thought there might be an R function I'm missing, but no luck yet. My hack solution was to set time to 1 for all observations at first, find all duplicates of personID, set the time of those duplicate observations to 2, find all duplicates of personID and time, set the time of those duplicates to 3, etc. This won't scale well though.

Using my poor solution on a smaller dataset, I melt() and then cast() using the reshape package to get data that looks like this:

personID             houseID_1             houseID_2             houseID_3
10007346    87-Intl Pre-School                  <NA>                  <NA>
10051331      1-Masaryk Towers                  <NA>                  <NA>
10066659 74-East Midtown Plaze 75-East Midtown Plaze 76-East Midtown Plaze
10078123    87-Intl Pre-School                  <NA>                  <NA>
10089347    87-Intl Pre-School                  <NA>                  <NA>
10100173    79-Waterside Plaza                  <NA>                  <NA>

I also have distance_1, distance_2, distance_3, but am leaving those out so it's easier to see my data.

If anyone could help with how to create the time variable, it would be much appreciated!

Thanks!


Solution

This is pretty easy with dplyr:

df <- read.csv(text =
"personID,schoolID,distance
10007347,87-Intl Pre-School,171
10051332,1-Masaryk Towers,153
10066650,74-East Midtown Plaze,193
10066650,75-East Midtown Plaze,106
10066650,76-East Midtown Plaze,195
10078124,87-Intl Pre-School,158")

library(dplyr)

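df %>% group_by(personID) %>% mutate(time = row_number(personID))

(Older dplyr releases chained with %.% instead of %>%; that operator is no longer available.)

In dplyr 0.2 and later, you won't need the variable inside row_number():

df %>% group_by(personID) %>% mutate(time = row_number())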
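
The argument to row_number() also decides the order of the numbering within each group. As a sketch, and assuming you would want the closest house to come first (the question does not say so), you can rank on distance instead. The name `ranked` is just for illustration:

```r
library(dplyr)

df <- read.csv(text =
"personID,schoolID,distance
10007347,87-Intl Pre-School,171
10051332,1-Masaryk Towers,153
10066650,74-East Midtown Plaze,193
10066650,75-East Midtown Plaze,106
10066650,76-East Midtown Plaze,195
10078124,87-Intl Pre-School,158")

# Assumption: number the houses of each person by increasing distance,
# rather than by their order of appearance in the data
ranked <- df %>%
  group_by(personID) %>%
  mutate(time = row_number(distance))
```

With no argument, row_number() simply follows the current row order, which is what the question's expected output shows.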

OTHER TIPS

The most likely candidate for a base R approach would be ave:

with(df, ave(personID, personID, FUN = seq_along))
# [1] 1 1 1 2 3 1

You will need to modify this if the "personID" column is a factor.
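
One possible modification, since ave writes its result into a copy of its first argument: pass a plain integer index as that first argument and use personID only for grouping. This is a sketch; the conversion to a factor is there only to mimic the problematic case.

```r
df <- read.csv(text =
"personID,schoolID,distance
10007347,87-Intl Pre-School,171
10051332,1-Masaryk Towers,153
10066650,74-East Midtown Plaze,193
10066650,75-East Midtown Plaze,106
10066650,76-East Midtown Plaze,195
10078124,87-Intl Pre-School,158")

# Simulate the case where personID is stored as a factor
df$personID <- factor(df$personID)

# The first argument is a row index (integer), so the type of
# personID no longer matters; personID is only the grouping variable
df$time <- ave(seq_along(df$personID), df$personID, FUN = seq_along)
```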

If your data were a data.table (let's call it "DT"), you can use sequence(.N) as follows:

DT[, time := sequence(.N), by = personID]
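
Once the time column exists (from any of the approaches above), the wide layout from the question can also be produced without melt and cast. A minimal sketch using base R's reshape(); the sep = "_" setting is my choice to get names in the style of houseID_1 and distance_1, and `wide` is an illustrative name:

```r
df <- read.csv(text =
"personID,schoolID,distance
10007347,87-Intl Pre-School,171
10051332,1-Masaryk Towers,153
10066650,74-East Midtown Plaze,193
10066650,75-East Midtown Plaze,106
10066650,76-East Midtown Plaze,195
10078124,87-Intl Pre-School,158")

# Per-person counter, built with the base R ave approach
df$time <- ave(seq_along(df$personID), df$personID, FUN = seq_along)

# One row per personID; schoolID and distance are spread across
# columns suffixed with the value of time (schoolID_1, distance_1, ...)
wide <- reshape(df, idvar = "personID", timevar = "time",
                direction = "wide", sep = "_")
```

People with fewer houses than the maximum get NA in the unused columns, as in the desired output in the question.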
Licensed under: CC-BY-SA with attribution