Question

I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent the fact that the survey question was asked in a particular month.

I wish to create a new set of variables that have month-invariant names; the value of these variables will correspond to the value of a month-variant question for the month observed.

Please see an example / fictitious data set:

require(data.table)

data <- data.table(month = rep(c('may', 'jun', 'jul'),  each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'),  each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'),  each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'),  each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

In this survey, there are really only two questions: "q1" and "q2". Each of these questions is repeatedly asked for several months. However, the observation contains a valid response only if the month observed in the data matches up with the survey question for a particular month.

For example: "may.q1" is observed as "yes" for any observation in "May". I would like a new "Q1" variable to represent "may.q1", "jun.q1", and "jul.q1". The value of "Q1" will take on the value of "may.q1" when the month is "may", and the value of "Q1" will take on the value of "jun.q1" when the month is "jun".

If I were to do this by hand using data.table, I would want something like:

mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = FALSE]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))

I would want this repeated "by = month".

If I were to use the "plyr" package for a data frame, I would solve using the following approach:

require(plyr)
data <- data.frame(data)

mdata <- ddply(data, .(month), function(dfmo) {
    dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
    names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
    return(dfmo)
})

Any help using a data.table method would be greatly appreciated, as my data are large. Thank you.


Solution

A different way to illustrate:

data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]

    month  may.q1     may.q2
 1:   may     yes       econ
 2:   may     yes       econ
 3:   may     yes       econ
 4:   may     yes       econ
 5:   may     yes       econ
 6:   jun   lunch      foggy
 7:   jun   lunch      foggy
 8:   jun   lunch      foggy
 9:   jun   lunch      foggy
10:   jun   lunch      foggy
11:   jul oranges heavy rain
12:   jul oranges heavy rain
13:   jul oranges heavy rain
14:   jul oranges heavy rain
15:   jul oranges heavy rain

But note that the column names come from the first group (you can rename them afterwards using setnames). This may also not be the most efficient approach if there are a great number of columns and only a few are needed. In that case, Arun's solution of melting to long format should be faster.
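As a hedged variation on the idea above (not the answer's own code), the month-invariant names can be produced directly by looking up each month's columns in .SD by name. This is a minimal sketch that assumes every month has exactly a ".q1" and a ".q2" column; the output names q1 and q2 are chosen here for illustration.

```r
library(data.table)

# Fictitious survey data from the question
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

# Within each month group, `month` is a length-one value, so paste0() builds
# that month's column name; .SD[[name]] extracts the column for the group.
res <- data[, .(q1 = .SD[[paste0(month, ".q1")]],
                q2 = .SD[[paste0(month, ".q2")]]),
            by = month]

print(res)
```

Because the result columns are named explicitly, no setnames call is needed afterwards. With many questions, the two lookups would have to be generated rather than typed out.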

OTHER TIPS

Edit: Seems very inefficient on bigger data. Check out @MatthewDowle's answer for a really fast and neat solution.

Here's a solution using data.table.

dd <- melt.dt(data, id.var=c("month"))[month == gsub("\\..*$", "", ind)][, 
        ind := gsub("^.*\\.", "", ind)][, split(values, ind), by=list(month)]

The function melt.dt is a small function I wrote (there are still improvements to be made) to melt a data.table, similar to the melt function in reshape2. Copy and paste the function shown below before trying out the code above.

melt.dt <- function(DT, id.var) {
    stopifnot(inherits(DT, "data.table"))
    measure.var <- setdiff(names(DT), id.var)
    ind <- rep.int(measure.var, rep.int(nrow(DT), length(measure.var)))
    m1  <- lapply(c("list", id.var), as.name)
    m2  <- as.call(lapply(c("factor", "ind"), as.name))
    m3  <- as.call(lapply(c("c", measure.var), as.name))    
    quoted <- as.call(c(m1, ind = m2, values = m3))
    DT[, eval(quoted)]
}

The idea: first, melt the data.table with id.var = month. All the melted column names are then of the form month.question. By stripping ".question" from the melted column and comparing the result with the month column, we can drop all the unnecessary entries. Once this is done, we no longer need the "month." prefix in the melted column "ind", so we use gsub to remove it and keep just q1, q2, etc. Finally, we have to reshape (or cast) the result. This is done by grouping by month and splitting the values column by ind (which holds either q1 or q2). You get two columns for every month, which are then stitched together to give the desired output.
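The same melt, filter, and cast steps can be sketched with the melt and dcast methods that ship with data.table itself, assuming a reasonably recent version of the package is installed. This is an illustration rather than the code used in this answer; the table name survey and the helper column id are introduced here so that rows can be matched up again when casting.

```r
library(data.table)

# Fictitious survey data from the question, under a separate name
survey <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
                     may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
                     jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
                     jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
                     may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                     jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                     jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

# Row identifier (an assumption of this sketch) so each response can be cast back
survey[, id := .I]

# Step 1: melt to long format; `ind` holds names such as "may.q1"
long <- melt(survey, id.vars = c("id", "month"),
             variable.name = "ind", value.name = "values",
             variable.factor = FALSE)

# Step 2: keep only rows whose month prefix matches the observed month
long <- long[month == sub("\\..*$", "", ind)]

# Step 3: drop the "month." prefix, leaving just "q1" or "q2"
long[, ind := sub("^.*\\.", "", ind)]

# Step 4: cast back to wide, one column per question
wide <- dcast(long, id + month ~ ind, value.var = "values")

print(wide)
```

The filter in step 2 is what discards the responses recorded under the wrong month, so the cast in step 4 sees exactly one value per row and question.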

What about something like this:

data <- data.table(
                   may.q1 = rep(c('yes', 'no', 'yes'),  each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'),  each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'),  each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5)
                   )


tmp <- reshape(data, direction = "long", varying = 1:6, sep = ".", timevar = "question")

str(tmp)
## Classes ‘data.table’ and 'data.frame':   30 obs. of  5 variables:
##  $ question: chr  "q1" "q1" "q1" "q1" ...
##  $ may     : chr  "yes" "yes" "yes" "yes" ...
##  $ jun     : chr  "breakfast" "breakfast" "breakfast" "breakfast" ...
##  $ jul     : chr  "oranges" "oranges" "oranges" "oranges" ...
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...

If you want to go further and melt this data again, you can use the reshape2 package:

require(reshape2)
## remove the id column if you want (id is the last col so ncol(tmp))
res <- melt(tmp[,-ncol(tmp), with = FALSE], measure.vars = c("may", "jun", "jul"), value.name = "response", variable.name = "month")

str(res)
## 'data.frame':    90 obs. of  3 variables:
##  $ question: chr  "q1" "q1" "q1" "q1" ...
##  $ month   : Factor w/ 3 levels "may","jun","jul": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response: chr  "yes" "yes" "yes" "yes" ...
Licensed under: CC-BY-SA with attribution