Question

I am having troubles converting a complicated data base output from its wide form to the long format. It has a couple of hundred rows and ~1,000 columns. It looks something like this:

wide df and melt result

The problem is that the variable of df1_long or the columns of df1_wide contain information that I want to select upon. So I rather want to have something like this:

ID       part   task    subgroup    type    result
Ind_A       a     12          aa       2    yes
Ind_A       a     12          bb       2    yes
Ind_A       b     12          aa       3    opt_1
Ind_A       b     13          aa       4    100
Ind_B       a     12          aa       2    no
Ind_B       a     12          bb       2    yes
Ind_B       b     12          aa       3    opt_2
Ind_B       b     13          aa       4    50
Ind_C       a     12          aa       2    no
Ind_C       a     12          bb       2    no
Ind_C       b     12          aa       3    opt_1
Ind_C       b     13          aa       4    200

I dont mind that the numerical values/results get transformed into characters.

Is there a easy way to accomplish that with the stack() or reshape() function? Or do I have to code a function that performes some strsplits commands on the column names to extract the relevant information and then store those in separate new columns.

The long format will definitely be much more easy to work with and ask questions to the data set. Coding something for >1000 columns with those cryptic columns names just looks like a nightmare to me.

Was it helpful?

Solution

Ok fine. I give in.

Here's your data:

df1_wide <- data.frame(v1 = c("Ind_A", "Ind_B", "Ind_C"), 
                       v2 = c("Y", "N", "N"), v3 = c("Y", "Y", "N"), 
                       v4 = c("op1_1", "opt_2", "opt_1"), 
                       v5 = c(100, 50, 200))
rownames(df1_wide) <- letters[1:3]
colnames(df1_wide) <- c("ID", "a_t12_aa (Type # 2)", "a_t12_bb (Type # 2)", 
                        "b_t12_aa (Type # 3)", "b_t13_aa (Type # 4)")
df1_wide
#      ID a_t12_aa (Type # 2) a_t12_bb (Type # 2) b_t12_aa (Type # 3) b_t13_aa (Type # 4)
# a Ind_A                   Y                   Y               op1_1                 100
# b Ind_B                   N                   Y               opt_2                  50
# c Ind_C                   N                   N               opt_1                 200

This is what you have done so far:

df1_long <- melt(df1_wide, id.vars="ID")

This is what it sounds like you want:

cbind(df1_long["ID"],
      colsplit(gsub("\\s|\\(Type|\\)|#", " ", df1_long$variable), 
               pattern="_|\\s+", 
               names = c("part", "task", "subgroup", "type")),
      df1_long["value"])
#       ID part task subgroup type value
# 1  Ind_A    a  t12       aa    2     Y
# 2  Ind_B    a  t12       aa    2     N
# 3  Ind_C    a  t12       aa    2     N
# 4  Ind_A    a  t12       bb    2     Y
# 5  Ind_B    a  t12       bb    2     Y
# 6  Ind_C    a  t12       bb    2     N
# 7  Ind_A    b  t12       aa    3 op1_1
# 8  Ind_B    b  t12       aa    3 opt_2
# 9  Ind_C    b  t12       aa    3 opt_1
# 10 Ind_A    b  t13       aa    4   100
# 11 Ind_B    b  t13       aa    4    50
# 12 Ind_C    b  t13       aa    4   200

I can't guarantee that the regular expression that I used with gsub will work with your actual data, but hopefully this is enough to point you in the right direction.

In the future, please note that sharing a screenshot of your data is pretty useless. Please share data as I did in this question, so that others can easily copy and paste to get started on trying to help you out.


You may want to consider that you rename your variables along the lines of a_t12_aa_2, in the future, in which case, the colsplit step would just be colsplit(variable, "_", c("part", "task", "subgroup", "type")).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top