Condensing multiple observations on the same individual into a single row, adding multiples as new columns

https://stackoverflow.com/questions/21538244

r
reshape

06-10-2022
|

Question

I have data that looks like this:

ID  X1    X2    X3

1   1.4   2     two
2   7.6   30    thirty
2   7.6   50    fifty
2   7.6   40    forty
3   5.6   40    forty
3   5.6   50    fifty
4   3.5   NA    NA
5   NA    2     two

ID indicates individuals, X1 is a measurement taken once per individual, and X2 and X3 are character and string representations of the same treatment. E.g., for individual 2, X1= 7.6, and was subjected to treatment X2 = 30, 50, and 40 (which has associated other info, X3 = thirty, fifty, and forty)

I want it to end up looking like this, with only one row per individual, and values from multiple observations on the same individual added as new columns:

ID   X1     X2a   X3a       X2b   X3b     X2c  X3c
1    1.4    2     two       NA    NA      NA   NA
2    7.6    30    thirty    50    fifty   40   forty
3    5.6    40    forty     50    fifty   NA   NA
4    3.5    NA    NA        NA    NA      NA   NA
5    NA     2     two       NA    NA      NA   NA

I want to keep rows with X1=NA (like subject 5), because those rows have measurements of interest in other columns that are mostly excluded for the simplicity of my example. My actual dataset is pretty big (10,000s of rows, and maybe 50 columns), so a method that can efficiently handle a fairly big dataset would be a plus.

I've been playing with melt() and cast() from the reshape package, which I think can do what I want. But, I am getting an error requiring me to aggregate (Error: Aggregation requires fun.aggregate: length used as default), which I don't want to do. Suggestions for getting melt and cast to cooperate? or a better way to do this?

I've found a number of similar questions (for example, this one), but none seem to address my problem as they don't start with a similar data structure or they don't care which instances of the data end up matched with which...

Solution

Here's one option:

library(data.table)
dt = data.table(your_df)

# get number of columns first (6 here)
max.N = max(dt[, .N*ncol(.SD), by = list(ID, X1)]$V1)

# now construct the result by filling in appropriate # of NA's
dt[, as.list(c(t(.SD), rep(NA, max.N - .N*ncol(.SD)))), by = list(ID, X1)]
#   ID  X1 V1     V2 V3    V4 V5    V6
#1:  1 1.4  2    two NA    NA NA    NA
#2:  2 7.6 30 thirty 50 fifty 40 forty
#3:  3 5.6 40  forty 50 fifty NA    NA
#4:  4 3.5 NA     NA NA    NA NA    NA
#5:  5  NA  2    two NA    NA NA    NA

OTHER TIPS

As of version 1.9.6 (on CRAN 2015-09-19), data.table's dcast() function can cast multiple value.var columns simultaneously.

dcast(DT, ID + X1 ~ rowid(ID), value.var = c("X2", "X3"))

   ID  X1 X2_1 X2_2 X2_3   X3_1  X3_2  X3_3
1:  1 1.4    2   NA   NA    two    NA    NA
2:  2 7.6   30   50   40 thirty fifty forty
3:  3 5.6   40   50   NA  forty fifty    NA
4:  4 3.5   NA   NA   NA     NA    NA    NA
5:  5  NA    2   NA   NA    two    NA    NA

Data

library(data.table)
DT <- fread(
"ID  X1    X2    X3
  1   1.4   2     two
  2   7.6   30    thirty
  2   7.6   50    fifty
  2   7.6   40    forty
  3   5.6   40    forty
  3   5.6   50    fifty
  4   3.5   NA    NA
  5   NA    2     two"
)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow