Question

I have the following data frame and I would like to create a new one that will be like the one below.

     ID1 ID2 ID3 ID4
x1_X 0   10  4   7
x2_X 2   12  5   8
x3_X 3   1   3   5
y1_Y 4   13  6   4
y2_Y 5   14  1   9
y3_Y 2   11  1   5
y4_Y 1   1   2   3
z1_Z 1   0   0   5
z2_Z 3   6   7   7

New data frame

    ID1 ID2 ID3 ID4
X   x3 x2 x2 x2
Y   y2 y2 y1 y2
Z   z2 z2 z2 z2

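Basically, the idea is the following: the part of each row name after the underscore is the group (X, Y, Z) and the part before it is the subgroup (x1, x2, ...). For each ID, I want to find which row within a group (e.g. x1_X, x2_X, x3_X) has the largest value, and assign that row's subgroup label to the group name X.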

My data frame is huge: 1700 columns and 100000 rows.

Solution

First we need to split the group and subgroup labels:

grp <- strsplit(row.names(df), "_")

And if performance is an issue, I think data.table is our best choice:

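library(data.table)
# second piece of each row name: the group (X, Y, Z)
df$group <- sapply(grp, "[", 2)
# first piece of each row name: the subgroup (x1, x2, ...)
subgroup <- sapply(grp, "[", 1)
dt <- data.table(df)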
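
To make the splitting step concrete, here is a minimal sketch on three made-up row names. The names labels, parts, sub_lab and grp_lab are only for illustration and are not part of the answer's code; vapply is used instead of sapply simply because it checks that each piece is a single string.

```r
# Hypothetical row names in the same "subgroup_group" format as the question
labels <- c("x1_X", "x2_X", "y1_Y")

# strsplit returns a list with one character vector per label
parts <- strsplit(labels, "_", fixed = TRUE)

# Pull out the first and second piece of every label
sub_lab <- vapply(parts, `[`, character(1), 1)
grp_lab <- vapply(parts, `[`, character(1), 2)

print(sub_lab)
print(grp_lab)
```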

With that in place, the whole computation fits in a single line:

result <- dt[, lapply(.SD, function(x) subgroup[.I[which.max(x)]]), by = group]

This splits the data.table by the character after the underscore (by=group). Then, for every column of each group's subset (.SD), which.max gives the position of the maximum within that subset, .I maps that position back to a row number in the whole data.table, and that row number is used to look up the matching label in subgroup. Note that which.max returns the first maximum, so ties go to the earliest row in the group.
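
For reference, the steps above can be put together into one script on the sample data from the question. This is only a sketch that assumes the data.table package is installed; it adds nothing beyond the code already shown, apart from reading the table in.

```r
library(data.table)

# Sample data from the question; the header has one field fewer than the
# data lines, so read.table uses the first column as row names
df <- read.table(text = "     ID1 ID2 ID3 ID4
x1_X 0   10  4   7
x2_X 2   12  5   8
x3_X 3   1   3   5
y1_Y 4   13  6   4
y2_Y 5   14  1   9
y3_Y 2   11  1   5
y4_Y 1   1   2   3
z1_Z 1   0   0   5
z2_Z 3   6   7   7", header = TRUE)

# Split "x1_X" into subgroup "x1" and group "X"
grp <- strsplit(row.names(df), "_")
df$group <- sapply(grp, "[", 2)
subgroup <- sapply(grp, "[", 1)
dt <- data.table(df)

# Per group and per ID column: label of the row holding the maximum
result <- dt[, lapply(.SD, function(x) subgroup[.I[which.max(x)]]), by = group]
print(result)
```

On the sample data this should give x3, x2, x2, x2 for group X, y2, y2, y1, y2 for Y, and z2 in every column for Z, which matches the table requested in the question.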

The data.table package is designed to be efficient, though you might want to look into setting a key on your data.table (see setkey) if you're going to be querying it multiple times.

Other tips

Your table:

df <- read.table (text= "     ID1 ID2 ID3 ID4
x1_X 0   10  4   7
x2_X 2   12  5   8
x3_X 3   1   3   5
y1_Y 4   13  6   4
y2_Y 5   14  1   9
y3_Y 2   11  1   5
y4_Y 1   1   2   3
z1_Z 1   0   0   5
z2_Z 3   6   7   7", header = TRUE)

Split rownames to get groups:

library(plyr)
df_names <- ldply(strsplit (rownames(df), "_"))
colnames(df_names) <- c ("group1", "group2")

df2 <- cbind (df, df_names)

Create new table:

df_new <- data.frame (matrix(nrow = length(unique (df2$group2)), 
                        ncol = ncol(df)))
colnames(df_new) <- colnames(df)
rownames (df_new) <- unique (df_names[["group2"]])

Filling new table with a loop:

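for (i in seq_len(ncol(df_new))) {
  for (k in seq_len(nrow(df_new))) {
    col0 <- colnames(df_new)[i]   # current ID column
    row0 <- rownames(df_new)[k]   # current group (X, Y, Z)

    # rows of this group: the ID column plus the subgroup labels
    sub0 <- df2[df2$group2 == row0, c(col0, "group1")]
    # which.max returns the first maximum, so a tie still yields one label
    df_new[k, i] <- as.character(sub0[which.max(sub0[[1]]), 2])
  }
}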
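
With 1700 columns and 100000 rows, the nested loop above may be slow, since it subsets the data frame once per cell. As a possible alternative, here is a base R sketch that computes the row indices of each group once and reuses them for every column. It is not part of the original answer, and the names groups, subgroups, idx_by_group and res are chosen only for this example.

```r
# Sample data from the question
df <- read.table(text = "     ID1 ID2 ID3 ID4
x1_X 0   10  4   7
x2_X 2   12  5   8
x3_X 3   1   3   5
y1_Y 4   13  6   4
y2_Y 5   14  1   9
y3_Y 2   11  1   5
y4_Y 1   1   2   3
z1_Z 1   0   0   5
z2_Z 3   6   7   7", header = TRUE)

# Group is the part after the underscore, subgroup the part before it
groups    <- sub(".*_", "", rownames(df))
subgroups <- sub("_.*", "", rownames(df))

# Row indices of each group, computed once
idx_by_group <- split(seq_len(nrow(df)), groups)

# For every column and every group: subgroup label of the first maximum
res <- sapply(df, function(col) {
  vapply(idx_by_group,
         function(idx) subgroups[idx[which.max(col[idx])]],
         character(1))
})
print(res)
```

The result is a character matrix with one row per group and one column per ID; wrap it in as.data.frame() if a data frame is needed.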
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow