Question

I am looking for a solution to expand a large data frame in R into more columns and more rows, given the values of a given column.

Right now I am doing this using a for-loop approach but I am sure there are more crantastic/efficient ways to achieve the same results...

The example will make the question more clear I think. Let us imagine we have a data frame containing student's information about their grades at three different stages in life. The student IDs are s1, s2 and s3; and we have measurements of their grades in three different times in their life, m1, m2, and m3; and then at each stage we have a column called more.info with their grades in their courses, encoded as class#topic#grade across all classes taken.

library(stringr)
options(stringsAsFactors=FALSE)
example.df = data.frame(measure.id = c("m1", "m2", "m3", "m2", "m2", "m3", "m1", "m1", "m3"),
                        student.id = c("s1", "s1", "s1", "s2", "s3", "s3", "s2", "s3", "s2"),
                        more.info = c("draw#drawing#4.5;music#singing#5.6;dance#ballet#6.7", "bio#biology#5.6;math#algebra#4.5", "calculus#univariate#6.2; physics#quantum#4.5;chemistry#organic#4.5", 
                                      "bio#biology#5.6;math#algebra#4.5", "bio#biology#3.6;math#algebra#3.5", "calculus#univariate#5.2; physics#quantum#5.2;chemistry#organic#4", "draw#drawing#5;music#singing#5.6;dance#ballet#5.7", 
                                      "draw#drawing#2.5;music#singing#3.6;dance#ballet#4", "calculus#univariate#5.2; physics#quantum#6.5;chemistry#organic#5"))
measure.ids = unique(example.df$measure.id)

Then, I'd like to create a new data frame that splits the more.info information and creates a new data frame with more rows and more columns as follows,

new.df=data.frame()
splitit <- function(x){
  strsplit(x, '#')
}
for(i in 1:length(measure.ids)){
  measure.id = measure.ids[i]
  tmp = example.df[example.df==measure.id,]
  more.info = tmp$more.info
  more.info = strsplit(more.info,";")
  student.ids = tmp$student.id
  for(j in 1:length(more.info))
  {
    info = more.info[[j]]
    a = sapply(info, splitit)
    b = sapply(a, "[[", 1)
    d = sapply(a, "[[", 2)
    e = sapply(a, "[[", 3)
    new.df = rbind(new.df, 
                   data.frame(measure.id = rep(measure.id, length(info)),
                              student.id = rep(tmp$student.id[j], length(info)),
                              class = b, 
                              topic = d,
                              grade = e)
                   )
  }
}

What'd be the most efficient way to achieve this in R? I am open to apply functions, map/reduce approaches, mclapply for using more cores, etc...

Was it helpful?

Solution

Solution with base functions:

# split column by all available separators 
a <- strsplit(example.df$more.info, "; |#|;")
# represent each result as a matrix with 3 columns
a <- lapply(a, function(v) matrix(v, ncol=3, byrow=TRUE))
# combine all matrixes in one big matrix
aa <- do.call(rbind, a)
# create indices of rows of initial data.frame which corresponds to the created big matrix
b <- unlist(sapply(seq_along(a), function(i) rep(i, nrow(a[[i]]))))
# combine initial data.frame and created big matrix
df <- cbind(example.df[b,], aa)
# remove unnecessary columns and rename remaining ones
df <- df[,-3]
colnames(df)[3:5] <- c("class", "topic", "grade")

To increase the speed you may replace all functions of apply family in my code with mclapply.

I cannot compare the speed since your dataset is very small.

OTHER TIPS

Here is another approach using data.table.

Basically, I put the entire data transformation procedure in one line.

# Load R package
library(data.table)    

# Convert to data.table object
example.dt <- as.data.table(example.df)

# Transform the data
final.dt <- example.dt[, data.table(do.call(rbind, unlist(lapply(strsplit(x=more.info, split=";"), strsplit, "#"), recursive=FALSE))), by=c("measure.id", "student.id")]

# Rename variables
setnames(final.dt, old=c("V1", "V2", "V3"), new=c("class", "topic", "grade"))


# > final.dt
#     measure.id student.id class   topic grade
#  1:         m1         s1  draw drawing   4.5
#  2:         m1         s1 music singing   5.6
#  3:         m1         s1 dance  ballet   6.7
#  4:         m2         s1 dance drawing   5.6
#  5:         m2         s1  draw  ballet   4.5
#  6:         m3         s1  draw singing   5.6
#  7:         m3         s1 dance drawing   4.5
#  8:         m3         s1 music  ballet   4.5
#  9:         m2         s2 dance drawing   5.6
# 10:         m2         s2  draw  ballet   4.5
# 11:         m2         s3 dance drawing   5.6
# 12:         m2         s3  draw  ballet   4.5
# 13:         m3         s3  draw singing   5.6
# 14:         m3         s3 dance drawing   5.6
# 15:         m3         s3 music  ballet   4.5
# 16:         m1         s2  draw drawing   4.5
# 17:         m1         s2 music singing   5.6
# 18:         m1         s2 dance  ballet   6.7
# 19:         m1         s3  draw drawing   4.5
# 20:         m1         s3 music singing   5.6
# 21:         m1         s3 dance  ballet   6.7
# 22:         m3         s2  draw singing   5.6
# 23:         m3         s2 dance drawing   6.7
# 24:         m3         s2 music  ballet   4.5
#     measure.id student.id class   topic grade

This answer has some approaches that might be useful for speeding up your code (e.g mclapply and the data.table package).

require("data.table")
require("parallel")
require("plyr")

#Note the mclapply function.  If you
#are running Mac or Linux, this should be more efficient for you
list.of.dfs <- mclapply(strsplit(example.df$more.info, "; |#|;"),FUN=function(x) as.data.frame(t(x)),mc.cores=1)
combined.df <- rbind.fill(list.of.dfs)


#Use data.table for speed and efficiency.
#example.df <- data.table(cbind(example.df,combined.df))
example.df <- data.table(example.df)
example.df[,paste0(c("class","topic","grade"),
             c(rep(1,3),rep(2,3),rep(3,3))):=lapply(combined.df,I)]

#delete unnecessary column
example.df[,more.info:=NULL]


#rbindlist final table (efficient way to rbind)
table1 <- example.df[,list(measure.id,student.id,class=class1,topic=topic1,grade=grade1)]
table2 <- example.df[,list(measure.id,student.id,class=class2,topic=topic2,grade=grade2)]
table3 <- example.df[,list(measure.id,student.id,class=class3,topic=topic3,grade=grade3)]

#final results
final.table <- rbindlist(list(table1,table2,table3))[!is.na(class)]
final.table

Perhaps you can try out two functions that I'm written, concat.split.DT and cSplit. Both are presently available as GitHub Gists which can easily be loaded with the "devtools" package.

library(devtools)
source_gist(6873058)  # for concat.split.DT
source_gist(11380733) # for cSplit

concat.split.DT(cSplit(example.df, splitCols="more.info", sep=";", direction="long"), 
                splitcols="more.info", sep="#")
#     measure.id student.id more.info_1 more.info_2 more.info_3
#  1:         m1         s1        draw     drawing         4.5
#  2:         m1         s1       music     singing         5.6
#  3:         m1         s1       dance      ballet         6.7
#  4:         m2         s1         bio     biology         5.6
#  5:         m2         s1        math     algebra         4.5
#  6:         m3         s1    calculus  univariate         6.2
#  7:         m3         s1     physics     quantum         4.5
#  8:         m3         s1   chemistry     organic         4.5
#  9:         m2         s2         bio     biology         5.6
# 10:         m2         s2        math     algebra         4.5
# 11:         m2         s3         bio     biology         3.6
# 12:         m2         s3        math     algebra         3.5
# 13:         m3         s3    calculus  univariate         5.2
# 14:         m3         s3     physics     quantum         5.2
# 15:         m3         s3   chemistry     organic         4.0
# 16:         m1         s2        draw     drawing         5.0
# 17:         m1         s2       music     singing         5.6
# 18:         m1         s2       dance      ballet         5.7
# 19:         m1         s3        draw     drawing         2.5
# 20:         m1         s3       music     singing         3.6
# 21:         m1         s3       dance      ballet         4.0
# 22:         m3         s2    calculus  univariate         5.2
# 23:         m3         s2     physics     quantum         6.5
# 24:         m3         s2   chemistry     organic         5.0
#     measure.id student.id more.info_1 more.info_2 more.info_3

The column names can be easily changed later on with setnames.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top