Question

I'm pretty new to R, but it seems that this is a specific problem to which I have not been able to find an answer.

My program reads in some data, then rbinds certain columns of that data to one of several data frames based on a vector of column numbers I pass it, so something like this:

filename <- c("vector", "full", "of", "filenames")
colVal <- 32
InMat <- data.frame()
for (i in 1:length(filename)){
  file <- read.table(filename[i], header=TRUE, fill=TRUE, stringsAsFactors=FALSE)
  InMat <- rbind(InMat, file[2:dim(file)[1], colVal])
  #...other matrices...
}

My issue lies in the case where there is only one desired column, i.e. colVal takes one value. In this case, I find that InMat is essentially transposed from what I would require. Worse, when I read in multiple files, it rbinds the transposed desired column, so I get a number of rows equal to the number of files I'm reading, with as many columns as there are rows in each desired column of each file.

It seems that if there are two or more desired columns (i.e. colVal takes two or more values), then it acts as I expect: each column is read and stored in InMat as a column, and the columns from each additional file are stored below.

My question is: why does rbind act differently when only one desired column value is passed to it, and is there an easy way (read: not adding some clunky if or for loop to check) to avoid this?

Thanks!


Solution

Short answer: [.data.frame (the [ operator on data frames) by default reduces its output to the lowest possible dimension (via the argument drop=TRUE). If you pull out just one column, the result is converted to a vector, and rbind treats each vector as one row, which is why the result looks transposed. When you extract two or more columns, you get a data frame, so the output of rbind is a data frame.
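To see the dropping behaviour in isolation, here is a minimal sketch on a toy data frame (the data frame df and its columns are made up for illustration and stand in for one file read by read.table):

```r
# Toy data frame standing in for one file read by read.table()
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

one_col <- df[2:3, 2]                # one column: dropped to a plain vector
two_col <- df[2:3, c(2, 3)]          # two columns: still a data frame
kept    <- df[2:3, 2, drop = FALSE]  # one column, data frame preserved

is.data.frame(one_col)  # FALSE
is.data.frame(two_col)  # TRUE
is.data.frame(kept)     # TRUE

# rbind() on plain vectors makes each vector one ROW of a matrix
m <- rbind(one_col, one_col)
dim(m)  # 2 2

# rbind() on one-column data frames stacks them vertically
d <- rbind(kept, kept)
dim(d)  # 4 1
```

The matrix m has one row per vector, which matches the "one row per file" shape described in the question.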

The quick fix is to change this line:

InMat <- rbind(InMat, file[2:dim(file)[1], colVal]) #old line
InMat <- rbind(InMat, file[2:dim(file)[1], colVal, drop=FALSE]) #new line

A more R-like way of coding this would be to use lapply and call rbind once. Growing an object forces R to allocate a new object and copy the old contents every time, so building a result by repeated concatenating/adding is quite inefficient (see the second circle of The R Inferno).

filename <- c("vector", "full", "of", "filenames")
colVal <- 32
dfm <- lapply(filename, read.table,
              header=TRUE, fill=TRUE, stringsAsFactors=FALSE)
dfm <- lapply(dfm, `[`, colVal)  # list-style indexing keeps a one-column data frame
dfm <- do.call(rbind, dfm)       # bind all pieces in a single call
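If you want to try the lapply/do.call pattern without your real files, the following is a rough, self-contained sketch. The two temporary files, their contents, and colVal = 2 are assumptions made up for the demo; the first data row is skipped with -1 to mirror the loop in the question.

```r
# Hypothetical toy files standing in for the real inputs
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
write.table(data.frame(x = 1:3, y = c(10, 20, 30)), paths[1], row.names = FALSE)
write.table(data.frame(x = 4:6, y = c(40, 50, 60)), paths[2], row.names = FALSE)

colVal <- 2  # assumed column position for this toy data

# Read every file into a list of data frames
tables <- lapply(paths, read.table, header = TRUE, fill = TRUE,
                 stringsAsFactors = FALSE)

# Skip the first data row and keep the result a data frame
pieces <- lapply(tables, function(d) d[-1, colVal, drop = FALSE])

# Bind all pieces in one call instead of growing InMat inside a loop
stacked <- do.call(rbind, pieces)
dim(stacked)  # 4 1

unlink(paths)  # remove the temporary files
```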

If you know the positions of the columns you want to extract beforehand, you could use the colClasses argument of read.table and skip over reading the entire table:

filename <- c("vector", "full", "of", "filenames")
colVal <- 32
cc <- rep.int("NULL",40) #where 40 is # of columns in table
cc[colVal] <- NA 
dfm <- lapply(filename, read.table
  , header=TRUE, fill=TRUE, colClasses=cc, stringsAsFactors=FALSE)
dfm <- do.call(rbind,dfm)
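As a quick check of the colClasses trick, here is a small sketch on a made-up three-column file (the temporary file, its contents, and colVal = 3 are assumptions for illustration). Because read.table always returns a data frame, a single kept column never collapses to a vector:

```r
# Hypothetical three-column file
path <- tempfile(fileext = ".txt")
write.table(data.frame(a = 1:3, b = c("u", "v", "w"), c = c(1.5, 2.5, 3.5)),
            path, row.names = FALSE)

colVal <- 3
cc <- rep.int("NULL", 3)  # "NULL" tells read.table to skip that column
cc[colVal] <- NA          # NA lets read.table work out the column type

d <- read.table(path, header = TRUE, colClasses = cc, stringsAsFactors = FALSE)
names(d)          # "c"
is.data.frame(d)  # TRUE

unlink(path)  # remove the temporary file
```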

OTHER TIPS

When you take only one column, it becomes a vector. It would be better to just append all the values to a vector instead of a matrix:

InVec <- c()
for (i in 1:length(filename)){
  file <- read.table(filename[i], header=TRUE, fill=TRUE, stringsAsFactors=FALSE)
  InVec <- c(InVec, file[-1, colVal])
  #...other matrices...
}

Using c() will be much faster than rbind as well.
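If a plain vector is all you need, you can also avoid growing InVec inside the loop. This is only a sketch: the list tables below holds made-up data frames standing in for the files already read in, and colVal = 2 is an assumption for the toy data.

```r
# Made-up data frames standing in for the files already read with read.table()
tables <- list(
  data.frame(x = 1:3, y = c(10, 20, 30)),
  data.frame(x = 4:6, y = c(40, 50, 60))
)
colVal <- 2  # assumed column position for this toy data

# Extract the column from each table (skipping the first row), then flatten once
InVec <- unlist(lapply(tables, function(d) d[-1, colVal]))
InVec  # 20 30 50 60
```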

Licensed under: CC-BY-SA with attribution