Question

I have a problem with the way R coerces column types when I rbind two data.frames, one of which contains only NA values. An example:

x<-factor(sample(1:3,10,TRUE))
y<-rnorm(10)
dat<-data.frame(x,y)
NAs<-data.frame(matrix(NA,ncol=ncol(dat),nrow=nrow(dat)))
colnames(NAs)<-colnames(dat)

Now the goal is to append dat and NAs while keeping the column types of x (factor) and y (numeric). When I run:

dat_forward<-rbind(dat,NAs)
is.factor(dat_forward$x)

this works fine. However, binding in the reverse order fails:

dat_backward<-rbind(NAs,dat)
is.factor(dat_backward$x)
is.character(dat_backward$x)

Now x is coerced to character. I am confused: can't it stay a factor even if I bind in the other order? What would be a straightforward change to my code to reach my goal?

Solution

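Here's a fairly simple way to get the column classes right: put one row of dat first, so that rbind takes the column classes from it, then drop that row again: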

x <- rbind(dat[1,], NAs, dat)[-1,]
str(x)
#  $ x: Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
#  $ y: num  NA NA NA NA NA NA NA NA NA NA ...

More generally, if you need this often, you could write an rbind-like function that takes an additional argument naming the data.frame whose column classes all the other data.frames' columns should be coerced to:

myrbind <- function(x, ..., template=x) {
    do.call(rbind, c(list(template[1,]), list(x), list(...)))[-1,]
}

str(myrbind(NAs, dat,  template=dat))
# 'data.frame': 20 obs. of  2 variables:
#  $ x: Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
#  $ y: num  NA NA NA NA NA NA NA NA NA NA ...

## If no 'template' argument is supplied, myrbind acts just like rbind    
str(myrbind(dat, NAs))
# 'data.frame': 20 obs. of  2 variables:
#  $ x: Factor w/ 3 levels "1","2","3": 3 3 3 3 2 3 1 1 3 2 ...
#  $ y: num  0.303 1.77 -1.38 1.731 0.033 ...

OTHER TIPS

Similarly, you could just convert the column in NAs to a factor before binding:

NAs$x<-factor(NAs$x)
dat_backward<-rbind(NAs,dat) 
is.factor(dat_backward$x) # TRUE
is.character(dat_backward$x) # FALSE
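
A slightly more explicit variant of the same idea, sketched here on the assumption that the all-NA column should carry the same levels as dat$x, passes those levels when creating the factor. The set.seed call is not in the question and is added only for reproducibility.

```r
set.seed(42)  # assumption: added for reproducibility, not part of the question
x <- factor(sample(1:3, 10, TRUE))
y <- rnorm(10)
dat <- data.frame(x, y)
NAs <- data.frame(matrix(NA, ncol = ncol(dat), nrow = nrow(dat)))
colnames(NAs) <- colnames(dat)

# Give the all-NA column the same levels as the real data
NAs$x <- factor(NAs$x, levels = levels(dat$x))

# Bind in the "backward" order and inspect the result
dat_backward <- rbind(NAs, dat)
print(is.factor(dat_backward$x))
print(levels(dat_backward$x))
```

Because the first data.frame already has a factor with the right levels, the values from dat should keep their original labels in the combined column.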

rbind on data.frames handles mixed column types poorly, especially when factors are involved. If you use data.table (1.8.11+) instead, the order of the arguments no longer matters:

library(data.table)
dt1 = data.table(dat)
dt2 = data.table(NAs)

sapply(rbind(dt1, dt2), class)
#        x         y 
# "factor" "numeric" 
sapply(rbind(dt2, dt1), class)
#        x         y 
# "factor" "numeric" 

From ?rbind.data.frame, we read: "It then takes the classes of the columns from the first data frame...". This is why you're seeing the order matter in your call to rbind.
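
A quick way to see this for the data in the question is to compare the column classes before and after binding. The sketch below rebuilds dat and NAs as in the question; the set.seed call and the classes_* variable names are additions for illustration only.

```r
set.seed(42)  # assumption: added for reproducibility, not part of the question
x <- factor(sample(1:3, 10, TRUE))
y <- rnorm(10)
dat <- data.frame(x, y)

# matrix(NA, ...) is a logical matrix, so both columns of NAs are logical
NAs <- data.frame(matrix(NA, ncol = ncol(dat), nrow = nrow(dat)))
colnames(NAs) <- colnames(dat)

classes_NAs <- sapply(NAs, class)
classes_dat <- sapply(dat, class)

# Column classes of the result in each binding order
classes_forward  <- sapply(rbind(dat, NAs), class)
classes_backward <- sapply(rbind(NAs, dat), class)

print(classes_NAs)
print(classes_dat)
print(classes_forward)
print(classes_backward)
```

In the forward order the starting class of x is factor and it stays one. In the backward order the starting class is logical, so x does not come out as a factor.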

To get the variable classes of dat_forward with the ordering of dat_backward, you could just construct dat_forward and reorder the rows:

dat_new = rbind(dat, NAs)[c((nrow(dat)+1):(nrow(dat)+nrow(NAs)), 1:nrow(dat)),]
str(dat_new)
# 'data.frame': 20 obs. of  2 variables:
#  $ x: Factor w/ 3 levels "1","2","3": NA NA NA NA NA NA NA NA NA NA ...
#  $ y: num  NA NA NA NA NA NA NA NA NA NA ...

Another approach is to create NAs with the correct column types from the start, by indexing the rows of dat with NA:

NAs <- dat[NA,]

You can also make as many rows as desired:

num.rows <- 30
NAs <- dat[NA,][1:num.rows,]
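
To check that this approach addresses the original problem, the following sketch rebuilds dat, creates NAs by NA indexing, and binds in the backward order. The set.seed call, the classes_* names, and NAs_30 are additions for illustration. The last line shows an alternative way to get an arbitrary number of rows with an integer NA index; it is offered as a variant, not as the answer's own method.

```r
set.seed(42)  # assumption: added for reproducibility, not part of the question
x <- factor(sample(1:3, 10, TRUE))
y <- rnorm(10)
dat <- data.frame(x, y)

# Indexing rows with NA yields all-NA rows that keep each column's class
NAs <- dat[NA, ]
classes_NAs <- sapply(NAs, class)

# The backward order now starts from a factor column and a numeric column
dat_backward <- rbind(NAs, dat)
classes_backward <- sapply(dat_backward, class)

# Variant for an arbitrary row count: an integer NA index of that length
num.rows <- 30
NAs_30 <- dat[rep(NA_integer_, num.rows), ]

print(classes_NAs)
print(classes_backward)
print(dim(NAs_30))
```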
Licensed under: CC-BY-SA with attribution