Directly creating dummy variable set in a sparse matrix in R

https://stackoverflow.com//questions/23035982

21-12-2019
|

Question

Suppose you have a data frame with a high number of columns(1000 factors, each with 15 levels). You'd like to create a dummy variable data set, but since it would be too sparse, you would like to keep dummies in sparse matrix format.

My data set is quite big and the less steps there are, the better for me. I know how to do above steps; but I couldn't get my head around directly creating that sparse matrix from the initial data set, i.e. having one step instead of two. Any ideas?

EDIT: Some comments asked for further elaboration, so here it goes:

Where X is my original data set with 1000 columns and 50000 records, each column having 15 levels,

Step1: Creating dummy variables from the original data set with a code like;

# Creating dummy data set with empty values
dummified <- matrix(NA,nrow(X),15*ncol(X))
# Adding values to this data set for each column and each level within columns
for (i in 1:ncol(X)){colFactr <- factor(X[,i],exclude=NULL)
  for (j in 1:l){
    lvl <- levels(colFactr)[j]
    indx <- ((i-1)*l)+j
    dummified[,indx] <- ifelse(colFactr==lvl,1,0)
  }
}

Step2: Converting that huge matrix into a sparse matrix, with a code like;

sparse.dummified <- sparseMatrix(dummified)

But this approach still created this interim large matrix which takes a lot of time & memory, therefore I am asking the direct methodology (if there is any).

Solution

Thanks for having clarified your question, try this.

Here is sample data with two columns that have three and two levels respectively:

set.seed(123)
n <- 6
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE))
#   x y
# 1 A E
# 2 C E
# 3 B E
# 4 C D
# 5 C E
# 6 A D

library(Matrix)
spm <- lapply(df, function(j)sparseMatrix(i = seq_along(j),
                                          j = as.integer(j), x = 1))
do.call(cBind, spm)
# 6 x 5 sparse Matrix of class "dgCMatrix"
#               
# [1,] 1 . . . 1
# [2,] . . 1 . 1
# [3,] . 1 . . 1
# [4,] . . 1 1 .
# [5,] . . 1 . 1
# [6,] 1 . . 1 .

Edit: @user20650 pointed out do.call(cBind, ...) was sluggish or failing with large data. So here is a more complex but much faster and efficient approach:

n <- nrow(df)
nlevels <- sapply(df, nlevels)
i <- rep(seq_len(n), ncol(df))
j <- unlist(lapply(df, as.integer)) +
     rep(cumsum(c(0, head(nlevels, -1))), each = n)
x <- 1
sparseMatrix(i = i, j = j, x = x)

OTHER TIPS

This can be done slightly more compactly with Matrix:::sparse.model.matrix, although the requirement to have all columns for all variables makes things a little more difficult.

Generate input:

set.seed(123)
n <- 6
df <- data.frame(x = sample(c("A", "B", "C"), n, TRUE),
                 y = sample(c("D", "E"),      n, TRUE))

If you didn't need all columns for all variables you could just do:

library(Matrix)
sparse.model.matrix(~.-1,data=df)

If you need all columns:

fList <- lapply(names(df),reformulate,intercept=FALSE)
mList <- lapply(fList,sparse.model.matrix,data=df)
do.call(cBind,mList)

Adding comment as an answer as it seems a bit faster and more scalable (at least on my PC (ubuntu R3.1.0)

Matrix(model.matrix(~ -1 + . , data=df, 
         contrasts.arg = lapply(df, contrasts, contrasts=FALSE)),sparse=TRUE)

Test with larger data

library(Matrix)
library(microbenchmark)

set.seed(123)
df <- data.frame(replicate(200,sample(letters[1:15], 100, TRUE)))

ben <- function() {
  fList <- lapply(names(df),reformulate,intercept=FALSE)
  do.call(cBind,lapply(fList,sparse.model.matrix,data=df))
  }


flodel <- function(){
  do.call(cBind,lapply(df, function(j)sparseMatrix(i = seq_along(j),
                                        j = as.integer(j), x = 1)))    
   }


user <- function(){
  Matrix(model.matrix(~ -1 + . , data=df, 
                  contrasts.arg = lapply(df, contrasts, contrasts=FALSE)),
     sparse=TRUE)
   }


    microbenchmark(flodel(), flodel2(), ben(), user(),times=10)
# Unit: milliseconds
 #     expr        min         lq    median         uq        max neval
  # flodel() 1002.79714 1005.70631 1100.1874 1179.84403 1192.56583    10
  # flodel2()   16.62579   17.37707   18.5620   18.72137   19.19888    10
  #     ben() 1602.80193 1612.45177 1616.6684 1703.16246 1709.90557    10
  #    user()   96.80575   97.37132  101.9881  104.00750  195.87784    10

Edit adding in flodel's update - its clear - v. nice

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow