Counting algorithm for big data in R

https://stackoverflow.com/questions/18771903

28-06-2022

Question

I have a big data frame with almost 1M rows (transactions) and 2600 columns (items). The values in the data set are 1's and NA's, and every column is of type factor. I want to add a new column at the end of the data frame that holds the number of 1's in each row.

Here is the R code that I wrote:

for(i in 1:nrow(dataset)){
    counter<-0
    for(j in 1:ncol(dataset)){
        if(!is.na(dataset[i,j])){
            counter<- counter+1
        }
    }
    dataset[i,ncol(dataset)+1]<-counter
}

But it has been running for a very long time in RStudio, because the running time is O(n^2). Is there another way to do this, or a way to improve this algorithm? (The machine has 80 GB of memory.)

Solution 2

While eddi's answer is the best in your case, a more general solution is to vectorize the code (that is, operate on all rows at once):

counter <- rep(0, nrow(dataset))
for(j in 1:ncol(dataset)) {
     counter <- counter + !is.na(dataset[[j]])
}
dataset$no_of_1s <- counter

One note: in your code, the line

dataset[i,ncol(dataset)+1]<-counter

creates a new column for every row (because ncol(dataset) has grown by one at each step), so the final data.frame would have 1M rows and roughly 1M columns, which would not fit in your memory.
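A minimal sketch of the original double loop with that column problem removed, run on a small made-up data frame (the name toy and its contents are assumptions for illustration only). The result column is created once before the loop and the number of item columns is stored up front, so the column index no longer moves:

```r
# Toy stand-in for the real data: factor columns holding "1" or NA.
toy <- data.frame(
  a = factor(c("1", NA, "1")),
  b = factor(c(NA, NA, "1")),
  c = factor(c("1", NA, "1"))
)

n_items <- ncol(toy)   # number of item columns, fixed before the loop
toy$no_of_1s <- 0L     # create the result column exactly once

for (i in seq_len(nrow(toy))) {
  counter <- 0L
  for (j in seq_len(n_items)) {
    if (!is.na(toy[i, j])) counter <- counter + 1L
  }
  toy$no_of_1s[i] <- counter   # write into the existing column
}
```

This still visits every cell one at a time, so it is only meant to show the fix; on 1M rows the vectorized version above should be much faster.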

Another option is to use Reduce:

dataset$no_of_1s <- Reduce(function(a,b) a+!is.na(b), dataset, init=integer(nrow(dataset)))
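As a quick sanity check, the column-wise loop and the Reduce call can be compared on a small made-up data frame (toy is a hypothetical stand-in for dataset, not the real data). A sketch:

```r
# Toy stand-in for the real data: factor columns holding "1" or NA.
toy <- data.frame(
  a = factor(c("1", NA, "1")),
  b = factor(c(NA, NA, "1")),
  c = factor(c("1", NA, "1"))
)

# Column-wise loop: one vectorized pass over all rows per column.
loop_count <- rep(0, nrow(toy))
for (j in seq_len(ncol(toy))) {
  loop_count <- loop_count + !is.na(toy[[j]])
}

# Reduce: fold the same accumulation step over the columns.
reduce_count <- Reduce(function(a, b) a + !is.na(b), toy,
                       init = integer(nrow(toy)))

# Both approaches should give the same per-row counts.
stopifnot(all(loop_count == reduce_count))
```

Both versions do one pass per column, so the choice between them is mostly a matter of style.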

Other tips

Using a matrix (of numbers, not factors), as @joran suggested, would be better for this; then simply do:

rowSums(your_matrix, na.rm = TRUE)
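One possible way to get from a data frame of factors to such a numeric matrix, sketched on a small made-up data frame (toy, num_mat and row_counts are names chosen for illustration). Each column is converted through its labels with as.character first, because calling as.numeric directly on a factor returns the internal level codes rather than the values:

```r
# Toy stand-in for the real data: factor columns holding "1" or NA.
toy <- data.frame(
  a = factor(c("1", NA, "1")),
  b = factor(c(NA, NA, "1")),
  c = factor(c("1", NA, "1"))
)

# Convert each factor column via its labels; sapply binds the
# resulting numeric vectors into a matrix (one column per item).
num_mat <- sapply(toy, function(col) as.numeric(as.character(col)))

# NA entries are dropped, so each row sum is the number of 1's.
row_counts <- rowSums(num_mat, na.rm = TRUE)
```

On the full data set this conversion creates a second copy of the data as a numeric matrix, so memory use is worth checking before running it.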

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow