How to collapse very large sparse dataframes

https://stackoverflow.com/questions/21977746

15-10-2022

Problem

I want to sum about 10000 columns like colSparseX, per colID, over the 1500 sparse rows of a data frame.

I tried this on OriginalDataframe:

coldatfra <- aggregate(. ~ colID, datfra, sum)

and this:

coldatfra <- ddply(datfra, .(colID), numcolwise(sum))

but neither works. Suppose I have this input:

colID <- c(rep(1:6, 2), rep(1:2, 3))
colSparse1 <- c(rep(1,5), rep(0,4), rep(1,2), rep(0,5), rep(1,2))
cPlSpars2 <- c(rep(1,3), rep(0,6), rep(1,2), rep(0,5), rep(1,2))
coMSparse3 <- c(rep(1,6), rep(0,3), rep(1,2), rep(0,5), rep(1,2))
colSpArseN <- c(rep(1,2), rep(0,7), rep(1,2), rep(0,5), rep(1,2))

(datfra <- data.frame(colID, colSparse1, cPlSpars2, coMSparse3, colSpArseN))

colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
    1          1         1          1          1
    2          1         1          1          1
    3          1         1          1          0
    4          1         0          1          0
    5          1         0          1          0
    6          0         0          1          0
    1          0         0          0          0
    2          0         0          0          0
    3          0         0          0          0
    4          1         1          1          1
    5          1         1          1          1
    6          0         0          0          0
    1          0         0          0          0
    2          0         0          0          0
    1          0         0          0          0
    2          0         0          0          0
    1          1         1          1          1
    2          1         1          1          1

I want to sum the elements for each colID across all of the colSparse columns (about 10000 of them; the column names are highly variable words, so the solution needs some placeholder instead of a list of names) in order to get this:

colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
    1          2         2          2          2
    2          2         2          2          2
    3          1         1          1          0
    4          2         1          2          1
    5          2         1          2          1
    6          0         0          1          0

Note: str(OriginalDataframe) gives:

'data.frame':   1500 obs. of  10000 variables:
 $ someword                                                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ anotherword                                             : num  0 0 0 0 0 0 0 0 0 0 ...

On a smaller version of OriginalDataframe (which was terminated), ddply(datfra, .(colID), numcolwise(sum)) returns:

     colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1     0019          0         0          0          0
NA    <NA>         NA        NA         NA         NA
NA.1  <NA>         NA        NA         NA         NA
NA.2  <NA>         NA        NA         NA         NA
NA.3  <NA>         NA        NA         NA         NA

Solution

Take a look at my answer to this question: Mean per group in a data.frame

Your question is similar. If you change the function being applied from mean to sum, you get what you are looking for.

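library(data.table)
mydt <- as.data.table(datfra)    # convert the question's data frame to a data.table
colstosum <- names(mydt)[2:5]    # the four value columns of the sample data
mydt.sum <- mydt[, lapply(.SD, sum, na.rm = TRUE), by = colID, .SDcols = colstosum]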

mydt.sum
   colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1:     1          2         2          2          2
2:     2          2         2          2          2
3:     3          1         1          1          0
4:     4          2         1          2          1
5:     5          2         1          2          1
6:     6          0         0          1          0
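
Since the real data has roughly 10000 columns with unpredictable names, the column list can be built without naming any of them. The following is a minimal sketch, assuming the data.table package is installed and that colID is the only column that should not be summed. It rebuilds the question's sample data so it can run on its own; the names sample_dt, value_cols and all_sums are made up for illustration.

```r
library(data.table)

# Rebuild the sample data from the question (18 rows, 4 value columns).
sample_dt <- data.table(
  colID      = c(rep(1:6, 2), rep(1:2, 3)),
  colSparse1 = c(rep(1, 5), rep(0, 4), rep(1, 2), rep(0, 5), rep(1, 2)),
  cPlSpars2  = c(rep(1, 3), rep(0, 6), rep(1, 2), rep(0, 5), rep(1, 2)),
  coMSparse3 = c(rep(1, 6), rep(0, 3), rep(1, 2), rep(0, 5), rep(1, 2)),
  colSpArseN = c(rep(1, 2), rep(0, 7), rep(1, 2), rep(0, 5), rep(1, 2))
)

# Assumption: every column except the grouping key should be summed,
# so the names never have to be typed out.
value_cols <- setdiff(names(sample_dt), "colID")

# Same grouped sum as above, applied to whatever columns were found.
all_sums <- sample_dt[, lapply(.SD, sum, na.rm = TRUE),
                      by = colID, .SDcols = value_cols]
print(all_sums)
```

If the real table has other non-numeric columns besides the ID, value_cols would need to exclude those as well.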

Granted, I can't vouch for the speed of sum on a large data.table. It should also be possible to work colSums into the lapply call, but I can't figure out the syntax at the moment.
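
One possible way to bring colSums in is to drop lapply and wrap the colSums result in as.list, so that data.table treats each named sum as a column. This is only a sketch under the same assumptions as the sample data in the question, and not necessarily the syntax the answer had in mind. Base R's rowsum, a different technique that sums matrix rows sharing a group label, is shown alongside as a cross-check. The names sample_dt, value_cols, sums_colsums and sums_rowsum are hypothetical, and no timing claims are made for either variant.

```r
library(data.table)

# Rebuild the sample data from the question.
sample_dt <- data.table(
  colID      = c(rep(1:6, 2), rep(1:2, 3)),
  colSparse1 = c(rep(1, 5), rep(0, 4), rep(1, 2), rep(0, 5), rep(1, 2)),
  cPlSpars2  = c(rep(1, 3), rep(0, 6), rep(1, 2), rep(0, 5), rep(1, 2)),
  coMSparse3 = c(rep(1, 6), rep(0, 3), rep(1, 2), rep(0, 5), rep(1, 2)),
  colSpArseN = c(rep(1, 2), rep(0, 7), rep(1, 2), rep(0, 5), rep(1, 2))
)
value_cols <- setdiff(names(sample_dt), "colID")

# colSums returns a named numeric vector for each group;
# as.list turns it into one output column per name.
sums_colsums <- sample_dt[, as.list(colSums(.SD, na.rm = TRUE)),
                          by = colID, .SDcols = value_cols]

# Base R alternative: rowsum() on a plain numeric matrix.
# The result is a matrix whose row names are the sorted group labels.
value_matrix <- as.matrix(sample_dt[, value_cols, with = FALSE])
sums_rowsum  <- rowsum(value_matrix, group = sample_dt$colID)

print(sums_colsums)
print(sums_rowsum)
```

Both results should agree with the lapply version on the sample data; which one is faster on the full 1500 x 10000 table would need to be measured.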

License: CC-BY-SA with attribution

Not affiliated with StackOverflow