Finding the percentages of missing information in each column in parallel using bigmemory and parallel packages in R

StackOverflow https://stackoverflow.com/questions/23598404

Question

Here's what I want to do:

> library(parallel)
> library(bigmemory)
> big.mat=read.big.matrix("cp2006.csv",header=T)
Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("cp2006.csv", header = T) :
  Because type was not specified, we chose double based on the first line of data.
> jobs <- lapply(1:10, function(x) mcparallel(colMeans(is.na(big.mat))*100, name = big.mat))
Error in as.character.default(name) : 
  no method for coercing this S4 class to a vector
> res  <- mccollect(jobs)

However, the problem is that is.na apparently cannot be applied to big.matrix objects. I searched the web and found mwhich, the bigmemory analogue of which, but I couldn't find a good tutorial on using it to find the missing (NA) values in a column. So I am not sure what function I should pass to mcparallel to make it work with big.matrix objects. In addition:

> col.NA.mean<-colMeans(is.na(big.mat))*100
Error in colMeans(is.na(big.mat)) : 
  'x' must be an array of at least two dimensions
In addition: Warning message:
In is.na(big.mat) : is.na() applied to non-(list or vector) of type 'S4'

Solution 2

I found the answer. A big.matrix has to be indexed with [,] to extract its contents as a regular R matrix before is.na is applied, so here's the partial answer:

> colMeans(is.na(big.mat[,]))
             Year             Month        DayofMonth         DayOfWeek 
       0.00000000        0.00000000        0.00000000        0.00000000 
          DepTime        CRSDepTime           ArrTime        CRSArrTime 
       0.02102102        0.00000000        0.02402402        0.00000000 
    UniqueCarrier         FlightNum           TailNum ActualElapsedTime 
       1.00000000        0.00000000        0.97997998        0.02402402 
   CRSElapsedTime           AirTime          ArrDelay          DepDelay 
       0.00000000        0.02402402        0.02402402        0.02102102 
           Origin              Dest          Distance            TaxiIn 
       1.00000000        1.00000000        0.00000000        0.00000000 
          TaxiOut         Cancelled  CancellationCode          Diverted 
       0.00000000        0.00000000        1.00000000        0.00000000 
     CarrierDelay      WeatherDelay          NASDelay     SecurityDelay 
       0.00000000        0.00000000        0.00000000        0.00000000 
LateAircraftDelay 
       0.00000000 
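
One caveat with big.mat[,] is that it copies the whole matrix into ordinary R memory, which may not be possible for a file that is genuinely too large for RAM. A minimal sketch of an alternative, assuming it is acceptable to read one column at a time, is shown below. The small matrix x and its values are hypothetical stand-ins for the real data, not taken from cp2006.csv.

```r
library(bigmemory)

# Hypothetical stand-in for the real data: 10 rows, 2 columns.
x <- big.matrix(10, 2, type = "double", init = 1)
x[1, 1] <- NA      # 1 of 10 values missing in column 1
x[2:3, 2] <- NA    # 2 of 10 values missing in column 2

# Percentage of NAs per column. Each x[, j] extracts a single column as a
# regular R vector, so only nrow(x) values are held in R memory at a time.
na_pct <- vapply(
  seq_len(ncol(x)),
  function(j) mean(is.na(x[, j])) * 100,
  numeric(1)
)
na_pct
```

For a very tall matrix the same idea could be applied to blocks of rows instead of whole columns, at the cost of summing the NA counts manually.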

Here's the full answer. Passing the big.matrix object itself as name fails because mcparallel tries to coerce name to a character string. Passing big.mat[,] gets past that error, presumably because only the first value of the coerced matrix (2006) ends up as the label, which is why every result below is named `2006`:

> library(parallel)
> library(bigmemory)
> big.mat=read.big.matrix("cp2006.csv",header=T)
Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("cp2006.csv", header = T) :
  Because type was not specified, we chose double based on the first line of data.
> jobs <- lapply(1:10, function(x) mcparallel(colMeans(is.na(big.mat[,]))*100, name = big.mat))
Error in as.character.default(name) : 
  no method for coercing this S4 class to a vector
> jobs <- lapply(1:10, function(x) mcparallel(colMeans(is.na(big.mat[,]))*100, name = big.mat[,]))
> res  <- mccollect(jobs)
> res
$`2006`
             Year             Month        DayofMonth         DayOfWeek 
         0.000000          0.000000          0.000000          0.000000 
          DepTime        CRSDepTime           ArrTime        CRSArrTime 
         2.102102          0.000000          2.402402          0.000000 
    UniqueCarrier         FlightNum           TailNum ActualElapsedTime 
       100.000000          0.000000         97.997998          2.402402 
   CRSElapsedTime           AirTime          ArrDelay          DepDelay 
         0.000000          2.402402          2.402402          2.102102 
           Origin              Dest          Distance            TaxiIn 
       100.000000        100.000000          0.000000          0.000000 
          TaxiOut         Cancelled  CancellationCode          Diverted 
         0.000000          0.000000        100.000000          0.000000 
     CarrierDelay      WeatherDelay          NASDelay     SecurityDelay 
         0.000000          0.000000          0.000000          0.000000 
LateAircraftDelay 
         0.000000 

[The remaining nine list elements are also named `2006` and are identical to the first one, since every job ran the same computation.]
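
Note that the ten jobs above all compute the same thing, so nothing is actually divided between them. A rough sketch of how the work could be split instead, assuming a Unix-alike system (mcparallel relies on forking and is not available on Windows), is to give each job its own group of columns and a plain character label as name. The matrix x, the number of jobs, and the "cols_" labels are illustrative choices, not part of the original answer.

```r
library(parallel)
library(bigmemory)

# Hypothetical stand-in for the real data: 100 rows, 4 columns.
x <- big.matrix(100, 4, type = "double", init = 0)
x[1:5, 2] <- NA     # 5% missing in column 2
x[1:50, 4] <- NA    # 50% missing in column 4

# Split the column indices into one contiguous group per job.
n_jobs <- 2
col_ids <- seq_len(ncol(x))
col_groups <- split(col_ids, cut(col_ids, n_jobs, labels = FALSE))

# Start one forked job per group; name is a character label, not the data.
jobs <- lapply(seq_along(col_groups), function(i) {
  cols <- col_groups[[i]]
  mcparallel(
    vapply(cols, function(j) mean(is.na(x[, j])) * 100, numeric(1)),
    name = paste0("cols_", i)
  )
})

# Collect the per-group results and flatten them back into column order.
res <- mccollect(jobs)
na_pct <- unlist(res, use.names = FALSE)
na_pct
```

Whether this is faster than a single colMeans call depends on the data size and storage, since forking and collecting results carry their own overhead.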

Other answers

This is only a partial answer. is.na appears to work fine once the contents are extracted with [,].

library(bigmemory)

Some data, from examples in ?big.matrix

x <- big.matrix(10, 2, type='integer', init=-5)
options(bigmemory.allow.dimnames=TRUE)
colnames(x) <- c("alpha", "beta")
is.big.matrix(x)
dim(x)
colnames(x)
rownames(x)

Set some to missing

x[1,] <- NA
x[,]
#      alpha beta
# [1,]    NA   NA
# [2,]    -5   -5
# ...

is.na(x[,])
#       alpha  beta
# [1,]  TRUE  TRUE
# [2,] FALSE FALSE
# ...

y <- as.big.matrix(is.na(x[,]))
# Warning message:
# In as.big.matrix(is.na(x[, ])) : Casting to numeric type

is.big.matrix(y)
# [1] TRUE

y[,]
#      alpha beta
# [1,]     1    1
# [2,]     0    0
# [3,]     0    0
# [4,]     0    0
# [5,]     0    0
# [6,]     0    0
# [7,]     0    0
# [8,]     0    0
# [9,]     0    0
#[10,]     0    0

colMeans(y[,])
# alpha  beta 
#  0.1   0.1 

So I think you need to add [,] after big.mat.
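
Since the question also asks about mwhich, here is a tentative sketch of how it might be used to count missing values without extracting the data first. It relies on mwhich accepting NA as the comparison value together with the 'eq' comparison, as the examples in ?mwhich suggest. The matrix x below is a hypothetical stand-in for the real data.

```r
library(bigmemory)

# Hypothetical stand-in for the real data: 10 rows, 2 columns.
x <- big.matrix(10, 2, type = "double", init = -5)
x[1, 1] <- NA
x[2:3, 2] <- NA

# For each column, mwhich returns the row indices where the value is NA;
# the length of that result is the number of missing values.
na_count <- vapply(
  seq_len(ncol(x)),
  function(j) length(mwhich(x, j, NA, "eq", "AND")),
  integer(1)
)

# Convert the counts to percentages of the number of rows.
na_pct <- na_count / nrow(x) * 100
na_pct
```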

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow