Question

A public dataset contains a factor level (e.g., "(0) Omitted") that I would like to recode as NA. Ideally, I'd like to be able to scrub an entire subset at once. I'm using the data.table package and am wondering whether there is a better or faster way of accomplishing this than converting the values to character, replacing the unwanted value with NA, and then converting the data back to factors.

library(data.table)
DT <- data.table(V1=factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V2 = factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V3 = factor(sample(LETTERS,size = 2000000,replace=TRUE)))

# Convert to character
DT1 <- DT[, lapply(.SD, as.character)]
DT2 <- copy(DT1)
DT3 <- copy(DT) # Needs to be factor

# Scrub all 'B' values
DT1$V1[DT1$V1=="B"] <- NA
# Works!

DT2[V1 == "B", V1 := NA]
# Warning message:
#   In `[.data.table`(DT, V1 == "B", `:=`(V1, NA)) :
#   Coerced 'logical' RHS to 'character' to match the column's type. Either change the target column to 'logical' first (by creating a new 'logical' vector length 26 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

identical(DT1,DT2)
# [1] TRUE

# First attempt at looping over data.table
cnames <- colnames(DT3)
system.time(for(cname in cnames) {
  DT3[ ,
      # wrap the LHS in parentheses; with=FALSE alongside := is deprecated
      (cname) := gsub("B", NA, DT3[[cname]])]
})
# user  system elapsed 
# 4.258   0.128   4.478 

identical(DT1$V1,DT3$V1)
# [1] TRUE

# Back to factors
DT3 <- DT3[, lapply(.SD, as.factor)]

Solution

Set the factor level itself to NA. Every value with that level becomes NA and the level is dropped:

levels(DT$V1)[levels(DT$V1) == 'B'] <- NA

Example:

> d <- data.table(l=factor(LETTERS[1:3]))
> d
   l
1: A
2: B
3: C
> levels(d$l)[levels(d$l) == 'B'] <- NA
> d
    l
1:  A
2: NA
3:  C
> levels(d$l)
[1] "A" "C"
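
To scrub several columns at once, the same levels assignment can be wrapped in a loop. This is only a minimal sketch: the table `toy`, its columns, and the `cols` vector are made up for illustration, and it assumes every column named in `cols` is a factor. It uses `set()` to put each modified column back.

```r
library(data.table)

# Hypothetical toy table; both columns are factors containing "B"
toy <- data.table(V1 = factor(c("A", "B", "C")),
                  V2 = factor(c("B", "B", "D")))

cols <- names(toy)                      # or any subset of factor columns
for (col in cols) {
  x <- toy[[col]]
  levels(x)[levels(x) == "B"] <- NA     # drop the level; matching values become NA
  set(toy, j = col, value = x)          # assign the column back into the table
}

print(toy)
print(lapply(toy, levels))
```

The work is done on the levels (one entry per distinct value), not on the rows, so it should stay cheap even for long columns.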

OTHER TIPS

You can change the levels by reference for every column as follows:

for (j in seq_along(DT)) {
    x  = DT[[j]]
    lx = levels(x)
    lx[lx == "B"] = NA
    setattr(x, 'levels', lx)      ## reset levels by reference
}
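
One thing worth checking before relying on the by-reference version: as far as I can tell, `setattr()` writes the levels attribute directly, so NA ends up stored as a level while the underlying integer codes stay non-missing. The values print as NA, but `is.na()` does not flag them. A small sketch comparing the two approaches on standalone factors (the names `by_attr` and `by_levels` are just for illustration):

```r
library(data.table)

by_attr   <- factor(c("A", "B", "C"))
by_levels <- factor(c("A", "B", "C"))

# Raw attribute write, as in the loop above
lx <- levels(by_attr)
lx[lx == "B"] <- NA
setattr(by_attr, "levels", lx)

# Regular assignment, which goes through the factor method of `levels<-`
levels(by_levels)[levels(by_levels) == "B"] <- NA

print(is.na(by_attr))      # compare these two results
print(is.na(by_levels))
print(levels(by_attr))
print(levels(by_levels))
```

If downstream code depends on `is.na()` or `na.rm`, the regular `levels<-` assignment is probably the safer choice.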
Licensed under: CC-BY-SA with attribution