Question

How do you preserve factored variables in left joins using sqldf?

I am trying to perform a left join using the sqldf function in R; however, the process seems to convert a factored column in my "right" dataframe to a character class in the merged dataset.

I suspect that this is because the left join includes rows from the "left" dataframe for which there are no corresponding rows in the "right" dataframe, thus introducing NAs to the factored column.

I've created this reproducible example:

require(sqldf)
leftDF <- data.frame(A = sample(1:15, replace = FALSE), 
                     B = sample(letters, 15, replace = TRUE),
                     stringsAsFactors = FALSE)
str(leftDF)
rightDF <- data.frame(X = sample(1:5, 10, replace = TRUE),
                      Y = sample(letters, 10, replace = TRUE),
                      stringsAsFactors = TRUE)
str(rightDF)
mergedDF <- sqldf("SELECT l.A, l.B, r.Y 
                   FROM leftDF as l 
                   LEFT JOIN rightDF as r 
                   ON l.A = r.X")
str(mergedDF)

Is this the expected behavior of sqldf? The conversion of the factored variable to a character class may not be obvious to programmers until the variable doesn't behave the way they expect in later analyses.

I can preserve the factor by first adding an NA level to the factored column prior to the join using addNA(); however, adding NA as a level seems to be discouraged (see warning in ?addNA). Is there a better way of handling this?

Thanks in advance,

Jeff

An additional example to address comments:

require(sqldf)
leftDF <- data.frame(A = sample(1:15, replace = FALSE),
                     B = sample(letters, 15, replace = TRUE), 
                     stringsAsFactors = FALSE)
str(leftDF)
rightDF <- data.frame(X = sample(1:5, 10, replace = TRUE),
                      Y = sample(c("one","two","three","four","five","six"), 
                                 10, replace = TRUE), stringsAsFactors = FALSE)
rightDF$Y <- factor(rightDF$Y, levels = c("one","two","three","four","five","six"))
#rightDF$Y <- addNA(rightDF$Y)
table(rightDF$Y)
str(rightDF)
mergedDF <- sqldf("SELECT l.A, l.B, r.Y as Y__factor
                   FROM leftDF as l
                   LEFT JOIN rightDF as r
                   ON l.A = r.X")
str(mergedDF)
table(mergedDF$Y, useNA = c("always"))

Solution

This is FAQ #1 on the sqldf home page.

In this case the values of mergedDF$Y are not all among the levels of rightDF$Y (the unmatched rows contribute NA), so sqldf cannot reuse those levels and falls back to the "character" class.

One can use the method argument in a number of ways to specify the result. See ?sqldf.

Alternatively, fix up the column after the sqldf statement.
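A minimal sketch of that post-hoc fix, using base R only. The small mergedDF below is a hypothetical stand-in for what the LEFT JOIN returns (a character Y with NA for unmatched rows), so that the snippet runs without sqldf; the data values are made up for illustration.

```r
# Hypothetical "right" table whose Y column is a factor with known levels.
rightDF <- data.frame(X = c(1L, 2L, 3L),
                      Y = factor(c("one", "two", "three"),
                                 levels = c("one", "two", "three", "four")))

# Stand-in for the sqldf result: Y is character, NA where nothing matched.
mergedDF <- data.frame(A = 1:5,
                       Y = c("one", "two", "three", NA, NA),
                       stringsAsFactors = FALSE)

# Re-apply the original levels after the query.
# NA stays an ordinary missing value here; it is not added as a level.
mergedDF$Y <- factor(mergedDF$Y, levels = levels(rightDF$Y))

str(mergedDF)
```

This keeps the full level set from rightDF$Y, including levels that happen not to appear in the joined rows.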

Here is an example:

# use one of the next two lines or some further variation depending on what you want
meth <- function(x) replace(x, "Y", factor(x$Y, levels(rightDF$Y)))
meth <- function(x) replace(x, "Y", factor(x$Y, c(levels(rightDF$Y), NA), exclude=NULL))

mergedDF <- sqldf("SELECT l.A, l.B B, r.Y
                   FROM leftDF as l 
                   LEFT JOIN rightDF as r 
                   ON l.A = r.X", method = meth) ## note use of method=meth
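The two meth variants differ only in how they treat NA. The sketch below applies both factor() calls to a small made-up character vector (y and lev are illustrative names, not from the thread) so the difference is visible without running sqldf.

```r
# Illustrative input: a character column with one missing value.
y   <- c("one", "two", NA, "one")
lev <- c("one", "two", "three")

# Variant 1: NA remains a missing value.
f_plain <- factor(y, levels = lev)

# Variant 2: NA becomes an explicit level (same effect as addNA()).
f_na <- factor(y, levels = c(lev, NA), exclude = NULL)

levels(f_plain)  # three levels, no NA level
levels(f_na)     # four levels, the last one is NA

# Caveat: with NA as a level, is.na() no longer flags those entries,
# because their underlying integer code is a valid level index.
is.na(f_plain)
is.na(f_na)
```

Which variant is preferable depends on whether later code relies on is.na() or on NA showing up as its own category in tables and models.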

Other tips

I just had (and solved) a similar problem with a sqldf SELECT in R. All my variables kept their classes (factors stayed factors, characters stayed characters, and so on), except for one ordered factor, which came back as a character variable.

I checked, and that was my only variable with a missing value. I made the missing value a factor level, and the problem was solved: the variable kept its class after the sqldf SELECT. Hope it helps!
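One reading of "made the missing value become a factor" is adding NA as a level with addNA() before running the query. A short sketch under that assumption, with made-up data (the sizes vector is hypothetical):

```r
# Hypothetical ordered factor containing one missing value.
sizes <- factor(c("low", "high", NA, "mid"),
                levels = c("low", "mid", "high"), ordered = TRUE)

# Add NA as an explicit level before passing the data frame to sqldf.
sizes_na <- addNA(sizes)

levels(sizes_na)      # the original levels plus an NA level
is.ordered(sizes_na)  # check whether the ordering survived
```

Note the caveat raised in the question: ?addNA warns that is.na() will not report entries stored as an NA level, so this trades one surprise for another.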

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow