How do you preserve factored variables in left joins using sqldf?
I am trying to perform a left join using the sqldf function in R; however, the join converts a factor column in my "right" data frame to a character column in the merged dataset.
I suspect this happens because the left join keeps rows from the "left" data frame that have no corresponding rows in the "right" data frame, which introduces NAs into the factor column.
I've created this reproducible example:
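library(sqldf)

# Left table: A holds the integers 1..15 in random order; B is character.
leftDF <- data.frame(A = sample(1:15, replace = FALSE),
                     B = sample(letters, 15, replace = TRUE),
                     stringsAsFactors = FALSE)
str(leftDF)

# Right table: X only covers 1..5, so most rows of leftDF have no match; Y is a factor.
rightDF <- data.frame(X = sample(1:5, 10, replace = TRUE),
                      Y = sample(letters, 10, replace = TRUE),
                      stringsAsFactors = TRUE)
str(rightDF)

# Left join: unmatched rows of leftDF get NA in Y.
mergedDF <- sqldf("SELECT l.A, l.B, r.Y
                   FROM leftDF as l
                   LEFT JOIN rightDF as r
                   ON l.A = r.X")
str(mergedDF)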
Is this the expected behavior of sqldf? The conversion of the factor to character may not be obvious to programmers until the variable doesn't behave the way they expect in later analyses.
I can preserve the factor by adding an NA level to the factor column with addNA() before the join; however, adding NA as a level seems to be discouraged (see the warning in ?addNA). Is there a better way of handling this?
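For comparison, one alternative I can sketch is to let the join return whatever class it likes and rebuild the factor afterwards from the levels of the original column. This is only a sketch on the toy data from this question: it relies on base R's factor() after the sqldf() call, and the set.seed() call is my addition so the run is repeatable. Unmatched rows stay as ordinary NA values rather than becoming an NA level, but the conversion has to be repeated by hand for every factor column, which is why I am still asking whether sqldf can do this itself.

```r
library(sqldf)

set.seed(42)  # assumption: fixed seed, only to make the sketch repeatable

leftDF <- data.frame(A = sample(1:15, replace = FALSE),
                     B = sample(letters, 15, replace = TRUE),
                     stringsAsFactors = FALSE)
rightDF <- data.frame(X = sample(1:5, 10, replace = TRUE),
                      Y = sample(letters, 10, replace = TRUE),
                      stringsAsFactors = TRUE)

mergedDF <- sqldf("SELECT l.A, l.B, r.Y
                   FROM leftDF AS l
                   LEFT JOIN rightDF AS r
                   ON l.A = r.X")

# Rebuild the factor from the original levels.
# factor() accepts either a character vector or a factor here,
# and NA entries remain NA instead of becoming a level.
mergedDF$Y <- factor(mergedDF$Y, levels = levels(rightDF$Y))

str(mergedDF)
```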
Thanks in advance,
Jeff
An additional example to address comments:
library(sqldf)

leftDF <- data.frame(A = sample(1:15, replace = FALSE),
                     B = sample(letters, 15, replace = TRUE),
                     stringsAsFactors = FALSE)
str(leftDF)

# Build Y as character first, then convert it to a factor with explicit levels.
rightDF <- data.frame(X = sample(1:5, 10, replace = TRUE),
                      Y = sample(c("one","two","three","four","five","six"),
                                 10, replace = TRUE),
                      stringsAsFactors = FALSE)
rightDF$Y <- factor(rightDF$Y, levels = c("one","two","three","four","five","six"))
# Uncomment to add NA as a level before the join:
#rightDF$Y <- addNA(rightDF$Y)
table(rightDF$Y)
str(rightDF)

mergedDF <- sqldf("SELECT l.A, l.B, r.Y as Y__factor
                   FROM leftDF as l
                   LEFT JOIN rightDF as r
                   ON l.A = r.X")
str(mergedDF)
table(mergedDF$Y, useNA = "always")
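To check the suspicion that the NAs from unmatched rows are what trigger the conversion, the same query can be run as an INNER JOIN, which returns no unmatched rows, and the class of Y compared between the two results. This is a minimal sketch on the toy data above; the names innerDF and outerDF and the set.seed() call are my own additions. If the suspicion is right, I would expect the two class() calls to differ, but I have not confirmed that against the sqldf source.

```r
library(sqldf)

set.seed(42)  # assumption: fixed seed, only to make the sketch repeatable

leftDF <- data.frame(A = sample(1:15, replace = FALSE),
                     B = sample(letters, 15, replace = TRUE),
                     stringsAsFactors = FALSE)
rightDF <- data.frame(X = sample(1:5, 10, replace = TRUE),
                      Y = sample(letters, 10, replace = TRUE),
                      stringsAsFactors = TRUE)

# Inner join: every returned row has a match, so Y contains no NA.
innerDF <- sqldf("SELECT l.A, l.B, r.Y
                  FROM leftDF AS l
                  INNER JOIN rightDF AS r
                  ON l.A = r.X")

# Left join: rows of leftDF without a match get NA in Y.
outerDF <- sqldf("SELECT l.A, l.B, r.Y
                  FROM leftDF AS l
                  LEFT JOIN rightDF AS r
                  ON l.A = r.X")

# Compare NA presence and the resulting class of Y in each result.
anyNA(innerDF$Y); class(innerDF$Y)
anyNA(outerDF$Y); class(outerDF$Y)
```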