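library(tm)
library(Rstem)  # provides wordStem()

data(crude)

# Random split of roughly 70/30 into training and test documents
set.seed(1)
spl <- runif(length(crude)) < 0.7
train <- crude[spl]
test <- crude[!spl]

# Preprocessing options shared by both matrices
controls <- list(
  tolower = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords("english"),
  stemming = function(word) wordStem(word, language = "english")
)

# Build the training matrix, then drop terms that are too sparse
train_dtm <- DocumentTermMatrix(train, controls)
train_dtm <- removeSparseTerms(train_dtm, 0.8)

# Build the test matrix restricted to the training vocabulary
test_dtm <- DocumentTermMatrix(
  test,
  c(controls, dictionary = list(dimnames(train_dtm)$Terms))
)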
## train_dtm
## A document-term matrix (13 documents, 91 terms)
##
## Non-/sparse entries: 405/778
## Sparsity : 66%
## Maximal term length: 9
## Weighting : term frequency (tf)
## test_dtm
## A document-term matrix (7 documents, 91 terms)
##
## Non-/sparse entries: 149/488
## Sparsity : 77%
## Maximal term length: 9
## Weighting : term frequency (tf)
## all(dimnames(train_dtm)$Terms == dimnames(test_dtm)$Terms)
## [1] TRUE
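I had issues using the default stemmer, which is why the controls above pass a custom stemming function built on wordStem from Rstem. There is also a bounds option for controls, but I couldn't get the same results as removeSparseTerms when using it. I tried bounds = list(local = c(0.2 * length(train), Inf)), and also wrapped the lower bound in floor and in ceiling, with no luck.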
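One possible explanation, offered as a guess rather than a confirmed fix: as I read the tm documentation, a local bound filters on how often a term occurs within a single document, whereas removeSparseTerms filters on how many documents a term occurs in, which is what a global bound describes. The base R sketch below uses a made-up count matrix (the terms and counts are hypothetical) and assumes removeSparseTerms keeps terms whose document frequency is strictly greater than N * (1 - sparse). It shows the integer lower bound that rule implies, and that a per-document count threshold selects different terms.

```r
# Toy document-term count matrix (hypothetical data): 13 documents, 4 terms.
n_docs <- 13
sparse <- 0.8
counts <- cbind(
  oil    = rep(1, n_docs),            # in 13 documents, count 1 each
  price  = c(rep(2, 3), rep(0, 10)),  # in 3 documents, count 2 each
  barrel = c(rep(5, 2), rep(0, 11)),  # in 2 documents, count 5 each
  opec   = c(1, rep(0, 12))           # in 1 document
)

# Document frequency: number of documents that contain each term.
doc_freq <- colSums(counts > 0)

# Assumed removeSparseTerms rule: keep terms whose document frequency
# is strictly greater than n_docs * (1 - sparse).
keep_sparse <- doc_freq > n_docs * (1 - sparse)

# The same rule written as an inclusive integer lower bound on
# document frequency (a "global" style bound).
lower <- floor(n_docs * (1 - sparse)) + 1
keep_bound <- doc_freq >= lower

# A "local" style threshold looks at counts inside each document instead:
# here a term survives if any single document has at least `lower` occurrences.
keep_local <- apply(counts, 2, function(x) any(x >= lower))

print(rbind(keep_sparse, keep_bound, keep_local))
```

If that reading is right, the setting to try in controls would be bounds = list(global = c(lower, Inf)) rather than a local bound. I have not run this against the crude corpus, so treat it as a starting point.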