Removing an “empty” character item from a corpus of documents in R?
06-06-2021
Question
I am using the tm and lda packages in R to topic model a corpus of news articles. However, I am getting a "non-character" token, represented as "", that is messing up my topics. Here is my workflow:
text <- Corpus(VectorSource(d$text))
newtext <- lapply(text, tolower)
sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
newtext <- lapply(newtext, function(x) removePunctuation(x))
newtext <- lapply(newtext, function(x) removeWords(x, sw))
newtext <- lapply(newtext, function(x) removeNumbers(x))
newtext <- lapply(newtext, function(x) stripWhitespace(x))
d$processed <- unlist(newtext)
corpus <- lexicalize(d$processed)
k <- 40
result <- lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab, 500, .02, .05,
                                      compute.log.likelihood = TRUE, trace = 2L)
Unfortunately, when I train the LDA model, everything looks great except that the most frequently occurring word is "". I tried to remedy this by removing it from the vocabulary as shown below and re-estimating the model just as above:
newtext <- lapply(newtext, function(x) removeWords(x, ""))
But, it's still there, as evidenced by:
str_split(newtext[[1]], " ")
[[1]]
[1] "" "body" "mohamed" "hassan"
[5] "cook" "found" "turkish" "search"
[9] "rescue" "teams" "rescued" "hospital"
[13] "rescue" "teams" "continued" "search"
[17] "missing" "body" "cook" "crew"
[21] "wereegyptians" "sudanese" "syrians" "hassan"
[25] "cook" "cargo" "ship" "sea"
[29] "bright" "crashed" "thursday" "port"
[33] "antalya" "southern" "turkey" "vessel"
[37] "collided" "rocks" "port" "thursday"
[41] "night" "result" "heavy" "winds"
[45] "waves" "crew" ""
Any suggestions on how to go about removing this? Adding ""
to my list of stopwords doesn't help, either.
Solution
I deal with text a lot, but not with tm, so here are two approaches to get rid of the "" tokens. The extra "" entries are most likely caused by double spaces between sentences (or left behind when a word is removed). You can treat this condition either before or after you turn the text into a bag of words: replace every run of spaces with a single space before the strsplit, or drop the empty strings afterward (you have to unlist the result of strsplit first).
x <- "I like to ride my bicycle.  Do you like to ride too?"  # note the double space

# TREAT BEFORE (OPTION 1): collapse runs of spaces, then split
a <- gsub(" +", " ", x)
strsplit(a, " ")

# TREAT AFTER (OPTION 2): split first, then drop the empty strings
y <- unlist(strsplit(x, " "))
y[!y %in% ""]
You might also try:
newtext <- lapply(newtext, function(x) gsub(" +", " ", x))
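One caveat: gsub(" +", " ", x) collapses internal runs of spaces but leaves a single leading or trailing space alone, and a leading space is exactly what produces the "" at position [1] in the question's output. Wrapping the result in base R's trimws() handles both ends as well. A small sketch (the sample string is made up for illustration):

```r
x <- " body  mohamed hassan "          # leading, double, and trailing spaces
cleaned <- trimws(gsub(" +", " ", x))  # collapse runs, then strip the ends
strsplit(cleaned, " ")[[1]]            # splits cleanly, no empty strings left
```

The same idea drops into the question's pipeline as lapply(newtext, function(x) trimws(gsub(" +", " ", x))).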
Again, I don't use tm, so this may not be of help, but this post hadn't seen any action, so I figured I'd share some possibilities.
Other tips
If you already have the corpus set up, try using the document length as a filter by attaching it to meta() as a tag and then creating a new corpus.
dtm <- DocumentTermMatrix(corpus)
## terms per document
doc.length <- rowSums(as.matrix(dtm))
## add length as a document meta tag
meta(corpus, tag = "Length") <- doc.length
## create new corpus containing only non-empty documents
corpus.noEmptyDocs <- tm_filter(corpus, FUN = sFilter, "Length > 0")
## remove Length as meta tag
meta(corpus, tag = "Length") <- NULL
With the above method, you can efficiently reuse the existing matrix-manipulation support in tm with only five lines of code.
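Alternatively, since the question ultimately feeds d$processed (a plain character vector, one element per document) into lexicalize(), empty documents can also be dropped with base R alone, no DocumentTermMatrix needed. A minimal sketch, where docs stands in for d$processed:

```r
# Stand-in for d$processed: one element per document (made-up sample data)
docs <- c("body mohamed hassan", "", "   ", "rescue teams")

# Keep only documents that still contain non-whitespace text
keep <- nzchar(trimws(docs))
docs[keep]
```

Here nzchar() tests for non-empty strings and trimws() ensures that whitespace-only documents are dropped too; the filtered vector can then be passed to lexicalize() as before.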