Removing an “empty” character item from a corpus of documents in R?
06-06-2021
Question
I am using the tm and lda packages in R to topic model a corpus of news articles. However, I am getting a "non-character" token, represented as "", that is messing up my topics. Here is my workflow:
text <- Corpus(VectorSource(d$text))
newtext <- lapply(text, tolower)
sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
newtext <- lapply(newtext, function(x) removePunctuation(x))
newtext <- lapply(newtext, function(x) removeWords(x, sw))
newtext <- lapply(newtext, function(x) removeNumbers(x))
newtext <- lapply(newtext, function(x) stripWhitespace(x))
d$processed <- unlist(newtext)
corpus <- lexicalize(d$processed)
k <- 40
result <- lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab, 500, .02, .05,
                                      compute.log.likelihood = TRUE, trace = 2L)
Unfortunately, when I train the LDA model, everything looks great except that the most frequently occurring word is "". I tried to remedy this by removing it from the vocabulary as shown below and re-estimating the model just as above:
newtext <- lapply(newtext, function(x) removeWords(x, ""))
But, it's still there, as evidenced by:
str_split(newtext[[1]], " ")
[[1]]
[1] "" "body" "mohamed" "hassan"
[5] "cook" "found" "turkish" "search"
[9] "rescue" "teams" "rescued" "hospital"
[13] "rescue" "teams" "continued" "search"
[17] "missing" "body" "cook" "crew"
[21] "wereegyptians" "sudanese" "syrians" "hassan"
[25] "cook" "cargo" "ship" "sea"
[29] "bright" "crashed" "thursday" "port"
[33] "antalya" "southern" "turkey" "vessel"
[37] "collided" "rocks" "port" "thursday"
[41] "night" "result" "heavy" "winds"
[45] "waves" "crew" ""
Any suggestions on how to go about removing this? Adding ""
to my list of stopwords doesn't help, either.
Solution
I deal with text a lot, but not with tm, so here are two approaches to get rid of the "" tokens. The extra "" entries are most likely caused by double spaces between sentences (or left behind when a word is removed). You can treat this condition either before or after you turn the text into a bag of words: replace every run of spaces with a single space before the strsplit, or drop the empty strings afterward (you have to unlist the result of strsplit first).
x <- "I like to ride my bicycle.  Do you like to ride too?"  # note the double space

# TREAT BEFORE (OPTION 1): collapse runs of spaces, then split
a <- gsub(" +", " ", x)
strsplit(a, " ")

# TREAT AFTER (OPTION 2): split first, then drop the empty strings
y <- unlist(strsplit(x, " "))
y[!y %in% ""]
You might also try:
newtext <- lapply(newtext, function(x) gsub(" +", " ", x))
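One caveat: gsub(" +", " ", x) collapses internal runs of spaces but leaves a single leading or trailing space alone, and a leading space is exactly what produces the "" at position [1] in the question's output. Wrapping the result in base R's trimws() handles both ends as well. A small sketch (the sample string is made up for illustration):

```r
x <- " body  mohamed hassan "          # leading, double, and trailing spaces
cleaned <- trimws(gsub(" +", " ", x))  # collapse runs, then strip the ends
strsplit(cleaned, " ")[[1]]            # splits cleanly, no empty strings left
```

The same idea drops into the question's pipeline as lapply(newtext, function(x) trimws(gsub(" +", " ", x))).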
Again, I don't use tm, so this may not be of help, but this post hadn't seen any action, so I figured I'd share some possibilities.
Other tips
If you already have the corpus set up, try using the document length as a filter by attaching it to meta() as a tag and then creating a new corpus.
dtm <- DocumentTermMatrix(corpus)
## terms per document
doc.length <- rowSums(as.matrix(dtm))
## add length as a document meta tag
meta(corpus, tag = "Length") <- doc.length
## create new corpus containing only non-empty documents
corpus.noEmptyDocs <- tm_filter(corpus, FUN = sFilter, "Length > 0")
## remove Length as meta tag
meta(corpus, tag = "Length") <- NULL
With the above method, you can efficiently reuse the existing matrix-manipulation support in tm with only five lines of code.
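Alternatively, since the question ultimately feeds d$processed (a plain character vector, one element per document) into lexicalize(), empty documents can also be dropped with base R alone, no DocumentTermMatrix needed. A minimal sketch, where docs stands in for d$processed:

```r
# Stand-in for d$processed: one element per document (made-up sample data)
docs <- c("body mohamed hassan", "", "   ", "rescue teams")

# Keep only documents that still contain non-whitespace text
keep <- nzchar(trimws(docs))
docs[keep]
```

Here nzchar() tests for non-empty strings and trimws() ensures that whitespace-only documents are dropped too; the filtered vector can then be passed to lexicalize() as before.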