Looking for tweet and text-message style stopwords

https://stackoverflow.com/questions/13558703

02-12-2021

Question

I used R to mine a set of tweets and extracted the most frequent words. However, the most frequent words look like this:

 [1] "cant"     "dont"     "girl"     "gonna"    "lol"      "love"    
 [7] "que"      "thats"    "watching" "wish"     "youre"

I am looking for trends, names, and events in the text. Is there a way to remove these text-message style words (such as gonna, wanna, ...) from the corpus? Is there a stopword list for them? Any help would be appreciated.

Solution

The tm (text mining) package maintains its own list of stopwords and provides useful tools for managing and summarizing this type of text.

Let's say your tweets are stored in a vector.

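library(tm)
words <- vector_of_strings  # character vector, one tweet per element
corpus <- Corpus(VectorSource(words))
# Lowercase first; recent versions of tm need content_transformer() for base functions
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords before punctuation, so entries such as "don't" still match
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)

You can pass your own list to removeWords() in place of stopwords("english"), for example one that extends the built-in list: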

stoppers <- c(stopwords(), "gonna", "wanna", "lol", ... )
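
As a minimal sketch of how such an extended list might be applied, the example below runs tm's removeWords() and removePunctuation() directly on a character vector. The sample tweets and the slang vector are invented for illustration. The apostrophe-free variants are an assumption on my part: tweets often drop apostrophes ("dont", "cant"), and the built-in entries such as "don't" would not match those spellings.

```r
library(tm)

# Hypothetical sample tweets, invented for illustration only
tweets <- c("Gonna watch the game lol",
            "Dont wanna miss the concert tonight!")

# Hand-picked slang terms; extend this vector for your own data
slang <- c("gonna", "wanna", "lol")

# Add apostrophe-free variants of the built-in list ("don't" -> "dont")
base_stops <- stopwords("en")
stoppers <- unique(c(base_stops,
                     gsub("'", "", base_stops, fixed = TRUE),
                     slang))

clean <- tolower(tweets)
clean <- removeWords(clean, stoppers)      # drop stopwords and slang
clean <- removePunctuation(clean)          # then strip punctuation
clean <- trimws(gsub("\\s+", " ", clean))  # collapse leftover whitespace
clean
```

Stripping apostrophes also produces ordinary words such as "well" (from "we'll") and "shell" (from "she'll"), so it is worth reviewing the list if those terms matter in your data.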

Unfortunately, you'll have to generate your own list of "text messaging" or "internet messaging" stopwords.

But you could cheat a bit by borrowing from NetLingo (http://vps.netlingo.com/acronyms.php):

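library(XML)
theurl <- "http://vps.netlingo.com/acronyms.php"
h <- htmlParse(theurl)                  # download and parse the page
h <- getNodeSet(h, "//ul/li/span//a")   # select the anchor nodes in the list
stoppers <- sapply(h, xmlValue)         # extract the text of each anchor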
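
A scraped list like this may contain stray whitespace, mixed case, duplicates, or entries with symbols. As far as I can tell, removeWords() pastes the words into a regular expression, so entries containing characters such as "?" or "<" could misbehave. The sketch below shows one way to tidy the list first. clean_stoppers() is a hypothetical helper, not part of tm or XML, and the sample vector is invented to stand in for the scraped result.

```r
# Hypothetical helper: normalise a scraped term list before using it
# as a stopword list.
clean_stoppers <- function(terms) {
  terms <- tolower(trimws(terms))
  # Keep purely alphanumeric entries; drops "", "<3", "?4u" and similar
  terms <- terms[grepl("^[a-z0-9]+$", terms)]
  unique(terms)
}

# Invented sample standing in for the scraped `stoppers` vector
scraped <- c(" LOL ", "brb", "BRB", "<3", "?4U", "", "2moro")
clean_stoppers(scraped)
```

The cleaned vector can then be combined with stopwords() using c(), in the same way as the hand-written list.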

Licensed under: CC-BY-SA with attribution
