Question

With the tm package I'm able to do it like this:

c0 <- Corpus(VectorSource(text))
c0 <- tm_map(c0, removeWords, c(stopwords("english"),mystopwords))

mystopwords being a vector of the additional stopwords I want to remove.

But I can't find an equivalent way to do it using the RTextTools package. For example:

dtm <- create_matrix(text,language="english",
             removePunctuation=T,
             stripWhitespace=T,
             toLower=T,
             removeStopwords=T, #no clear way to specify a custom list here!
             stemWords=T)

Is it possible to do this? I really like the RTextTools interface and it would be a pity to have to move back to tm.


Solution

There are three (or possibly even more) solutions to your problem:

First, use the tm package only for removing the words. Both packages work with the same objects, so you can remove the words with tm and then continue with the RTextTools package. If you look inside the create_matrix function, you will see that it uses tm functions itself.
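A minimal sketch of this first approach, assuming both tm and RTextTools are installed. The sample text and the contents of mystopwords are made up for illustration. It relies on removeWords from tm also accepting a plain character vector, and it lowercases the text first because removeWords matches case-sensitively. stemWords is left off here only to avoid needing the stemmer package.

```r
library(tm)
library(RTextTools)

# hypothetical sample data, for illustration only
text <- c("The quick brown fox jumps over the lazy dog",
          "A lazy afternoon with the brown dog")
mystopwords <- c("lazy", "brown")

# strip the custom stopwords from the raw text with tm;
# lowercase first because removeWords is case-sensitive
cleaned <- removeWords(tolower(text), mystopwords)

# then build the matrix with RTextTools as usual
dtm <- create_matrix(cleaned, language = "english",
                     removePunctuation = TRUE,
                     stripWhitespace = TRUE,
                     toLower = TRUE,
                     removeStopwords = TRUE,
                     stemWords = FALSE)

# inspect the remaining terms
colnames(dtm)
```

The standard English stopwords are still handled by create_matrix itself, so only the additional words need to be removed beforehand.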

Second, modify the create_matrix function. For example, add an input parameter such as own_stopwords=NULL and insert the following lines into the function body:

# existing line
corpus <- Corpus(VectorSource(trainingColumn), 
                     readerControl = list(language = language))
# after that add this new line
if(!is.null(own_stopwords)) corpus <- tm_map(corpus, removeWords, 
                                          words=as.character(own_stopwords))

Third, write your own function, something like this:

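# excluder function: drop the columns of a document-term matrix
# whose terms appear in the custom stopword vector
remove_my_stopwords <- function(own_stw, dtm) {
  # mark the terms that are NOT custom stopwords; selecting by positive
  # index also leaves dtm intact when none of the stopwords occur in it
  # (negative indexing with an empty index would drop every column)
  keep <- !(colnames(dtm) %in% own_stw)
  return(dtm[, which(keep)])
}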

Let's check that it works:

# let's test it
data(NYTimes)
data <- NYTimes[sample(1:3100, size=10,replace=FALSE),]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]))

head(colnames(matrix), 5)
# [1] "109"         "200th"       "abc"         "amid"        "anniversary"


# use the first five terms shown above as our "own" stopwords
ostw <- head(colnames(matrix), 5)

matrix2<-remove_my_stopwords(own_stw=ostw, dtm=matrix)

# check if they are still there
sapply(ostw, function(x, words) any(x==words), words=colnames(matrix2))
#109       200th         abc        amid anniversary 
#FALSE       FALSE       FALSE       FALSE       FALSE 

HTH

Other tips

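You can add your stop words to the same vector, either as a variable (as in the question) or as quoted strings. Note that a quoted string is taken literally, so the example below removes the English stopwords plus the word "mystopwords" itself; replace it with the words you actually want to remove: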

c0 <- tm_map(c0, removeWords, c(stopwords("english"),"mystopwords"))

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow