Get the number of character vector elements in a corpus

https://stackoverflow.com//questions/22017512

21-12-2019

Question

My goal is to use R for lexicon-based sentiment analysis.

I have two character vectors, one with positive words and one with negative words, e.g.:

pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

I now have a corpus of thousands of news articles, and for each article I want to know how many elements of my vectors pos and neg it contains.

E.g. (I'm not sure exactly how the Corpus function works here, but you get the idea: there are two articles in my corpus):

mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))

I want to get something like this:

article 1: 2 elements of pos, 0 elements of neg
article 2: 0 elements of pos, 2 elements of neg

It would also be useful to get the following for each article:

(number of pos words - number of neg words) / (number of total words in article)

Thank you very much!

EDIT:

@Victorp: this doesn't seem to work.

The matrix I get looks good:

mytdm[1:6,1:10]
               Docs
Terms          1 2 3 4 5 6 7 8 9 10
aaron          0 0 0 0 0 1 0 0 0  0
abandon        1 1 0 0 0 0 0 0 0  0
abandoned      0 0 0 3 0 0 0 0 0  0
abbey          0 0 0 0 0 0 0 0 0  0
abbott         0 0 0 0 0 0 0 0 0  0
abbotts        0 0 1 0 0 0 0 0 0  0

But when I run your command, I get zero for every document:

colSums(mytdm[rownames(mytdm) %in% pos, ])
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
  16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Why is that?

Solution

Hello, you can use a TermDocumentMatrix for that:

library(tm)

mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))
mytdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE))
mytdm <- as.matrix(mytdm)

# Positive words (drop = FALSE keeps a matrix even if only one term matches)
colSums(mytdm[rownames(mytdm) %in% pos, , drop = FALSE])
## 1 2 
## 2 0 

# Negative words
colSums(mytdm[rownames(mytdm) %in% neg, , drop = FALSE])
## 1 2 
## 0 2 

# Total number of words per document
# (by default tm drops words shorter than three characters, so "is" and "a" are not counted)
colSums(mytdm)
## 1 2 
## 9 5
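
The question also asked for (number of pos words - number of neg words)/(number of total words in article). A minimal sketch of that last step, assuming the counts come from a term-by-document matrix like the one above. To keep the snippet runnable without tm, the matrix m is typed in by hand as a stand-in for as.matrix(mytdm); the names m, pos_n, neg_n, total and score are illustrative only:

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

# Hand-typed stand-in for as.matrix(mytdm): rows are terms, columns are documents.
terms <- c("case", "caused", "ceo", "disaster", "finally", "happy",
           "huge", "loss", "that", "the", "they", "won")
m <- matrix(c(1, 0, 1, 0, 1, 1, 0, 0, 1, 2, 1, 1,   # document 1
              0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0),  # document 2
            ncol = 2, dimnames = list(terms, c("1", "2")))

# drop = FALSE keeps a matrix even when only one (or no) term matches,
# so colSums() still works.
pos_n <- colSums(m[rownames(m) %in% pos, , drop = FALSE])
neg_n <- colSums(m[rownames(m) %in% neg, , drop = FALSE])
total <- colSums(m)

# Sentiment score per document: document 1 is 2/9, document 2 is -2/5.
score <- (pos_n - neg_n) / total
score
```

Note that total only counts the terms kept in the matrix, so the denominators here are 9 and 5 rather than the raw word counts of the two sentences.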

OTHER TIPS

Here's another approach:

## pos <- c("good", "accomplished", "won", "happy")
## neg <- c("bad", "loss", "damaged", "sued", "disaster")
## 
## mycorpus <- Corpus(VectorSource(
##     list("The CEO is happy that they finally won the case.", 
##     "The disaster caused a huge loss.")))

library(qdap)
with(tm_corpus2df(mycorpus), termco(text, docs, list(pos=pos, neg=neg)))

##   docs word.count       pos       neg
## 1    1         10 2(20.00%)         0
## 2    2          6         0 2(33.33%)
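
For comparison, the same counts can be sketched in base R with no extra packages. This is not taken from either answer above; the helper name lexicon_score and its tokenisation (lower-casing, stripping punctuation, splitting on whitespace) are assumptions made for illustration:

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

articles <- c("The CEO is happy that they finally won the case.",
              "The disaster caused a huge loss.")

# Hypothetical helper: count lexicon hits per article and compute
# (pos - neg) / total words.
lexicon_score <- function(texts, pos, neg) {
  # Lower-case, strip punctuation, trim, then split on whitespace.
  cleaned <- trimws(gsub("[[:punct:]]", "", tolower(texts)))
  tokens <- strsplit(cleaned, "\\s+")
  pos_n <- vapply(tokens, function(w) sum(w %in% pos), numeric(1))
  neg_n <- vapply(tokens, function(w) sum(w %in% neg), numeric(1))
  total <- lengths(tokens)
  data.frame(doc = seq_along(texts), words = total,
             pos = pos_n, neg = neg_n,
             score = (pos_n - neg_n) / total)
}

res <- lexicon_score(articles, pos, neg)
res
```

On the two example articles this gives word counts of 10 and 6, the same as the word.count column above, because no short words are dropped. Each occurrence of a lexicon word is counted, so a word repeated in an article contributes more than once.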

Licensed under: CC-BY-SA with attribution
