Get the number of character vector elements in a corpus

https://stackoverflow.com//questions/22017512

21-12-2019

Question

My goal is to use R for lexicon-based sentiment analysis.

I have two character vectors, one with positive words and one with negative words, e.g.:

pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

I now have a corpus of thousands of news articles, and I want to know, for each article, how many elements of my vectors pos and neg appear in it.

For example, with two articles in my corpus (I'm not sure exactly how the Corpus function works, but you get the idea):

mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))

I want to get something like this:

article 1: 2 elements of pos, 0 elements of neg
article 2: 0 elements of pos, 2 elements of neg

It would also be good if I could get the following for each article:

(number of pos words - number of neg words)/(number of total words in article)

Thank you very much!

EDIT:

@Victorp: this doesn't seem to work.

The matrix I get looks good:

mytdm[1:6,1:10]
               Docs
Terms          1 2 3 4 5 6 7 8 9 10
aaron          0 0 0 0 0 1 0 0 0  0
abandon        1 1 0 0 0 0 0 0 0  0
abandoned      0 0 0 3 0 0 0 0 0  0
abbey          0 0 0 0 0 0 0 0 0  0
abbott         0 0 0 0 0 0 0 0 0  0
abbotts        0 0 1 0 0 0 0 0 0  0

But when I run your command, I get zero for every document:

colSums(mytdm[rownames(mytdm) %in% pos, ])
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
  16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Why is that?

Solution

Hello, you can use a TermDocumentMatrix for that:

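library(tm)

pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))
mytdm <- TermDocumentMatrix(mycorpus, control=list(removePunctuation=TRUE))
mytdm <- as.matrix(mytdm)

# Positive words per document
# (drop=FALSE keeps the subset a matrix even if only one term matches)
colSums(mytdm[rownames(mytdm) %in% pos, , drop=FALSE])
## 1 2 
## 2 0 

# Negative words per document
colSums(mytdm[rownames(mytdm) %in% neg, , drop=FALSE])
## 1 2 
## 0 2 

# Total number of counted terms per document
# (tm's default wordLengths drops terms shorter than 3 characters,
#  so "is" and "a" are not counted here)
colSums(mytdm)
## 1 2 
## 9 5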
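
The question also asks for (number of pos words - number of neg words)/(number of total words in article). Here is a minimal base R sketch of that ratio. It does not use tm: `tokenize` and `score_article` are hypothetical helpers written for this illustration, and the tokenization (lowercase, strip punctuation, split on whitespace) is an assumption, so the totals include short words such as "is" and "a" and differ from the term-document matrix totals above.

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

articles <- c("The CEO is happy that they finally won the case.",
              "The disaster caused a huge loss.")

# Hypothetical helper: lowercase, strip punctuation, split on whitespace.
tokenize <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]  # drop empty strings from leading whitespace
}

# Hypothetical helper: counts and (pos - neg) / total for one article.
score_article <- function(text, pos, neg) {
  tokens <- tokenize(text)
  n_pos  <- sum(tokens %in% pos)
  n_neg  <- sum(tokens %in% neg)
  total  <- length(tokens)
  # Avoid 0/0 for an empty article
  score  <- if (total == 0) NA_real_ else (n_pos - n_neg) / total
  c(pos = n_pos, neg = n_neg, total = total, score = score)
}

# One row per article, one column per measure
scores <- t(sapply(articles, score_article, pos = pos, neg = neg,
                   USE.NAMES = FALSE))
rownames(scores) <- seq_along(articles)
scores
```

With the two example articles this gives a score of 0.2 for article 1 (2 positive words out of 10) and about -0.33 for article 2 (2 negative words out of 6).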

Other tips

Here's another approach:

## pos <- c("good", "accomplished", "won", "happy")
## neg <- c("bad", "loss", "damaged", "sued", "disaster")
## 
## mycorpus <- Corpus(VectorSource(
##     list("The CEO is happy that they finally won the case.", 
##     "The disaster caused a huge loss.")))

library(qdap)
with(tm_corpus2df(mycorpus), termco(text, docs, list(pos=pos, neg=neg)))

##   docs word.count       pos       neg
## 1    1         10 2(20.00%)         0
## 2    2          6         0 2(33.33%)
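
If colSums returns zero for every document, as in the question's edit, one thing worth checking is whether any lexicon word actually matches a row name of the matrix. The sketch below is only an illustration, not a diagnosis of that case: `tdm_demo` is a small hand-built stand-in for `as.matrix(mytdm)`, and the stray whitespace and capital letters in `pos_raw` are an assumed example of how a lexicon could fail to match.

```r
# Hand-built stand-in for as.matrix(mytdm): terms in rows, documents in columns
tdm_demo <- matrix(c(1, 0,
                     1, 0,
                     0, 1),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Terms = c("happy", "won", "loss"),
                                   Docs  = c("1", "2")))

# Assumed example of a lexicon that does not match: capitals and stray spaces
pos_raw <- c("Good ", " won", "Happy")

# No row name matches, so every column sums to zero
before <- colSums(tdm_demo[rownames(tdm_demo) %in% pos_raw, , drop = FALSE])

# Inspect the overlap directly: character(0) means nothing matches
intersect(rownames(tdm_demo), pos_raw)

# Normalise the lexicon the same way the terms were normalised
pos_clean <- trimws(tolower(pos_raw))
after <- colSums(tdm_demo[rownames(tdm_demo) %in% pos_clean, , drop = FALSE])

before
after
```

Running `intersect(rownames(mytdm), pos)` on the real matrix shows quickly whether the lexicon and the terms overlap at all.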

License: CC-BY-SA with attribution

Not affiliated with StackOverflow