Question

I want to do sentiment analysis of German tweets. The code I use works fine with English, but when I load the German word list, all scores come out as zero. As far as I can tell, this must have to do with the different structures of the word lists. So what I need to know is how to adapt my code to the structure of the German word list. Could someone take a look at both of the lists?

English Wordlist
German Wordlist

    # load the wordlists
    pos.words = scan("~/positive-words.txt",what='character', comment.char=';')
    neg.words = scan("~/negative-words.txt",what='character', comment.char=';')

    # bring in the sentiment analysis algorithm
    # we got a vector of sentences. plyr will handle a list or a vector as an "l"
    # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
    score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
    {
      require(plyr)
      require(stringr)
      scores = laply(sentences, function(sentence, pos.words, neg.words)
      {
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        return(score)
      },
      pos.words, neg.words, .progress = .progress)
      scores.df = data.frame(score = scores, text = sentences)
      return(scores.df)
    }

    # and to see if it works, there should be a score...either in German or in English
    sample = c("ich liebe dich. du bist wunderbar","I hate you. Die!");sample
    test.sample = score.sentiment(sample, pos.words, neg.words);test.sample

Solution

This may work for you:

readAndflattenSentiWS <- function(filename) { 
  words = readLines(filename, encoding="UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
pos.words <- c(scan("positive-words.txt",what='character', comment.char=';', quiet=T), 
               readAndflattenSentiWS("SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("negative-words.txt",what='character', comment.char=';', quiet=T), 
              readAndflattenSentiWS("SentiWS_v1.8c_Negative.txt"))

score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
  # ... see OP ...
}

sample <- c("ich liebe dich. du bist wunderbar",
            "Ich hasse dich, geh sterben!", 
            "i love you. you are wonderful.",
            "i hate you, die.")
(test.sample <- score.sentiment(sample, 
                                pos.words, 
                                neg.words))
#   score                              text
# 1     2 ich liebe dich. du bist wunderbar
# 2    -2      ich hasse dich, geh sterben!
# 3     2    i love you. you are wonderful.
# 4    -2                  i hate you, die.
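
To see what readAndflattenSentiWS does per line, here is a minimal check of the flattening step on a single SentiWS entry (the same Abbau entry shown in the format comparison below); the literal string is hard-coded purely for illustration:

# one raw SentiWS entry: lemma|POS <tab> weight <tab> inflected forms
x <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"
# strip the "|NN<tab>-0.058<tab>" part, leaving a comma-separated word list
x <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", x)
# split on commas and lower-case, exactly as readAndflattenSentiWS does for every line
tolower(unlist(strsplit(x, ",")))
# [1] "abbau"   "abbaus"  "abbaues" "abbauen" "abbaue"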

OTHER TIPS

The German lists come in files named SentiWS_v1.8c_Negative.txt and SentiWS_v1.8c_Positive.txt, not in the form you are loading them; the way you load the word lists only works for the English version:

pos.words = scan("~/positive-words.txt",what='character', comment.char=';')
neg.words = scan("~/negative-words.txt",what='character', comment.char=';')

Apart from that, the lists are in different formats.
The German version looks like this:

 Abbau|NN   -0.058  Abbaus,Abbaues,Abbauen,Abbaue  
 Abbruch|NN -0.0048 Abbruches,Abbrüche,Abbruchs,Abbrüchen  
 Abdankung|NN   -0.0048 Abdankungen
 Abdämpfung|NN  -0.0048 Abdämpfungen  
 Abfall|NN  -0.0048 Abfalles,Abfälle,Abfalls,Abfällen  
 Abfuhr|NN  -0.3367 Abfuhren  

The English version:

charismatic
charitable
charm
charming
charmingly
chaste
cheaper
cheapest

The German entries follow this pattern: word|NN\tnumber\t<similar words, comma separated>\n
The English entries follow this pattern: word\n
The heading of each document is also different, so you might want to skip it (the heading of the English list reads like an article, not tweets or words from tweets).

Solution: get the two files into the same format and then do whatever you want, or prepare your code to read both kinds of data.
Since your program already works for the English version, I suggest changing the format of the German list. You could change each separator or comma to a \n and then eliminate all the |NN tags and the numbers; a sketch of that conversion follows.
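
If you go that route, a minimal sketch of such a conversion (reusing the substitution from the Solution above; the output file name is just a placeholder) could be:

# read the raw SentiWS file: one "lemma|POS<tab>weight<tab>inflected forms" entry per line
lines <- readLines("SentiWS_v1.8c_Positive.txt", encoding = "UTF-8")
# drop the "|NN" tag and the weight, keeping the lemma and its inflected forms
lines <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", lines)
# put every word on its own line, lower-cased, like the English lists
words <- tolower(unlist(strsplit(lines, ",")))
# write a plain word-per-line file (placeholder name) that the original scan() call can read
con <- file("positive-words-de.txt", open = "wt", encoding = "UTF-8")
writeLines(words, con)
close(con)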

Licensed under: CC-BY-SA with attribution