Question

Using Stanford's sentiment classification routine (contained in the CoreNLP toolkit), I am trying to plot the "sentiment" of each sentence in a given document. The data from the sentiment classification essentially consists of five columns and n rows:

0.0374 0.1311 0.1502 0.5761 0.1052
0.0117 0.0301 0.1748 0.5980 0.1854
0.1261 0.7332 0.1182 0.0156 0.0069

Each row represents a sentence in the file fed to the classifier, and each column in that row holds the classifier's confidence that the sentence carries a particular sentiment: the first column is the confidence that the sentence is "very negative", the second that it is "somewhat negative", the third that it carries "no sentiment" (i.e. is descriptive), the fourth that it is "positive", and the fifth that it is "very positive".

For each row in the data, it's easy enough to identify the column with the maximum value and then plot those values in sequential order, using negative values if the maximum falls in the first or second column, zero if it falls in the third, and positive values if it falls in the fourth or fifth:

[Plot: each sentence's maximum-confidence sentiment score, in sentence order]
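For reference, a minimal sketch of that maximum-confidence mapping in R (assuming the probabilities are in an n-by-5 matrix P; the -2..2 score vector is one illustrative choice):

# map each sentence to the score of its highest-confidence class
scores  <- c(-2, -1, 0, 1, 2)                # very negative ... very positive
top     <- apply(P, 1, which.max)            # column index of the max in each row
maxsent <- data.frame(n = seq_len(nrow(P)), sentiment = scores[top])
# maxsent can then be plotted in sentence order, e.g.
# ggplot(maxsent, aes(n, sentiment)) + geom_point()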

If I only plot the sentiment score with the largest confidence value in each row, though (which is what I've done in the plot above), I end up throwing out four of the five columns of data for each row. Is it possible to represent all the columns of this data in a reasonably intuitive fashion using ggplot2? I realize that this question is borderline off-topic on SO, but I thought that others with more familiarity with ggplot (and dataviz more broadly) might be able to point me towards a better visualization method for my data structure. In any event, I would be eager to hear others' thoughts on this question.


Solution

So here is an expansion on my comment, using your Romeo and Juliet dataset.

Rather than using the maximum probability for each sentence as a surrogate for sentiment, you could use the weighted average, or "expected sentiment". This is calculated, for each sentence, as:

$$E(S_i) = \sum_{j=1}^{5} p_{i,j}\,L_j, \qquad L = (-2,\,-1,\,0,\,1,\,2),$$

where $p_{i,j}$ is the probability in row $i$, column $j$, and $L_j$ is the $j$th element of the Likert-style score vector $L$.
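For the first sample row above, for instance, $E(S_1) = -2(0.0374) - 1(0.1311) + 0(0.1502) + 1(0.5761) + 2(0.1052) \approx 0.58$: mildly positive, consistent with most of the probability mass sitting in the fourth column.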

You could also calculate the uncertainty in sentiment for each sentence as:

$$V(S_i) = \sum_{j=1}^{5} p_{i,j}\,\bigl[L_j - E(S_i)\bigr]^2,$$

with standard deviation $SD(S_i) = \sqrt{V(S_i)}$.
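Continuing with the first sample row, $V(S_1) = 0.0374(-2 - 0.58)^2 + \cdots + 0.1052(2 - 0.58)^2 \approx 0.94$, so $SD(S_1) \approx 0.97$; the probability mass is spread widely enough that the "mildly positive" estimate carries considerable uncertainty.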

In R code:

library(ggplot2)
library(reshape2)

P <- read.csv("romeo.and.juliet.txt", sep=" ",
              header=FALSE)                # file you provided; no header row, as in the sample above
P <- as.matrix(P)                          # needs to be a matrix for %*%

# calculate expected sentiment, E(S), based on the Likert scale
# E(S) = sum(p_ij * L_j)   [L_j in -2:2]
L  <- c(-2, -1, 0, 1, 2)   # Likert scale
ES <- P %*% L              # E(S), one value per sentence
sentiment <- data.frame(n=1:length(ES), ES)

# calculate sentiment variability for each sentence
# V(S)  = sum(p_ij * (L_j - E(S))^2)
# SD(S) = sqrt(V(S))
LL <- matrix(rep(L, each=nrow(P)), ncol=ncol(P))  # each row of LL is a copy of L
LL <- apply(LL, 2, function(X) X - ES)            # deviations: L_j - E(S_i)
LL.sq <- LL^2
VS <- P %*% t(LL.sq)       # diagonal holds the probability-weighted sums
SD <- sqrt(diag(VS))
sentiment$SD <- SD

# reshape to long format for plotting with ggplot
gg <- melt(sentiment, id="n")
ggplot(gg, aes(x=n, y=value, color=variable)) +
  geom_point(size=1.5, alpha=.5) +
  stat_smooth(method="loess", size=1) +
  facet_grid(variable~., scales="free_y") +
  scale_color_discrete(name="", labels=c("Expected Sentiment", "Uncertainty")) +
  theme(legend.position="bottom")
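As an aside, P %*% t(LL.sq) materializes a full n-by-n matrix only to read off its diagonal, which can be wasteful for long documents. A minimal equivalent sketch, assuming the objects defined above:

# same V(S) values without forming the n-by-n matrix
VS.alt <- rowSums(P * LL.sq)             # sum_j p_ij * (L_j - E(S_i))^2, per row
all.equal(unname(VS.alt), diag(VS))      # TRUE: matches the diagonal above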

This data looks completely random, which makes me question the classification method (is it appropriate for Elizabethan English?). Nevertheless, this does illustrate the technique.

Licensed under: CC-BY-SA with attribution