Question

from __future__ import division
import urllib
import json
from math import log


def hits(word1,word2=""):
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        results = urllib.urlopen(query % word1)
    else:
        results = urllib.urlopen(query % word1+" "+"AROUND(10)"+" "+word2)
    json_res = json.loads(results.read())
    google_hits=int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits


def so(phrase):
    num = hits(phrase,"excellent")
    #print num
    den = hits(phrase,"poor")
    #print den
    ratio = num / den
    #print ratio
    sop = log(ratio)
    return sop

print so("ugly product")

I need this code to calculate the Pointwise Mutual Information (PMI), which can be used to classify reviews as positive or negative. Basically, I am using the technique specified by Turney (2002) (http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf) as an example of an unsupervised classification method for sentiment analysis.

As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor" and positive if it is more strongly associated with the word "excellent".

The code above calculates the SO of a phrase. I use Google to get the number of hits and compute the SO, since AltaVista is no longer available.
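For reference, Turney defines SO as PMI(phrase, "excellent") - PMI(phrase, "poor"), which (if I read the paper correctly) reduces to a single base-2 log ratio that also involves the standalone hit counts of "excellent" and "poor"; my so() above drops those two counts. A rough sketch of the fuller formula, reusing the hits() helper:

def so_pmi(phrase):
    # SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
    #            = log2( hits(phrase NEAR "excellent") * hits("poor") /
    #                    (hits(phrase NEAR "poor") * hits("excellent")) )
    num = hits(phrase, "excellent") * hits("poor")
    den = hits(phrase, "poor") * hits("excellent")
    return log(num / den, 2)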

The values computed are very erratic and don't follow any particular pattern. For example, SO("ugly product") turns out to be 2.85462098541 while SO("beautiful product") is 1.71395061117, whereas the former is expected to be negative and the latter positive.

Is there something wrong with the code? Is there an easier way to calculate the SO of a phrase (using PMI) with any Python library, say NLTK? I tried NLTK but was not able to find any explicit method that computes the PMI.

No correct solution

OTHER TIPS

Generally, calculating PMI is tricky, since the formula changes depending on the size of the ngram you want to take into consideration.

Mathematically, for bigrams, you can simply consider:

log(p(a,b) / ( p(a) * p(b) ))

Programmatically, assuming you have already calculated all the unigram and bigram frequencies in your corpus, you can do this:

import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values()))
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values()))
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values()))
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2)
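
For example, with some made-up toy counts (the Counter contents below are purely illustrative), calling the pmi() function defined above:

from collections import Counter

# hypothetical toy frequencies, only to demonstrate the call
unigram_freq = Counter({'foo': 4, 'bar': 4, 'this': 2, 'sentence': 2})
bigram_freq = Counter({'foo bar': 3, 'this is': 2, 'bar sentence': 1})

print(pmi('foo', 'bar', unigram_freq, bigram_freq))  # ~2.17 with log base 2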

This is a code snippet from an MWE library that is still in its pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Do note that it's meant for parallel MWE extraction, so here's how you can "hack" it to extract monolingual MWEs:

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt
$ printf "" > trg.txt
$ python
>>> import codecs
>>> from mwe import load_ngramfreq, extract_mwe

>>> # Calculates the unigrams and bigrams counts.
>>> # Put more loosely: "training a bigram 'language model'."
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt')

>>> sent = "This is another foo bar sentence not in the training corpus ."

>>> for threshold in range(-2, 4):
...     print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)]

[out]:

-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
0 ['this is', 'foo bar', 'bar sentence']
1 ['this is', 'foo bar', 'bar sentence']
2 ['this is', 'foo bar', 'bar sentence']
3 ['foo bar', 'bar sentence']
4 []

For further details, I find this thesis a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification", see http://goo.gl/5ebTJJ
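
Also, since the question mentions NLTK: NLTK's collocations module does ship a PMI association measure, so you can score the bigrams of a corpus by PMI without any web hit counts. A minimal sketch (the toy token list is made up, and a recent NLTK version is assumed):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = "this is a foo bar sentence and another foo bar sentence".split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# score every bigram in the text by pointwise mutual information
for bigram, score in finder.score_ngrams(bigram_measures.pmi):
    print("%s %s -> %.3f" % (bigram[0], bigram[1], score))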

The Python library DISSECT contains a few methods to compute Pointwise Mutual Information on co-occurrence matrices.

Example:

#ex03.py
#-------
from composes.utils import io_utils
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting

#create a space from co-occurrence counts in sparse format
my_space = io_utils.load("./data/out/ex01.pkl")

#print the co-occurrence matrix of the space
print my_space.cooccurrence_matrix

#apply ppmi weighting
my_space = my_space.apply(PpmiWeighting())

#print the co-occurrence matrix of the transformed space
print my_space.cooccurrence_matrix

Code on GitHub for the PMI methods.
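
Under the hood, PPMI weighting is just positive PMI, i.e. max(0, PMI) computed cell by cell from the co-occurrence counts. A rough NumPy sketch of that transformation (not DISSECT's actual code; the toy matrix is made up):

import numpy as np

def ppmi(counts):
    # counts: rows = target words, columns = context words
    total = counts.sum()
    p_ij = counts / total                              # joint probabilities
    p_i = counts.sum(axis=1, keepdims=True) / total    # row marginals
    p_j = counts.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide='ignore'):
        pmi = np.log2(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0)                          # clip negatives and -inf to 0

toy = np.array([[4., 0., 1.],
                [1., 3., 0.]])
print(ppmi(toy))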

Reference: Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the System Demonstrations of ACL 2013, Sofia, Bulgaria.

Related: Calculating pointwise mutual information between two strings

To answer why your results are erratic: Google Search is not a dependable source of word frequencies. The counts it returns are mere estimates, which are particularly inaccurate and possibly contradictory when querying for multiple words. This is not to bash Google, but it is simply not a utility for frequency counts. Your implementation may therefore be fine, but the results it produces on that basis can still be nonsensical.

For a more in-depth discussion of the matter, read "Googleology is bad science" by Adam Kilgarriff.

Licensed under: CC-BY-SA with attribution