Question

from __future__ import division
import urllib
import json
from math import log


def hits(word1,word2=""):
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        results = urllib.urlopen(query % word1)
    else:
        results = urllib.urlopen(query % word1+" "+"AROUND(10)"+" "+word2)
    json_res = json.loads(results.read())
    google_hits=int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits


def so(phrase):
    num = hits(phrase,"excellent")
    #print num
    den = hits(phrase,"poor")
    #print den
    ratio = num / den
    #print ratio
    sop = log(ratio)
    return sop

print so("ugly product")

I need this code to calculate the Pointwise Mutual Information (PMI), which can be used to classify reviews as positive or negative. Basically, I am using the technique specified by Turney (2002) (http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf) as an example of an unsupervised classification method for sentiment analysis.

As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor" and positive if it is more strongly associated with the word "excellent".

The code above calculates the SO of a phrase. I use Google to get the number of hits and compute the SO, since AltaVista is no longer available.
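For reference, Turney defines SO as PMI(phrase, "excellent") - PMI(phrase, "poor"), which (if I read the paper correctly) reduces to a single base-2 log ratio that also involves the standalone hit counts of "excellent" and "poor"; my so() above drops those two counts. A rough sketch of the fuller formula, reusing the hits() helper:

def so_pmi(phrase):
    # SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
    #            = log2( hits(phrase NEAR "excellent") * hits("poor") /
    #                    (hits(phrase NEAR "poor") * hits("excellent")) )
    num = hits(phrase, "excellent") * hits("poor")
    den = hits(phrase, "poor") * hits("excellent")
    return log(num / den, 2)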

The values computed are very erratic and don't follow any particular pattern. For example, SO("ugly product") turns out to be 2.85462098541 while SO("beautiful product") is 1.71395061117, whereas the former is expected to be negative and the latter positive.

Is there something wrong with the code? Is there an easier way to calculate the SO of a phrase (using PMI) with any Python library, say NLTK? I tried NLTK but was not able to find any explicit method that computes the PMI.

No correct solution

OTHER TIPS

Generally, calculating PMI is tricky, since the formula changes depending on the size of the ngram you want to take into consideration.

Mathematically, for bigrams, you can simply consider:

log(p(a,b) / ( p(a) * p(b) ))

Programmatically, assuming you have already calculated all the unigram and bigram frequencies in your corpus, you can do this:

import math

def pmi(word1, word2, unigram_freq, bigram_freq):
    prob_word1 = unigram_freq[word1] / float(sum(unigram_freq.values()))
    prob_word2 = unigram_freq[word2] / float(sum(unigram_freq.values()))
    prob_word1_word2 = bigram_freq[" ".join([word1, word2])] / float(sum(bigram_freq.values()))
    return math.log(prob_word1_word2 / float(prob_word1 * prob_word2), 2)
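
For example, with some made-up toy counts (the Counter contents below are purely illustrative), calling the pmi() function defined above:

from collections import Counter

# hypothetical toy frequencies, only to demonstrate the call
unigram_freq = Counter({'foo': 4, 'bar': 4, 'this': 2, 'sentence': 2})
bigram_freq = Counter({'foo bar': 3, 'this is': 2, 'bar sentence': 1})

print(pmi('foo', 'bar', unigram_freq, bigram_freq))  # ~2.17 with log base 2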

This is a code snippet from an MWE library that is still in its pre-development stage (https://github.com/alvations/Terminator/blob/master/mwe.py). Do note that it's meant for parallel MWE extraction, so here's how you can "hack" it to extract monolingual MWEs:

$ wget https://dl.dropboxusercontent.com/u/45771499/mwe.py
$ printf "This is a foo bar sentence .\nI need multi-word expression from this text file.\nThe text file is messed up , I know you foo bar multi-word expression thingy .\n More foo bar is needed , so that the text file is populated with some sort of foo bar bigrams to extract the multi-word expression ." > src.txt
$ printf "" > trg.txt
$ python
>>> import codecs
>>> from mwe import load_ngramfreq, extract_mwe

>>> # Calculates the unigrams and bigrams counts.
>>> # Put more loosely: "training a bigram 'language model'."
>>> unigram, bigram, _ , _ = load_ngramfreq('src.txt','trg.txt')

>>> sent = "This is another foo bar sentence not in the training corpus ."

>>> for threshold in range(-2, 4):
...     print threshold, [mwe for mwe in extract_mwe(sent.strip().lower(), unigram, bigram, threshold)]

[out]:

-2 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
-1 ['this is', 'is another', 'another foo', 'foo bar', 'bar sentence', 'sentence not', 'not in', 'in the', 'the training', 'training corpus', 'corpus .']
0 ['this is', 'foo bar', 'bar sentence']
1 ['this is', 'foo bar', 'bar sentence']
2 ['this is', 'foo bar', 'bar sentence']
3 ['foo bar', 'bar sentence']
4 []

For further details, I find this thesis a quick and easy introduction to MWE extraction: "Extending the Log Likelihood Measure to Improve Collocation Identification", see http://goo.gl/5ebTJJ
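
Also, since the question mentions NLTK: NLTK's collocations module does ship a PMI association measure, so you can score the bigrams of a corpus by PMI without any web hit counts. A minimal sketch (the toy token list is made up, and a recent NLTK version is assumed):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = "this is a foo bar sentence and another foo bar sentence".split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# score every bigram in the text by pointwise mutual information
for bigram, score in finder.score_ngrams(bigram_measures.pmi):
    print("%s %s -> %.3f" % (bigram[0], bigram[1], score))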

The Python library DISSECT contains a few methods to compute Pointwise Mutual Information on co-occurrence matrices.

Example:

#ex03.py
#-------
from composes.utils import io_utils
from composes.transformation.scaling.ppmi_weighting import PpmiWeighting

#create a space from co-occurrence counts in sparse format
my_space = io_utils.load("./data/out/ex01.pkl")

#print the co-occurrence matrix of the space
print my_space.cooccurrence_matrix

#apply ppmi weighting
my_space = my_space.apply(PpmiWeighting())

#print the co-occurrence matrix of the transformed space
print my_space.cooccurrence_matrix

Code on GitHub for the PMI methods.
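
Under the hood, PPMI weighting is just positive PMI, i.e. max(0, PMI) computed cell by cell from the co-occurrence counts. A rough NumPy sketch of that transformation (not DISSECT's actual code; the toy matrix is made up):

import numpy as np

def ppmi(counts):
    # counts: rows = target words, columns = context words
    total = counts.sum()
    p_ij = counts / total                              # joint probabilities
    p_i = counts.sum(axis=1, keepdims=True) / total    # row marginals
    p_j = counts.sum(axis=0, keepdims=True) / total    # column marginals
    with np.errstate(divide='ignore'):
        pmi = np.log2(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0)                          # clip negatives and -inf to 0

toy = np.array([[4., 0., 1.],
                [1., 3., 0.]])
print(ppmi(toy))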

Reference: Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the System Demonstrations of ACL 2013, Sofia, Bulgaria.

Related: Calculating pointwise mutual information between two strings

To answer why your results are erratic: Google Search is not a dependable source of word frequencies. The counts it returns are mere estimates, which are particularly inaccurate and possibly contradictory when querying for multiple words. This is not to bash Google, but it is simply not a utility for frequency counts. Your implementation may therefore be fine, but the results it produces on that basis can still be nonsensical.

For a more in-depth discussion of the matter, read "Googleology is bad science" by Adam Kilgarriff.

Licensed under: CC-BY-SA with attribution