Question

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that the words "Barack" and "Obama" appear together more often (i.e., have a positive correlation) than other word pairs.

This does not seem all that difficult. However, to be honest, I only know how to calculate the correlation between two sequences of numbers, not between two words in a text.

  1. How can I best approach this problem?
  2. How can I calculate the correlation between words?

I thought of using conditional probabilities, since e.g. "Barack Obama" is much more probable than "Obama Barack". However, the problem I am trying to solve is more fundamental and does not depend on the ordering of the words.


Solution

The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.

OTHER TIPS

Well, a simple way to approach your question is to arrange the data in a 2x2 contingency matrix:

            obama | not obama
barack        A   |     B
not barack    C   |     D

and score all occurring bigrams in the matrix. That way you can, for instance, use a simple chi-squared test.
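
As a minimal sketch of that scoring in Python (the tokenization, the function name chi_squared_2x2, and the example text are my own illustration, not part of the answer), the four cells can be filled from the bigram list and plugged into the standard 2x2 chi-squared formula:

def chi_squared_2x2(tokens, w1, w2):
    # Count the four cells of the 2x2 contingency table over all bigrams.
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    a = sum(1 for x, y in bigrams if x == w1 and y == w2)  # barack, obama
    b = sum(1 for x, y in bigrams if x == w1 and y != w2)  # barack, not obama
    c = sum(1 for x, y in bigrams if x != w1 and y == w2)  # not barack, obama
    d = n - a - b - c                                      # not barack, not obama
    # Chi-squared statistic for a 2x2 table:
    # n * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d))
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

tokens = "barack obama met the press and obama spoke".split()
print(chi_squared_2x2(tokens, "barack", "obama"))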

I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.

Suppose the text has length N, say it is an array

text[0], text[1], ..., text[N-1]

Suppose the following words appear in the text

word[0], word[1], ..., word[k]

For each word word[i], define a 0/1 vector X[i] of length N-1, as follows: the jth entry of X[i] is 1 if word[i] is either text[j] or text[j+1], and zero otherwise.

# compute the vector X[i] for word[i]
X[i] = [0] * (N - 1)
for j in range(N - 1):
    if text[j] == word[i] or text[j + 1] == word[i]:
        X[i][j] = 1

Then you can compute the correlation coefficient between word[a] and word[b] as the dot product of X[a] and X[b] (note that this dot product is the number of times the two words are adjacent) divided by the product of their lengths (the length of X[i] is the square root of the number of its nonzero entries, which is roughly twice the number of appearances of word[i]). Call this quantity COR(X[a], X[b]). Clearly COR(X[a], X[a]) = 1, and COR(X[a], X[b]) is larger when word[a] and word[b] are often adjacent.
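
A minimal sketch of the whole construction in Python (the helper names adjacency_vector and cor and the example text are mine; the answer itself is language-agnostic):

import math

def adjacency_vector(text, w):
    # Entry j is 1 when w is text[j] or text[j+1], else 0.
    return [1 if text[j] == w or text[j + 1] == w else 0
            for j in range(len(text) - 1)]

def cor(text, wa, wb):
    xa = adjacency_vector(text, wa)
    xb = adjacency_vector(text, wb)
    dot = sum(p * q for p, q in zip(xa, xb))  # number of adjacencies
    na = math.sqrt(sum(xa))                   # length of X[a]
    nb = math.sqrt(sum(xb))                   # length of X[b]
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

text = "barack obama spoke and obama answered".split()
print(cor(text, "barack", "obama"))   # 0.5
print(cor(text, "barack", "barack"))  # 1.0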

This can be generalized from "adjacent" to other notions of nearness; for example, we could have used blocks of 3 (or 4, 5, etc.) words instead. One could also add weights, and probably many other refinements, if desired. One would have to experiment to see what is useful, if any of it is of use at all.
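
For instance, the windowed variant only changes the membership test in the vector construction (the window size k is my own parameter name, shown with a default of k = 3; with k = 2 it reduces to the adjacency vector above):

def window_vector(text, w, k=3):
    # Entry j is 1 when w appears anywhere in the k-word block text[j : j + k].
    return [1 if w in text[j:j + k] else 0
            for j in range(len(text) - k + 1)]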

This problem sounds like a bigram problem: a bigram is a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.

If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President" has 8 words, so there are 8 choose 2 = 28 possible pairs.

You can then ask statistical questions like, "In how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is 'Barack' paired with 'Obama'.

Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
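
A rough sketch of that counting in Python (the enumeration via itertools.combinations is my reading of the answer: it yields each position pair exactly once, in sentence order):

from itertools import combinations

sentence = "Barack Obama is the Democratic candidate for President".split()

pairs = list(combinations(sentence, 2))  # all 8 choose 2 = 28 position pairs
with_barack = [p for p in pairs if "Barack" in p]
barack_then_obama = [p for p in pairs if p == ("Barack", "Obama")]

print(len(pairs))              # 28
print(len(with_barack))        # 7 pairs include 'Barack'
print(len(barack_then_obama))  # in only 1 of them does 'Obama' follow 'Barack'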

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow