Question

I am new to NLP and NLTK, and I want to find ambiguous words, meaning words with at least n different tags. I have this method, but the output is more than confusing.

Code:

def MostAmbiguousWords(words, n):
    # wordsUniqeTags holds a list of uniqe tags that have been observed for a given word
    wordsUniqeTags = {}
    for (w,t) in words:
        if wordsUniqeTags.has_key(w):
            wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
        else:
            wordsUniqeTags[w] = set([t])
    # Starting to count
    res = []
    for w in wordsUniqeTags:
        if len(wordsUniqeTags[w]) >= n:
            res.append((w, wordsUniqeTags[w]))

    return res

MostAmbiguousWords(brown.tagged_words(), 13)

Output:

[("what's", set(['C', 'B', 'E', 'D', 'H', 'WDT+BEZ', '-', 'N', 'T', 'W', 'V', 'Z', '+'])),
("who's", set(['C', 'B', 'E', 'WPS+BEZ', 'H', '+', '-', 'N', 'P', 'S', 'W', 'V', 'Z'])),
("that's", set(['C', 'B', 'E', 'D', 'H', '+', '-', 'N', 'DT+BEZ', 'P', 'S', 'T', 'W', 'V', 'Z'])),
('that', set(['C', 'D', 'I', 'H', '-', 'L', 'O', 'N', 'Q', 'P', 'S', 'T', 'W', 'CS']))]

Now I have no idea what B, C, Q, etc. could represent. So, my questions:

  • What are these?
  • What do they mean? (In case they are tags)
  • I think they are not tags, because "who's" and "what's" don't have the WH tag that indicates "wh question words".

I'll be happy if someone could post a link that includes a mapping of all possible tags and their meaning.


Solution

It looks like you have a typo. In this line:

wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)

you should have set([t]) (not set(t)), like you do in the else case.

This explains the behavior you're seeing: t is a string, so set(t) builds a set from the individual characters of that string. What you want is set([t]), which builds a set whose only element is t.

>>> t = 'WHQ'
>>> set(t)
set(['Q', 'H', 'W'])    # bad
>>> set([t])
set(['WHQ'])            # good

By the way, you can correct the problem and simplify things by just changing that line to:

wordsUniqeTags[w].add(t)

But, really, you should make use of the setdefault method on dict and list comprehension syntax to improve the method overall. So try this instead:

def most_ambiguous_words(words, n):
  # wordsUniqeTags maps each word to the set of unique tags that have been observed for it
  wordsUniqeTags = {}
  for (w,t) in words:
    wordsUniqeTags.setdefault(w, set()).add(t)
  # Keep only the words that have at least n distinct tags
  return [(word,tags) for word,tags in wordsUniqeTags.iteritems() if len(tags) >= n]
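
As a quick check of the fix, here is a minimal sketch that runs the setdefault approach on a small hand-made list of (word, tag) pairs instead of the Brown corpus. The sample data and the name unique_tags are made up for illustration, and the sketch assumes Python 3, so it calls items() where the Python 2 code above calls iteritems().

```python
def most_ambiguous_words(words, n):
    # Map each word to the set of whole tags observed for it.
    unique_tags = {}
    for word, tag in words:
        unique_tags.setdefault(word, set()).add(tag)
    # Keep only the words that carry at least n distinct tags.
    return [(word, tags) for word, tags in unique_tags.items() if len(tags) >= n]


# Hypothetical sample data for illustration, not read from the Brown corpus.
sample = [
    ("that", "CS"), ("that", "DT"), ("that", "WPS"), ("that", "CS"),
    ("what's", "WDT+BEZ"),
    ("dog", "NN"),
]

result = most_ambiguous_words(sample, 2)
print(result)
```

Only "that" should come back, with its three tags kept whole; "what's" keeps the single tag 'WDT+BEZ' rather than being split into characters.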

OTHER TIPS

You are splitting your POS tags into single characters in this line:

    wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)

set('AT') results in set(['A', 'T']).

How about making use of the Counter and defaultdict functionality in the collections module?

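from collections import defaultdict, Counter

def most_ambiguous_words(words, n):
    # counts[word][tag] is the number of times word was seen with tag
    counts = defaultdict(Counter)
    for (word,tag) in words:
        counts[word][tag] += 1
    # Keep only the words that have at least n distinct tags
    return [(w, counts[w].keys()) for w in counts if len(counts[w]) >= n]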
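
One possible advantage of the Counter approach is that it also records how often each tag occurs for a word. The following is only a sketch under assumptions: the helper names tag_counts and ambiguous_with_tags and the sample data are hypothetical, and the code targets Python 3.

```python
from collections import Counter, defaultdict


def tag_counts(words):
    # counts[word][tag] is the number of times word was seen with tag.
    counts = defaultdict(Counter)
    for word, tag in words:
        counts[word][tag] += 1
    return counts


def ambiguous_with_tags(words, n):
    # Return (word, sorted tags) for every word with at least n distinct tags.
    counts = tag_counts(words)
    return [(w, sorted(c)) for w, c in counts.items() if len(c) >= n]


# Hypothetical sample data for illustration, not read from the Brown corpus.
pairs = [
    ("that", "CS"), ("that", "DT"), ("that", "WPS"), ("that", "CS"),
    ("dog", "NN"),
]

per_word = tag_counts(pairs)
ambiguous = ambiguous_with_tags(pairs, 3)
print(per_word["that"].most_common(1))
print(ambiguous)
```

Here most_common(1) reports the most frequent tag for "that" together with its count, which the plain set-based version cannot provide.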
Licensed under: CC-BY-SA with attribution