Question

I am new to Python, NLP, and NLTK, so please bear with me. I have a handful of articles (~200) that have been categorized by hand, and I am looking to develop a heuristic to assist with/recommend categories. To start, I was hoping to build a relationship between the current categories and the words within the documents.

My premise is that the nouns matter more to the category than any other part of speech. For example, the category "Energy" is probably driven almost entirely by nouns: oil, battery, wind, etc.
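To make that concrete, this is roughly the category-to-noun relationship I have in mind (just a rough sketch; articles_by_category is made-up example data):

from collections import Counter
import nltk

# made-up example data: category -> list of article texts
articles_by_category = {
    'Energy': ['Oil and wind stocks rallied while battery makers slid.'],
}

noun_freq = {}
for category, texts in articles_by_category.items():
    counts = Counter()
    for text in texts:
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        # NN, NNS, NNP and NNPS all start with 'NN'
        counts.update(word.lower() for word, tag in tagged if tag.startswith('NN'))
    noun_freq[category] = counts

print(noun_freq['Energy'].most_common(5))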

The first thing I wanted to do was tag the parts of speech and evaluate them. On the first article I ran into an issue: some of the tokens are bound to punctuation.

import nltk

for article in articles[1]:
    article_id, content = article
    # strip HTML and normalize curly apostrophes
    # (note: nltk.clean_html was removed in NLTK 3; this targets older NLTK)
    clean = nltk.clean_html(content).replace('’', "'")
    tokens = nltk.word_tokenize(clean)
    pos_document = nltk.pos_tag(tokens)
    # group words by their part-of-speech tag
    pos = {}
    for word, part in pos_document:
        if part in pos:
            pos[part].append(word)
        else:
            pos[part] = [word]

Formatted output:

{
'VBG': ['continuing', 'paying', 'falling', 'starting'], 
'VBD': ['made', 'ended'], 'VBN': ['been', 'leaned', 'been', 'been'], 
'VBP': ['know', 'hasn', 'have', 'continue', 'expect', 'take', 'see', 'have', 'are'], 
'WDT': ['which', 'which'], 'JJ': ['negative', 'positive', 'top', 'modest', 'negative', 'real', 'financial', 'isn', 'important', 'long', 'short', 'next'], 
'VBZ': ['is', 'has', 'is', 'leads', 'is', 'is'], 'DT': ['Another', 'the', 'the', 'any', 'any', 'the', 'the', 'a', 'the', 'the', 'the', 'the', 'a', 'the', 'a', 'a', 'the', 'a', 'the', 'any'], 
'RP': ['back'], 
'NN': [ 'listless', 'day', 'rsquo', 'll', 'progress', 'rsquo', 't', 'news', 'season', 'corner', 'surprise', 'stock', 'line', 'growth', 'question', 
        'stop', 'engineering', 'growth', 'isn', 'rsquo', 't', 'rsquo', 't', 'stock', 'market', 'look', 'junk', 'bond', 'market', 'turning', 'junk', 
        'rock', 'history', 'guide', 't', 'day', '%', '%', '%', 'level', 'move', 'isn', 'rsquo', 't', 'indication', 'way'], 
',': [',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ','], '.': ['.'], 
'TO': ['to', 'to', 'to', 'to', 'to', 'to', 'to'], 
'PRP': ['them', 'they', 'they', 'we', 'you', 'they', 'it'], 
'RB': ['then', 'there', 'just', 'just', 'always', 'so', 'so', 'only', 'there', 'right', 'there', 'much', 'typically', 'far', 'certainly'], 
':': [';', ';', ';', ';', ';', ';', ';'], 
'NNS': ['folks', 'companies', 'estimates', 'covers', 's', 'equities', 'bonds', 'equities', 'flats'], 
'NNP': ['drift.', 'We', 'Monday', 'DC', 'note.', 'Earnings', 'EPS', 'same.', 'The', 'Street', 'now.', 'Since', 'points.', 'What', 'behind.', 'We', 'flat.', 'The'], 
'VB': ['get', 'manufacture', 'buy', 'boost', 'look', 'see', 'say', 'let', 'rsquo', 'rsquo', 'be', 'build', 'accelerate', 'be'], 
'WRB': ['when', 'where'], 
'CC': ['&', 'and', '&', 'and', 'and', 'or', 'and', '&', '&', '&', 'and', '&', 'and', 'but', '&'], 
'CD': ['47', '23', '30'], 
'EX': ['there'], 
'IN': ['on', 'if', 'until', 'of', 'around', 'as', 'on', 'down', 'since', 'of', 'for', 'under', 'that', 'about', 'at', 'at', 'that', 'like', 'if'], 
'MD': ['can', 'will', 'can', 'can', 'will'], 
'JJR': ['more']
}

Notice the word 'drift.' under NNP: shouldn't the period be removed? Do I need to strip it myself, or am I missing something with the library?


Solution

NLTK's word tokenizer assumes that its input has already been split into sentences, so to get it to work you need to call sent_tokenize on your input first. You can then feed each sentence from sent_tokenize into word_tokenize; typically you would iterate over your sentences like this:

for article in articles[1]:
    article_id, content = article
    clean = nltk.clean_html(content).replace('’', "'")
    # split into sentences first, then tokenize each sentence
    sents = nltk.sent_tokenize(clean)
    pos = {}
    for sent in sents:
        tokens = nltk.word_tokenize(sent)
        pos_document = nltk.pos_tag(tokens)
        for word, part in pos_document:
            if part in pos:
                pos[part].append(word)
            else:
                pos[part] = [word]

I believe the reason this is necessary is to help distinguish periods that end sentences from periods used in abbreviations (e.g. you wouldn't want "Mr. Smith" to be broken into 'Mr', '.', 'Smith').
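As a quick illustration of that (assuming the punkt models have been downloaded via nltk.download('punkt')):

text = "Mr. Smith went to Washington. He arrived on Monday."

for sent in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sent))

# ['Mr.', 'Smith', 'went', 'to', 'Washington', '.']
# ['He', 'arrived', 'on', 'Monday', '.']

The punkt model treats 'Mr.' as an abbreviation and keeps it intact, while the sentence-final periods become their own tokens.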

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow