I am new to Python, NLP, and NLTK, so please bear with me. I have a handful of articles (~200) that have been categorized by hand. I am looking to develop a heuristic to assist with or recommend categories. To start, I was hoping to build a relationship between the current categories and the words within each document.
My premise is that nouns are more important to the category than any other part of speech. For example, the category "Energy" is probably driven almost entirely by nouns: oil, battery, wind, etc.
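To make the premise concrete, here is a minimal sketch of the noun filtering I have in mind. It assumes the tagger output is a list of (word, tag) pairs with Penn Treebank tags, which is what nltk.pos_tag returns; the helper name nouns_only and the sample data are made up for illustration.

```python
from collections import Counter

def nouns_only(tagged):
    """Keep words whose tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged if tag.startswith("NN")]

# Hand-written sample; real input would come from nltk.pos_tag(tokens).
sample = [
    ("oil", "NN"), ("prices", "NNS"), ("fell", "VBD"),
    ("on", "IN"), ("Monday", "NNP"), ("oil", "NN"),
]

noun_words = nouns_only(sample)
# Count how often each noun occurs, as a first crude feature per article.
noun_counts = Counter(noun_words)
```

The counts per article could then be compared against the hand-assigned categories, though I have not worked out that part yet.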
The first thing I wanted to do was tag the parts of speech and evaluate them. On the first article I ran into an issue: some of the tokens are still bound to punctuation.
import nltk

for articles in articles[1]:
    articles_id, content = articles
    # Strip the HTML markup and normalize curly apostrophes
    clean = nltk.clean_html(content).replace('’', "'")
    tokens = nltk.word_tokenize(clean)
    pos_document = nltk.pos_tag(tokens)
    # Group the words by their part-of-speech tag
    pos = {}
    for word, part in pos_document:
        pos.setdefault(part, []).append(word)
Formatted output:
{
'VBG': ['continuing', 'paying', 'falling', 'starting'],
'VBD': ['made', 'ended'], 'VBN': ['been', 'leaned', 'been', 'been'],
'VBP': ['know', 'hasn', 'have', 'continue', 'expect', 'take', 'see', 'have', 'are'],
'WDT': ['which', 'which'], 'JJ': ['negative', 'positive', 'top', 'modest', 'negative', 'real', 'financial', 'isn', 'important', 'long', 'short', 'next'],
'VBZ': ['is', 'has', 'is', 'leads', 'is', 'is'], 'DT': ['Another', 'the', 'the', 'any', 'any', 'the', 'the', 'a', 'the', 'the', 'the', 'the', 'a', 'the', 'a', 'a', 'the', 'a', 'the', 'any'],
'RP': ['back'],
'NN': [ 'listless', 'day', 'rsquo', 'll', 'progress', 'rsquo', 't', 'news', 'season', 'corner', 'surprise', 'stock', 'line', 'growth', 'question',
'stop', 'engineering', 'growth', 'isn', 'rsquo', 't', 'rsquo', 't', 'stock', 'market', 'look', 'junk', 'bond', 'market', 'turning', 'junk',
'rock', 'history', 'guide', 't', 'day', '%', '%', '%', 'level', 'move', 'isn', 'rsquo', 't', 'indication', 'way'],
',': [',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ','], '.': ['.'],
'TO': ['to', 'to', 'to', 'to', 'to', 'to', 'to'],
'PRP': ['them', 'they', 'they', 'we', 'you', 'they', 'it'],
'RB': ['then', 'there', 'just', 'just', 'always', 'so', 'so', 'only', 'there', 'right', 'there', 'much', 'typically', 'far', 'certainly'],
':': [';', ';', ';', ';', ';', ';', ';'],
'NNS': ['folks', 'companies', 'estimates', 'covers', 's', 'equities', 'bonds', 'equities', 'flats'],
'NNP': ['drift.', 'We', 'Monday', 'DC', 'note.', 'Earnings', 'EPS', 'same.', 'The', 'Street', 'now.', 'Since', 'points.', 'What', 'behind.', 'We', 'flat.', 'The'],
'VB': ['get', 'manufacture', 'buy', 'boost', 'look', 'see', 'say', 'let', 'rsquo', 'rsquo', 'be', 'build', 'accelerate', 'be'],
'WRB': ['when', 'where'],
'CC': ['&', 'and', '&', 'and', 'and', 'or', 'and', '&', '&', '&', 'and', '&', 'and', 'but', '&'],
'CD': ['47', '23', '30'],
'EX': ['there'],
'IN': ['on', 'if', 'until', 'of', 'around', 'as', 'on', 'down', 'since', 'of', 'for', 'under', 'that', 'about', 'at', 'at', 'that', 'like', 'if'],
'MD': ['can', 'will', 'can', 'can', 'will'],
'JJR': ['more']
}
Notice the word 'drift.' under NNP: shouldn't the period have been removed? Do I need to remove it on my own, or am I missing something in the library?
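For reference, if I do have to remove it myself, the manual cleanup I have in mind would look roughly like the sketch below. It is plain Python applied after tagging, not an NLTK feature, and the function name strip_trailing_punct is hypothetical. Since the tags were assigned before the cleanup, a word like 'drift' would presumably still carry the wrong NNP tag.

```python
import string

def strip_trailing_punct(tagged):
    """Remove trailing punctuation from each word in (word, tag) pairs."""
    cleaned = []
    for word, tag in tagged:
        stripped = word.rstrip(string.punctuation)
        # Tokens that are pure punctuation (',' or '%') are kept unchanged.
        cleaned.append((stripped or word, tag))
    return cleaned

# Hand-written sample based on the output above.
sample = [("drift.", "NNP"), ("note.", "NNP"), (",", ","), ("stock", "NN")]
cleaned = strip_trailing_punct(sample)
```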