Question

I am new to Natural Language Processing and want to learn more by building a simple project. NLTK was recommended to me as a popular NLP library, so I plan to use it.

Here is what I would like to do:

  • I want to scan our company's intranet pages; approximately 3K pages
  • I would like to parse these pages and sort their content into categories such as HR, Engineering, Corporate Pages, etc.

From what I have read so far, I can do this with Named Entity Recognition. I can describe entities for each category of pages, train the NLTK solution and run each page through to determine the category.

Is this the right approach? I appreciate any direction and ideas...

Thanks


Solution

It looks like you want to do text (document) classification, which is not quite the same as Named Entity Recognition. In NER the goal is to recognize named entities (proper names, places, institutions, etc.) in text. However, proper names can be very good features for text classification in a limited domain. For example, a page that mentions the name of the head engineer is likely to belong in Engineering.

The NLTK book has a chapter on basic text classification (Chapter 6, "Learning to Classify Text").
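
To make the idea concrete, here is a minimal sketch of document classification with NLTK's Naive Bayes classifier. The training pages, the category labels and the page_features helper are made up for illustration; in practice you would hand-label a sample of your intranet pages and would probably want richer features (for example, names of people or teams) and a held-out test set.

```python
import re
from nltk import NaiveBayesClassifier


def page_features(text):
    """Bag-of-words features: one boolean feature per lowercase word.

    Hypothetical helper, not part of NLTK. A simple regex is used instead of
    an NLTK tokenizer so that no extra data downloads are needed.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return {f"contains({word})": True for word in words}


# Hypothetical hand-labelled examples standing in for real intranet pages.
labelled_pages = [
    ("vacation policy and employee benefits enrollment", "HR"),
    ("payroll, benefits and parental leave forms", "HR"),
    ("build server deployment and code review guidelines", "Engineering"),
    ("api design, code style and deployment pipeline", "Engineering"),
]

# NLTK classifiers are trained on (feature_dict, label) pairs.
train_set = [(page_features(text), label) for text, label in labelled_pages]
classifier = NaiveBayesClassifier.train(train_set)

# Classify an unseen page.
new_page = "how to request a code review before deployment"
predicted = classifier.classify(page_features(new_page))
print(predicted)
```

With only four training examples this is a toy, but the same pattern (extract features, train on labelled pages, classify the rest) should carry over to a few thousand pages.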

Licensed under: CC-BY-SA with attribution