Question

I would like to find a good way of identifying names of people, places, etc. within users' search queries on my site. For example, if a user asks "how old is George Washington", I need to be able to tell from a predefined list that George Washington is a person.

Some of the lists will be global, and some will be user-specific. For example, if a user asks "how old is John Smith", I may only want to identify the particular John Smith who is that user's associate, and I wouldn't want to identify him as a person if he is not one of their associates.

Is there any NLP library, or any crawling of these lists I could do, that would let me leverage Soundex, mature NLP, misspelling correction, and similar functionality? I can write it by hand, but I would rather build on something mature. Thanks.


Solution

What you need is called Named Entity Recognition.

One of the best available tools for this ships with Stanford NLP: http://nlp.stanford.edu/software/CRF-NER.shtml (written in Java)

If you are on another platform, there are good open source projects in Ruby and Python. Search for "Named Entity Recognition".
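Because the question is about matching against predefined lists rather than open-ended recognition, a plain lookup with some tolerance for misspellings may be a useful first step before reaching for a full NER model. The following is only a minimal sketch using Python's standard difflib module. The names GLOBAL_ENTITIES, USER_ENTITIES, candidate_phrases and find_entities, the sample data, and the 0.85 cutoff are all assumptions made up for illustration, not part of any library.

```python
import difflib

# Hypothetical example data; a real site would load these from its own storage.
GLOBAL_ENTITIES = {
    "george washington": "PERSON",
    "new york": "PLACE",
}
USER_ENTITIES = {
    "alice": {"john smith": "PERSON"},  # entries visible to user "alice" only
}


def candidate_phrases(query, max_words=3):
    """Yield every run of 1..max_words consecutive words, longest first."""
    words = query.lower().replace("?", " ").split()
    for size in range(max_words, 0, -1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def find_entities(query, user, cutoff=0.85):
    """Return (name, label) pairs from the global list plus this user's list."""
    known = dict(GLOBAL_ENTITIES)
    known.update(USER_ENTITIES.get(user, {}))
    found = []
    for phrase in candidate_phrases(query):
        # get_close_matches tolerates small spelling differences.
        match = difflib.get_close_matches(phrase, list(known), n=1, cutoff=cutoff)
        if match:
            pair = (match[0], known[match[0]])
            if pair not in found:
                found.append(pair)
    return found


if __name__ == "__main__":
    print(find_entities("how old is George Washington", "alice"))
    print(find_entities("how old is John Smith", "alice"))
    print(find_entities("how old is John Smith", "bob"))
```

With this sample data, "John Smith" is reported for alice but not for a user who has no such associate. The similarity cutoff is a judgment call: a lower value catches more typos but also produces more false matches, so it would need tuning on real queries.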

Other tips

The particular Natural Language Processing (NLP) task that you're looking for is called Named Entity Recognition (NER).

Besides Stanford's CRF-NER (in Java), a popular Python option is the named-entity chunker in the Natural Language Toolkit (NLTK), which is often used as a baseline for NER tasks.

You can try installing NLTK, fetching the models it asks for with nltk.download(), and then running the following code:

>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> from nltk.chunk import ne_chunk
>>> sentence = "How old is John Smith?"
>>> ne_chunk(pos_tag(word_tokenize(sentence)))
Tree('S', [('How', 'WRB'), ('old', 'JJ'), ('is', 'VBZ'), Tree('PERSON', [('John', 'NNP'), ('Smith', 'NNP')]), ('?', '.')])
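To restrict the result to people a given user actually knows, the tree can be walked and filtered against that user's list. This is a rough sketch, not an NLTK feature: extract_entities and known_people are hypothetical helper names, and the code only assumes that entity subtrees expose label() and leaves() (as nltk.Tree does) while ordinary tokens are plain (word, tag) tuples.

```python
def extract_entities(chunked):
    """Collect (label, name) pairs from a tree such as the one ne_chunk returns."""
    entities = []
    for node in chunked:
        # Plain tokens are (word, tag) tuples; recognised entities are
        # subtrees that expose label() and leaves().
        if hasattr(node, "label"):
            name = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), name))
    return entities


def known_people(chunked, associates):
    """Keep only PERSON entities whose name is in the user's own list."""
    wanted = {name.lower() for name in associates}
    return [
        name
        for label, name in extract_entities(chunked)
        if label == "PERSON" and name.lower() in wanted
    ]
```

Called as known_people(ne_chunk(pos_tag(word_tokenize(sentence))), associates), where associates is the current user's list of names, this drops any recognised person who is not on that list. The comparison here is an exact, case-insensitive match; tolerance for misspellings would have to be added separately.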
Licensed under: CC-BY-SA with attribution