Efficient keyword detection / extraction. Predefined set of keywords

https://stackoverflow.com/questions/4863396

27-10-2019

Question

How can I efficiently extract keywords with relevance from a string? My list of keywords is predefined. For example, in an article about Michelle Obama that also mentions Barack Obama, I want to extract both Michelle Obama and Barack Obama, with the keyword Michelle Obama getting a higher relevance value (both are present in my keyword list).

Checking the string for the number of occurrences of each keyword doesn't seem very efficient. My application is developed in PHP, but any language is OK if I can do this efficiently.

I tried OpenCalais, but it is not detecting most of my keywords. Is it possible to extract keywords using Lucene?

Answer

Apache Lucene should suit you. Alternatively, if your documents have a title and paragraphs, you can do it yourself: filter out the stop words, give a higher rank to words that appear in the title, and then match those words (or their inflected forms) in the paragraphs. Articles on text summarization describe these techniques in more detail if you want to implement them yourself.
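As a rough illustration of the do-it-yourself approach, here is a minimal Python sketch, not a Lucene example. It assumes the keyword list is small enough to compile into a single regular expression, so each text is scanned once rather than once per keyword. The function names, the `title_weight` value of 3.0, and the sample texts are all hypothetical choices for illustration. Stop-word filtering and matching of inflected forms are left out.

```python
import re
from collections import Counter


def build_matcher(keywords):
    """Compile all keywords into one case-insensitive alternation."""
    # Longest first, so a longer keyword wins over a shorter one that
    # starts at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def score_keywords(title, body, keywords, title_weight=3.0):
    """Return (keyword, score) pairs, highest score first.

    A hit in the title counts title_weight; a hit in the body counts 1.0.
    The weight is an arbitrary assumption and should be tuned.
    """
    if not keywords:
        return []
    matcher = build_matcher(keywords)
    # Map the lowercased match back to the keyword as it was supplied.
    canonical = {k.lower(): k for k in keywords}
    scores = Counter()
    for text, weight in ((title, title_weight), (body, 1.0)):
        for match in matcher.finditer(text):
            scores[canonical[match.group(0).lower()]] += weight
    return scores.most_common()


keywords = ["Michelle Obama", "Barack Obama", "Joe Biden"]
title = "Michelle Obama speaks on education"
body = (
    "Michelle Obama was joined by Barack Obama. "
    "Later, Michelle Obama thanked the audience."
)
result = score_keywords(title, body, keywords)
print(result)
```

With these sample inputs, Michelle Obama scores higher than Barack Obama because of the title hit plus two body hits, and keywords that never appear are omitted. Note that the `\b` word boundaries assume keywords begin and end with letters or digits; for very large keyword lists, a dedicated multi-pattern matcher or a search index such as Lucene is likely a better fit.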

License: CC-BY-SA with attribution

Not affiliated with StackOverflow