Question

How can I efficiently extract keywords, each with a relevance score, from a string? My list of keywords is predefined. For example, in an article about Michelle Obama that also mentions Barack Obama, I want to extract both Michelle Obama and Barack Obama, with Michelle Obama getting the higher relevance value (both names are present in my keyword list).

Checking the string for the number of occurrences of each keyword doesn't seem very efficient. My application is developed in PHP, but any language is OK if I can do this efficiently.

I tried OpenCalais, but it is not detecting most of my keywords. Is it possible to extract keywords using Lucene?

Solution

Apache Lucene should suit your needs. Alternatively, if your text has a title and paragraphs, you can filter out the stop words, give higher ranks to the words that appear in the title, and then match those words (or their inflected forms) in the paragraphs. If you want to implement this yourself, text summarization articles describe these techniques in more detail.
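The title-weighting idea can be sketched in plain Python without Lucene. This is a minimal illustration under several assumptions, not a definitive implementation: the function name `keyword_relevance` and the `title_weight` value of 3.0 are hypothetical, keywords are assumed to start and end with word characters, and matching of inflected forms (stemming) is left out. Because the keyword list is predefined, all keywords are compiled into a single regular expression so that each text is scanned once rather than once per keyword.

```python
import re
from collections import Counter


def keyword_relevance(title, body, keywords, title_weight=3.0):
    """Score predefined keywords by weighted occurrence count.

    title_weight is an assumed, tunable value: a hit in the title counts
    this many times as much as a hit in the body.
    """
    # Map the lowercase form back to the keyword as the caller wrote it.
    canonical = {kw.lower(): kw for kw in keywords}
    # Longest alternatives first, so a longer phrase wins over its prefix.
    alternatives = sorted(canonical, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )

    scores = Counter()
    for text, weight in ((title, title_weight), (body, 1.0)):
        for match in pattern.finditer(text):
            scores[canonical[match.group(0).lower()]] += weight

    total = sum(scores.values())
    if not total:
        return {}
    # Normalize so the scores sum to 1, highest relevance first.
    return {kw: score / total for kw, score in scores.most_common()}


title = "Michelle Obama visits school"
body = (
    "Michelle Obama spoke to students. Barack Obama also attended. "
    "Michelle Obama thanked the teachers."
)
keywords = ["Michelle Obama", "Barack Obama", "Angela Merkel"]

for keyword, score in keyword_relevance(title, body, keywords).items():
    print(f"{keyword}: {score:.2f}")
```

Keywords that never occur are simply absent from the result. For very large keyword lists, a single alternation regex may become slow, and a dedicated multi-pattern matcher or an index such as Lucene is likely the better choice.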

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow