Question

Is there a Java library that, given a text (a title), returns the collection of important words in it?
EDITED: By important I mean the words that define the main idea of the sentence. Thank you.


Solution

You might want to take a look at Apache Mahout.

You might also want to read up on the tf-idf model, which is often used for cases similar to the one you describe.

EDIT: more info on Tf-Idf model:

The tf-idf model basically says 2 things:

  1. If a term appears many times in your data, it is probably important. [tf]
  2. If a term appears many times in the world, an appearance of it in your data is expected. If it is rare in general but still appears in your data, that indicates it is very "important". [idf]

The tf-idf model uses these assumptions to give each term a rating based on its tf and idf values.
To find the idf value, you might want to index your collection, or use a search engine API and estimate how common each term is from the number of results [note that the number returned by the engine is not exact, but it can serve as a rough estimate].
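
To make the two rules concrete, here is a minimal sketch in plain Java (no Mahout involved). It assumes you have some reference collection of texts to stand in for "the world". The class name `TfIdfSketch`, the tokenizer, the sample corpus, and the add-one smoothing in the idf formula are all my own illustrative choices, not part of any library:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class TfIdfSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Rate each term of the title against a reference collection of documents.
    static Map<String, Double> score(String title, List<String> corpus) {
        // Document frequency: number of corpus documents containing the term.
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String term : new HashSet<>(tokenize(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Term frequency: number of occurrences within the title itself.
        Map<String, Integer> tf = new HashMap<>();
        for (String term : tokenize(title)) {
            tf.merge(term, 1, Integer::sum);
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int n = df.getOrDefault(e.getKey(), 0);
            // Add-one smoothing (an assumption of this sketch) so that terms
            // missing from the corpus do not cause a division by zero.
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + n));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "the cat sat on the mat",
                "the dog chased the ball",
                "the weather is nice today");
        // Print the title's terms, highest rating first.
        score("The cat and the quantum computer", corpus).entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```

With this toy corpus, "the" appears in every document and gets a rating of zero, while words the corpus has never seen get the highest rating. Note that a title is short, so tf is almost always 1 and the idf part does most of the work.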

OTHER TIPS

Topic models try to do this for documents (or collections of documents). I doubt you can do much with individual sentences.

You could try using a semantic parser (e.g. RelEx) to get the main subject, object, etc., but it's still not that straightforward.

Some examples of what you are trying to do would help. "define the main idea" is still pretty vague - what type of sentences are you dealing with?

Considering you are working exclusively with titles, I would imagine pretty much any word that is not a stop word is important.

Perhaps you are just looking for a basic stop word removal algorithm, rather than a full-blown text analysis algorithm?

It just depends on how complex or "smart" you need this thing to be.
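
At the simple end of that range, stop word removal could look something like the sketch below. The class name `StopWordFilter` and the tiny stop word list are made up for illustration; a real list would be much longer and language-specific:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class StopWordFilter {

    // A deliberately tiny stop word list; an assumption of this sketch.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "in", "on", "to", "for", "is", "with", "by", "at"));

    // Return the title's words in order, lowercased, with stop words dropped.
    static List<String> importantWords(String title) {
        List<String> kept = new ArrayList<>();
        for (String token : title.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [art, computer, programming]
        System.out.println(importantWords("The Art of Computer Programming"));
    }
}
```

This keeps word order and duplicates; if you only need the distinct words, collecting into a `LinkedHashSet` instead of a list would do it.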

Licensed under: CC-BY-SA with attribution