Term extraction: Generating tags from text

https://stackoverflow.com/questions/1100549

11-09-2019
Question

How can I get the same results as the Yahoo Term Extraction service (http://developer.yahoo.com/search/content/V1/termExtraction.html)?

This question has been asked quite a few times before.

While trying to solve this problem with existing tools, I came across the "Text Analysis" that Solr performs on a document before indexing it, as described in http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. This analysis includes stemming as well.

So the final index will consist mostly of terms used to describe the document.

Is there a solution that provides analyzers, tokenizers, and token filters for direct use? If Solr is the way to go, what is the best way to get this data out of Solr's index?

Solution

Solr is a tool for building a custom search engine, so it does not seem to be the right tool for this job. The "External links" section of the Wikipedia article on term extraction lists several web applications for term extraction. OpenNLP also has a list of tools that may be useful; its Chunker in particular may help.
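To make the idea of term extraction concrete, here is a minimal Python sketch of a naive baseline: tokenize, drop stop words, and rank what is left by frequency. This is not the algorithm behind the Yahoo service or the OpenNLP Chunker. The function names and the tiny stop-word list are assumptions made up for illustration, and a real extractor would likely add stemming and phrase detection.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "of", "to", "in", "is", "it",
    "on", "for", "with", "as", "by", "that", "this",
})


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def extract_terms(text, limit=5):
    """Return up to `limit` (term, count) pairs, most frequent first."""
    tokens = [
        token for token in tokenize(text)
        if token not in STOP_WORDS and len(token) > 2  # skip very short tokens
    ]
    return Counter(tokens).most_common(limit)
```

Calling `extract_terms(document_text, limit=10)` would give candidate tags for a document; whether raw frequency is good enough depends on the corpus.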

OTHER TIPS

Just ask Solr for the parsed terms, e.g.:

http://localhost:8983/solr/terms?terms.fl=text&terms.sort=count&terms.limit=-1

See TermsComponent for more info.
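The request above can also be built and parsed from a script. The Python sketch below assumes that the `/terms` handler is enabled, that the response is requested as JSON (`wt=json`), and that the terms come back in Solr's default flat layout of alternating term and count values. The helper names are hypothetical, and depending on the Solr version the base URL may also need a core or collection name.

```python
import json
from urllib.parse import urlencode


def build_terms_url(base_url, field, limit=-1):
    """Build a TermsComponent query URL for one field, sorted by count."""
    params = {
        "terms.fl": field,
        "terms.sort": "count",
        "terms.limit": limit,  # -1 asks for all terms
        "wt": "json",          # assumption: JSON response format
    }
    return f"{base_url}/terms?{urlencode(params)}"


def parse_terms(payload, field):
    """Turn the assumed flat [term, count, term, count, ...] list into pairs."""
    flat = json.loads(payload)["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))
```

The URL can be fetched with any HTTP client (for example `urllib.request.urlopen`), and the response body passed to `parse_terms`. Note that the terms returned are the indexed tokens after analysis, so stemmed forms may appear instead of the original words.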

Licensed under: CC-BY-SA with attribution