Question

I am looking for a simple way to grab a list of the 5-10 most important terms that describe a particular document. It could even be based on a specific field, say the item description.

I thought this should be rather easy. Solr already scores each term by its number of occurrences in the document relative to its occurrences across all documents (tf-idf).

Yet I couldn't find a way to pass a document ID to Solr and get back the list of terms that I want.


Solution

If you just need the top terms from a document, you can use the Term Vector Component, assuming that your field is defined with termVectors="true". You can ask for tv.tf_idf and take the top n terms with the highest scores.
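As a rough illustration of that approach, the sketch below builds a Term Vector Component request and picks the top n terms on the client side. It assumes a request handler registered at /tvrh (the name used in Solr's example configuration; yours may differ), a unique key field called id, and Java 10 or later. The class and method names are made up for this example, and parsing the Solr response into a term-to-score map is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr or SolrJ.
final class TopTermsSketch {

  // Builds a Term Vector Component request for a single document.
  // Assumption: a handler with the TermVectorComponent is exposed at /tvrh.
  static String termVectorUrl(String baseUrl, String idField, String docId, String field) {
    String query = URLEncoder.encode(idField + ":\"" + docId + "\"", StandardCharsets.UTF_8);
    return baseUrl + "/tvrh?q=" + query
        + "&tv=true&tv.tf_idf=true"
        + "&tv.fl=" + URLEncoder.encode(field, StandardCharsets.UTF_8)
        + "&fl=" + URLEncoder.encode(idField, StandardCharsets.UTF_8)
        + "&wt=json";
  }

  // Returns the n terms with the highest tf-idf score; ties are broken alphabetically.
  static List<String> topTerms(Map<String, Double> tfIdfByTerm, int n) {
    return tfIdfByTerm.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Double>comparingByKey()))
        .limit(n)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Placeholder core URL and document ID.
    System.out.println(termVectorUrl("http://localhost:8983/solr/items", "id", "42", "description"));

    // Stand-in scores; in practice these come from the tf-idf values in the response.
    Map<String, Double> scores = new HashMap<>();
    scores.put("solr", 0.9);
    scores.put("the", 0.01);
    scores.put("vector", 0.5);
    System.out.println(topTerms(scores, 2));
  }
}
```

Sorting on the client is needed because the component returns the terms of the field without ranking them by score.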

OTHER TIPS

You might be looking for the MoreLikeThis component, specifically with the mlt.interestingTerms parameter set to list or details.
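A request along those lines might look like the sketch below. It assumes a MoreLikeThisHandler registered at /mlt and a unique key field called id, neither of which is stated above, and it needs Java 10 or later. The helper name is made up. mlt.mintf and mlt.mindf are lowered to 1 so that rare terms are not filtered out, and mlt.boost=true is set so that the details output carries meaningful weights.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or SolrJ.
final class InterestingTermsSketch {

  // Builds a MoreLikeThis request that returns the "interesting terms" of one document.
  // Assumption: a MoreLikeThisHandler is exposed at /mlt in solrconfig.xml.
  static String interestingTermsUrl(String baseUrl, String idField, String docId, String field) {
    String query = URLEncoder.encode(idField + ":\"" + docId + "\"", StandardCharsets.UTF_8);
    return baseUrl + "/mlt?q=" + query
        + "&mlt.fl=" + URLEncoder.encode(field, StandardCharsets.UTF_8)
        + "&mlt.interestingTerms=details" // "list" returns the terms without weights
        + "&mlt.boost=true"
        + "&mlt.mintf=1&mlt.mindf=1"
        + "&rows=0"                       // only the terms are wanted, not the similar documents
        + "&wt=json";
  }

  public static void main(String[] args) {
    // Placeholder core URL and document ID.
    System.out.println(interestingTermsUrl("http://localhost:8983/solr/items", "id", "42", "description"));
  }
}
```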

I think you might want to go after certain types of words; nouns are typically used for this. I once did something like this for a clustering routine, using the OpenNLP part-of-speech tagger and chunker to extract all noun phrases and then simply putting each one in a HashMap. The code below uses sentence chunking, but adapting it to plain part-of-speech tags should be trivial (let me know if you need help). It tokenizes the sentence, tags each token with its part of speech, chunks the tagged tokens, loops over the chunks to pick out the noun phrases, and adds them to a term-frequency HashMap. Really simple. You can skip the OpenNLP part entirely, but then you will need to do a lot of noise removal yourself. Anyway, have a look.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 *
 * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
 * the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

  static final int N = 2;

  public static void main(String[] args) {

    try {
      HashMap<String, Integer> termFrequencies = new HashMap<>();
      String modelPath = "c:\\temp\\opennlpmodels\\";
      TokenizerModel tm = new TokenizerModel(new FileInputStream(new File(modelPath + "en-token.zip")));
      TokenizerME wordBreaker = new TokenizerME(tm);
      POSModel pm = new POSModel(new FileInputStream(new File(modelPath + "en-pos-maxent.zip")));
      POSTaggerME posme = new POSTaggerME(pm);
      InputStream modelIn = new FileInputStream(modelPath + "en-chunker.zip");
      ChunkerModel chunkerModel = new ChunkerModel(modelIn);
      ChunkerME chunkerME = new ChunkerME(chunkerModel);
      //this is your sentence
      String sentence = "Barack Hussein Obama II  is the 44th awesome President of the United States, and the first African American to hold the office.";
      //words is the tokenized sentence
      String[] words = wordBreaker.tokenize(sentence);
      //posTags are the parts of speech of every word in the sentence (The chunker needs this info of course)
      String[] posTags = posme.tag(words);
      //chunks are spans (start and end indices into the words array) marking each chunk
      Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
      //chunkStrings are the actual chunks
      String[] chunkStrings = Span.spansToStrings(chunks, words);
      for (int i = 0; i < chunks.length; i++) {
        String np = chunkStrings[i];
        if (chunks[i].getType().equals("NP")) {
          //increment the count for this noun phrase, starting at 1 if it is new
          termFrequencies.merge(np, 1, Integer::sum);
        }
      }
      System.out.println(termFrequencies);

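    } catch (IOException e) {
      //don't swallow the error silently; a missing model file would otherwise go unnoticed
      System.err.println("Could not load OpenNLP models: " + e.getMessage());
    }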
  }

}
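
The plain part-of-speech adaptation mentioned above could look roughly like this. It is a sketch that takes the words and posTags arrays produced in the code above and counts tokens whose tag starts with "NN", which assumes the tagger model emits Penn Treebank tags (NN, NNS, NNP, NNPS). The class name is made up, and lower-casing the terms is an extra choice that the original code does not make.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; it has no OpenNLP dependency itself.
final class NounCounter {

  // words and posTags are parallel arrays, e.g. the results of
  // wordBreaker.tokenize(sentence) and posme.tag(words) in the code above.
  static Map<String, Integer> countNouns(String[] words, String[] posTags) {
    Map<String, Integer> termFrequencies = new HashMap<>();
    for (int i = 0; i < words.length; i++) {
      // Assumption: Penn Treebank tag set, where every noun tag starts with "NN".
      if (posTags[i].startsWith("NN")) {
        termFrequencies.merge(words[i].toLowerCase(Locale.ROOT), 1, Integer::sum);
      }
    }
    return termFrequencies;
  }

  public static void main(String[] args) {
    // Hand-tagged sample so the sketch runs without any model files.
    String[] words = {"The", "president", "met", "the", "President", "of", "France"};
    String[] posTags = {"DT", "NN", "VBD", "DT", "NNP", "IN", "NNP"};
    System.out.println(countNouns(words, posTags));
  }
}
```

Counting single nouns instead of whole noun phrases gives shorter, more repeatable terms, at the cost of splitting names such as "United States" into separate words.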
Licensed under: CC-BY-SA with attribution