Question

I would like to extract frequently occurring phrases with Lucene. I am indexing text from TXT files, and I am losing a lot of context because phrases are not preserved: for example, "information retrieval" is indexed as two separate words.

How can I get phrases like this? I cannot find anything useful on the internet. Any advice, links, hints, and especially examples are appreciated!

EDIT: I store my documents just by title and content:

 Document doc = new Document();
 doc.add(new Field("name", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
 doc.add(new Field("text", fReader, Field.TermVector.WITH_POSITIONS_OFFSETS));

because for what I am doing the most important thing is the content of the file. Titles are too often not descriptive at all (e.g., I have many PDF academic papers whose titles are codes or numbers).

I desperately need to index the most frequent phrases from the text contents. Only now do I see how poorly this simple "bag of words" approach works.


Solution

Julia, it seems what you are looking for is n-grams, specifically bigrams (pairs of adjacent words; pairs that co-occur unusually often are called collocations).

Here's a chapter about finding collocations (PDF) from Manning and Schütze's Foundations of Statistical Natural Language Processing.

To do this with Lucene, I suggest using Solr with ShingleFilterFactory, which emits runs of adjacent words ("shingles") as single tokens. Please see this discussion for details.
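
The bigram idea can be sketched without Lucene or Solr. The plain-Java example below counts adjacent word pairs in a text and returns the most frequent ones. The class name BigramCounter, its method names, and the tokenization (lowercasing, splitting on anything that is not a letter or digit) are assumptions made for illustration. It is not how ShingleFilterFactory is implemented, and it does not handle sentence boundaries or stop words.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene or Solr.
class BigramCounter {

    // Lowercase the text and split on any run of characters that are not letters or digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count every pair of adjacent tokens, keyed as "first second".
    static Map<String, Integer> countBigrams(String text) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    // Return up to n bigrams, most frequent first (ties keep first-seen order).
    static List<Map.Entry<String, Integer>> topBigrams(String text, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(countBigrams(text).entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```

For example, BigramCounter.countBigrams("Information retrieval is fun. Information retrieval is hard.") maps "information retrieval" to 2. Note that it also counts "fun information", because this sketch ignores sentence boundaries.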

OTHER TIPS

Is it possible for you to post any code that you have written?

Basically, a lot depends on how you create your fields and store documents in Lucene.

Let's consider a case with two fields, ID and Comments. In the ID field I allow values such as 'finding nemo', i.e. strings containing spaces, whereas Comments is a free-text field: I allow anything my keyboard can type and Lucene can understand.

In a real-life scenario it does not make sense to split ID:'finding nemo' into two separately searchable strings, whereas I do want to index every word in Comments.

So I will create a document (org.apache.lucene.document.Document) object to take care of this, something like this:

Document doc = new Document();
doc.add(new Field("comments","Finding nemo was a very tough job for a clown fish ...", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED));

So, essentially I have created two fields:

  1. comments: analyzed into separate terms, by using Field.Index.ANALYZED
  2. id: stored and indexed as a single term, without analysis, by using Field.Index.NOT_ANALYZED

This is how you control per-field analysis in Lucene with the default tokenizer and analyzer. Otherwise you can write your own tokenizer and analyzers.

Link(s) http://darksleep.com/lucene/

Hope this will help you... :)

Well, the problem of losing the context for phrases can be solved by using PhraseQuery.

An index by default contains positional information of terms, as long as you did not create pure Boolean fields by indexing with the omitTermFreqAndPositions option. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.

For example, suppose a field contained the phrase “the quick brown fox jumped over the lazy dog”. Without knowing the exact phrase, you can still find this document by searching for documents with fields having quick and fox near each other. Sure, a plain TermQuery would do the trick to locate this document knowing either of those words, but in this case we only want documents that have phrases where the words are either exactly side by side (quick fox) or have one word in between (quick [irrelevant] fox). The maximum allowable positional distance between terms to be considered a match is called slop. Distance is the number of positional moves of terms to reconstruct the phrase in order.
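
The meaning of slop in the "quick fox" example can be illustrated with a small plain-Java sketch. It is a deliberate simplification, not Lucene's PhraseQuery: it handles only two terms, only in the given order, and treats slop as the number of other tokens allowed between them. The class name SlopSketch and the method matchesInOrder are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of slop for two in-order terms; not Lucene's algorithm.
class SlopSketch {

    // True if `first` is followed by `second` with at most `slop` other tokens in between.
    static boolean matchesInOrder(String text, String first, String second, int slop) {
        List<String> tokens = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+"));
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of tokens sitting between positions i and j.
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the text "the quick brown fox jumped over the lazy dog", matchesInOrder(text, "quick", "fox", 0) is false because "brown" sits between the two words, while a slop of 1 makes it true. Lucene's real PhraseQuery can also match terms out of order at the cost of extra slop, which this sketch does not attempt.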

Check out Lucene's JavaDoc for PhraseQuery

See this example code which demonstrates how to work with various Query Objects:

You can also try to combine various query types with the help of the BooleanQuery class.

And regarding the frequency of phrases, I suppose Lucene's scoring considers the frequency of the terms occurring in the documents.

Licensed under: CC-BY-SA with attribution