Question

Consider the following assumptions:

  1. I have a Java 5.0 web application for which I'm considering using Lucene 3.0 for full-text search
  2. There will be more than 1000K Lucene documents, each with about 100 words on average
  3. New documents must be searchable immediately after they are created (real-time search)
  4. Lucene documents have a frequently updated integer field named quality

Where can I find code examples (simple but as complete as possible) of near-real-time search with Lucene 3.0?

Is it possible to obtain query results sorted by a document field (quality) that may be updated frequently for already indexed documents? Will updating that field trigger a rebuild of the Lucene index? If so, how does such a rebuild perform, and how can it be done efficiently? I need examples or documentation of a complete solution.

If, however, index rebuilding is not needed in this case, how can search results be sorted efficiently? Some queries may return a lot of documents (>50K), so I consider it inefficient to fetch them unsorted from Lucene, sort them by the quality field, and then split the sorted list into pages for pagination.

Is Lucene 3.0 my best choice within Java, or should I consider other frameworks or solutions? Maybe the full-text search provided by the database server itself (I'm using PostgreSQL 8.3)?


Solution

The Lucene API is capable of everything you're asking, but it won't be easy. It's a fairly low-level API, and making it do complicated things is quite an exercise in itself.

I can highly recommend Compass, which is a search/indexing framework built on top of Lucene. As well as a much friendlier API, it provides functionality such as object/XML/JSON mapping to Lucene indexes, as well as fully transactional behaviour. It should have no trouble with your requirements, such as realtime sorting of transactionally-updated documents.

Compass 2.2.0 is built upon Lucene 2.4.1, but a Lucene 3.0-based version is in the works. It's sufficiently abstracted from the Lucene API that the transition should be seamless, though.

OTHER TIPS

Near-real-time search has been available in Lucene since version 2.9. Lucid Imagination published an article about this capability (written before the 2.9 release). The basic idea is that you can now get an IndexReader from an IndexWriter. If you refresh this IndexReader at a regular interval, you get the most up-to-date changes from the IndexWriter.
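To make this concrete, here is a minimal sketch against the Lucene 3.0 API as I understand it; treat it as an illustration rather than a tested production setup. The field names (id, body, quality), the class name and the in-memory directory are my own assumptions. It indexes two documents, gets a near-real-time reader with IndexWriter.getReader(), and asks Lucene to return hits already sorted by quality. It then changes one document's quality with updateDocument(), which deletes and re-adds that single document instead of rebuilding the index, and fetches a fresh reader.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

final class NrtSortSketch {

    // Field names "id", "body" and "quality" are illustrative assumptions.
    static Document doc(String id, String body, int quality) {
        Document d = new Document();
        d.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        d.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        d.add(new NumericField("quality", Field.Store.YES, true).setIntValue(quality));
        return d;
    }

    // Returns the ids of the top n hits for body:lucene, highest quality first.
    // Lucene keeps only the best n hits while collecting, so the caller never
    // has to load and sort the full result set.
    static String[] topIds(IndexReader reader, int n) throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        Sort byQuality = new Sort(new SortField("quality", SortField.INT, true));
        TopDocs hits = searcher.search(
                new TermQuery(new Term("body", "lucene")), null, n, byQuality);
        String[] ids = new String[hits.scoreDocs.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = searcher.doc(hits.scoreDocs[i].doc).get("id");
        }
        return ids;
    }

    // Returns the sorted ids before and after a quality update.
    static String[][] demo() throws IOException {
        IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new StandardAnalyzer(Version.LUCENE_30), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addDocument(doc("a", "lucene search", 3));
        writer.addDocument(doc("b", "lucene index", 7));

        // Near-real-time reader: sees the documents above without a commit.
        IndexReader reader = writer.getReader();
        String[] before = topIds(reader, 10);

        // Changing quality means replacing the one document, not the index.
        writer.updateDocument(new Term("id", "a"), doc("a", "lucene search", 9));

        // The old reader is a point-in-time view; ask the writer for a new one.
        IndexReader fresh = writer.getReader();
        reader.close();
        String[] after = topIds(fresh, 10);

        fresh.close();
        writer.close();
        return new String[][] { before, after };
    }
}
```

For pagination, one common approach is to request the first page * pageSize hits and display only the last pageSize of them. Sorting on a field loads that field's values into Lucene's FieldCache for each reader, so the first sorted search after a refresh may be slower than later ones.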

Update: I haven't seen any code, but here is the broad idea.

All new documents will be written to an IndexWriter, preferably one created on a RAMDirectory, which will not be closed frequently. (To persist this in-memory index, you may have to flush it to disk occasionally.)

You will have some indexes on disk, with an individual IndexReader created on each of them. A MultiReader and a Searcher can be created on top of these readers. One of the readers will come from the in-memory index.

At a regular interval (say, every few seconds), you remove the current in-memory reader from the MultiReader, get a new reader from the IndexWriter, and construct a new MultiReader/Searcher with the new set of readers.
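The swap step can be sketched independently of Lucene. The class below is a hypothetical stand-in, not Lucene code: the pending list plays the role of the IndexWriter, and the published list plays the role of the current MultiReader/Searcher. Searches always run against the last published snapshot, and refresh() replaces it atomically, so a search that is already running keeps its old view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class SnapshotSwapSketch {

    // Stand-in for the IndexWriter: receives every new document.
    private final List<String> pending = new ArrayList<String>();

    // Stand-in for the current MultiReader/Searcher: an immutable view.
    private final AtomicReference<List<String>> current =
            new AtomicReference<List<String>>(Collections.<String>emptyList());

    synchronized void add(String doc) {
        pending.add(doc);
    }

    // Meant to be called at a regular interval, for example from a
    // ScheduledExecutorService. Publishes a new point-in-time view.
    synchronized void refresh() {
        List<String> snapshot = new ArrayList<String>(pending);
        current.set(Collections.unmodifiableList(snapshot));
    }

    // Searches use whatever view was published last.
    List<String> searchView() {
        return current.get();
    }
}
```

With real Lucene readers there is one extra concern that this sketch does not model: the replaced reader holds resources and should be closed, but only after searches still using it have finished.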

According to the article from Lucid Imagination (linked above), they tried writing 50 documents per second without a heavy slowdown.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow