Question

I'm writing an application that will manipulate more than 100 GB of text documents. Each document is 2 KB–100 KB in size.

At first I planned to use a DBMS such as MySQL or Firebird to store the raw documents, while keeping the search index in Lucene. This approach has some disadvantages. For example, database transactions know nothing about the Lucene index and vice versa, so I would need to synchronize them myself.

Then I realized that Lucene can store entire documents in its index. I would need to create regular backups of the index, but that is easy: I can simply copy the whole index directory. In effect, I would be using Lucene as a kind of NoSQL store and could avoid a DBMS entirely.

What is the best practice: store the original documents in the index or not? I really don't want to use a DBMS for this purpose. Is that feasible?

Solution

You would not want to store the raw documents in a Lucene index, especially at the size you are talking about. I have done this a couple of ways, but both ONLY store the indexed fields in the Lucene index, along with an ID/pointer to the raw document. I have dealt with indexes of well over 100 million records, and they work fine on a single server.
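
For illustration, here is a minimal sketch of that approach using Lucene's Java API (assuming Lucene 8.x; the index path, field names, and `DocumentIndexer` class are my own inventions): the searchable fields are indexed, but the only thing stored alongside them is the external ID that points back to the raw document.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class DocumentIndexer {
    public static void indexDocument(IndexWriter writer,
                                     String docId,
                                     String title,
                                     String body) throws Exception {
        Document doc = new Document();
        // Store the external key so search hits can point back to the
        // raw document in whatever store holds it.
        doc.add(new StringField("docId", docId, Field.Store.YES));
        // Searchable fields: the full body is indexed for querying
        // but NOT stored, which keeps the index small.
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("lucene-index")), config)) {
            indexDocument(writer, "doc-42", "Example", "full raw text ...");
        }
    }
}
```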

This matters because the index builds dramatically faster, and is far more manageable, when you don't have to store an additional 100 GB of data in it.

Basically, you index all the fields you need to satisfy search queries. If a user clicks an item in a grid, I assume you want to show the raw text (the typical UI pattern is that you access the Lucene fields constantly but RARELY need to pull down the full binary text file).
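
To make that access pattern concrete, here is a sketch of the search side (same assumptions as the indexing sketch above): the grid is populated from the small stored fields, and the raw text is fetched from the external store by ID only when the user clicks a row.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DocumentSearcher {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("lucene-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("body", new StandardAnalyzer());
            TopDocs hits = searcher.search(parser.parse("example"), 20);

            for (ScoreDoc hit : hits.scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                // Populate the grid from stored fields only; the raw
                // text stays in the external store until the user asks.
                System.out.println(doc.get("docId") + " : " + doc.get("title"));
            }
            // Only when the user clicks a row would you do something like:
            // String raw = rawStore.fetch(docId);  // hypothetical external lookup
        }
    }
}
```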

The raw-document stores I have used in conjunction with Lucene are:

  • SQL Server FILESTREAM, which is optimized for large binary file storage. It is really fast, too. I'm not sure whether MySQL has an equivalent (I've never worked with it). A keyed-lookup sketch follows this list.
  • Azure Table Storage, a key-value NoSQL cloud database, which I used to store the binary blobs.
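
As a rough illustration of the keyed lookup, here is a plain JDBC sketch (the `Documents` table, its columns, and the `RawDocumentStore` class are hypothetical; with FILESTREAM, `content` would be declared as a `VARBINARY(MAX) FILESTREAM` column):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RawDocumentStore {
    // Hypothetical schema:
    //   Documents(docId VARCHAR(64) PRIMARY KEY,
    //             content VARBINARY(MAX) FILESTREAM)
    public static byte[] fetch(String jdbcUrl, String docId) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT content FROM Documents WHERE docId = ?")) {
            // The key comes straight from Lucene's stored docId field.
            ps.setString(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes("content") : null;
            }
        }
    }
}
```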

It really doesn't matter what the persistent store is, as long as it is optimized for larger binary files that can be accessed/streamed quickly by key. You could even use an in-memory cache like Redis, as long as Lucene holds the ID pointer needed to access the binary text file.
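
If you do put Redis in front of the blob store, a read-through cache keyed by the same docId is enough. A minimal sketch, assuming the Jedis client and reusing the hypothetical RawDocumentStore from above:

```java
import java.nio.charset.StandardCharsets;

import redis.clients.jedis.Jedis;

public class CachedRawTextReader {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getRawText(String docId) throws Exception {
        String cached = jedis.get("doc:" + docId);
        if (cached != null) {
            return cached;  // cache hit: skip the blob store entirely
        }
        // Cache miss: fall back to the keyed blob lookup
        // (RawDocumentStore is the hypothetical helper sketched earlier).
        byte[] blob = RawDocumentStore.fetch("jdbc:sqlserver://...", docId);
        String text = new String(blob, StandardCharsets.UTF_8);
        jedis.setex("doc:" + docId, 3600, text);  // cache for an hour
        return text;
    }
}
```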

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow