Question

I'm writing an application that will manipulate more than 100 GB of text documents. Each document is 2 KB–100 KB in size.

At first I planned to use a DBMS such as MySQL or Firebird to store the raw documents, while keeping the search index in Lucene. This approach has some disadvantages. For example, database transactions know nothing about the Lucene index and vice versa, so I would need to synchronize them myself.

Then I realized that Lucene can store entire documents in its index. I would need to create regular backups of the index, but that is easy: I can simply copy the whole index directory. In effect, I would be using Lucene as a kind of NoSQL store and could avoid a DBMS entirely.

What is the best practice: store the original documents in the index or not? I really don't want to use a DBMS for this purpose. Is that feasible?

Solution

You would not want to store the raw documents in a Lucene index, especially at the size you are talking about. I have done this a couple of ways, but both ONLY store the indexed fields in the Lucene index, along with an ID/pointer to the raw document. I have dealt with indexes of well over 100 million records, and they work fine on a single server.
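
For illustration, here is a minimal sketch of that approach using Lucene's Java API (assuming Lucene 8.x; the index path, field names, and `DocumentIndexer` class are my own inventions): the searchable fields are indexed, but the only thing stored alongside them is the external ID that points back to the raw document.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class DocumentIndexer {
    public static void indexDocument(IndexWriter writer,
                                     String docId,
                                     String title,
                                     String body) throws Exception {
        Document doc = new Document();
        // Store the external key so search hits can point back to the
        // raw document in whatever store holds it.
        doc.add(new StringField("docId", docId, Field.Store.YES));
        // Searchable fields: the full body is indexed for querying
        // but NOT stored, which keeps the index small.
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("lucene-index")), config)) {
            indexDocument(writer, "doc-42", "Example", "full raw text ...");
        }
    }
}
```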

This matters because the index builds dramatically faster, and is far more manageable, when you don't have to store an additional 100 GB of data in it.

Basically, you index all the fields you need to satisfy search queries. If a user clicks an item in a grid, I assume you want to show the raw text (the typical UI pattern is that you access the Lucene fields constantly but RARELY need to pull down the full binary text file).
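
To make that access pattern concrete, here is a sketch of the search side (same assumptions as the indexing sketch above): the grid is populated from the small stored fields, and the raw text is fetched from the external store by ID only when the user clicks a row.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class DocumentSearcher {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("lucene-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("body", new StandardAnalyzer());
            TopDocs hits = searcher.search(parser.parse("example"), 20);

            for (ScoreDoc hit : hits.scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                // Populate the grid from stored fields only; the raw
                // text stays in the external store until the user asks.
                System.out.println(doc.get("docId") + " : " + doc.get("title"));
            }
            // Only when the user clicks a row would you do something like:
            // String raw = rawStore.fetch(docId);  // hypothetical external lookup
        }
    }
}
```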

The raw-document stores I have used in conjunction with Lucene are:

  • SQL Server FILESTREAM, which is optimized for large binary file storage. It is really fast, too. I'm not sure whether MySQL has an equivalent (I've never worked with it). A keyed-lookup sketch follows this list.
  • Azure Table Storage, a key-value NoSQL cloud database, which I used to store the binary blobs.
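
As a rough illustration of the keyed lookup, here is a plain JDBC sketch (the `Documents` table, its columns, and the `RawDocumentStore` class are hypothetical; with FILESTREAM, `content` would be declared as a `VARBINARY(MAX) FILESTREAM` column):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RawDocumentStore {
    // Hypothetical schema:
    //   Documents(docId VARCHAR(64) PRIMARY KEY,
    //             content VARBINARY(MAX) FILESTREAM)
    public static byte[] fetch(String jdbcUrl, String docId) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT content FROM Documents WHERE docId = ?")) {
            // The key comes straight from Lucene's stored docId field.
            ps.setString(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes("content") : null;
            }
        }
    }
}
```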

It really doesn't matter what the persistent store is, as long as it is optimized for larger binary files that can be accessed/streamed quickly by key. You could even use an in-memory cache like Redis, as long as Lucene holds the ID pointer needed to access the binary text file.
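
If you do put Redis in front of the blob store, a read-through cache keyed by the same docId is enough. A minimal sketch, assuming the Jedis client and reusing the hypothetical RawDocumentStore from above:

```java
import java.nio.charset.StandardCharsets;

import redis.clients.jedis.Jedis;

public class CachedRawTextReader {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String getRawText(String docId) throws Exception {
        String cached = jedis.get("doc:" + docId);
        if (cached != null) {
            return cached;  // cache hit: skip the blob store entirely
        }
        // Cache miss: fall back to the keyed blob lookup
        // (RawDocumentStore is the hypothetical helper sketched earlier).
        byte[] blob = RawDocumentStore.fetch("jdbc:sqlserver://...", docId);
        String text = new String(blob, StandardCharsets.UTF_8);
        jedis.setex("doc:" + docId, 3600, text);  // cache for an hour
        return text;
    }
}
```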

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow