Indexing data inside a blob using Lucene.NET and C#

Question

It is not very clear what you mean by "I don't want to download same file again from blob and then index it, instead i will prefer download and keep it in main memory and directly use index filter". What is that "main memory": Azure Blob storage, or the local instance memory?

The issue you are facing, however, cannot be worked around, because of the nature of the IFilter interface. If you look a bit deeper into the source you are using from here, you will discover that under the covers it uses the IPersistFile COM interface. Unfortunately, this interface only works with local files and does not accept streams.

What I would have suggested is to open a Stream on the blob and pass it to the reader instead of a physical path. However, as already said, IFilter uses COM interfaces that work only with physical paths. So with your current approach there is no way to skip downloading the blob to a local file.
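Since the filter needs a physical path, the usual bridge is to copy the blob's stream to a local file, run the path-based extraction, and clean up afterwards. Below is a minimal sketch of that idea, not your actual code: `LocalFileBridge`, `ExtractViaTempFile`, and the `extractFromPath` delegate are hypothetical names, and the delegate stands in for whatever IFilter-based reader you call today. It assumes the filter is chosen by file extension, so the extension is preserved on the temporary file.

```csharp
using System;
using System.IO;

// Hypothetical helper: bridges a stream-based source (e.g. a blob)
// to a path-based text extractor (e.g. an IFilter wrapper).
public static class LocalFileBridge
{
    // Copies the stream to a temporary file, runs the extractor on the
    // file path, and deletes the file even if extraction throws.
    public static string ExtractViaTempFile(
        Stream content,
        string extension,
        Func<string, string> extractFromPath)
    {
        // Assumption: the extractor picks its filter by extension,
        // so the temporary file keeps the original extension.
        string path = Path.Combine(
            Path.GetTempPath(),
            Guid.NewGuid().ToString("N") + extension);
        try
        {
            using (var file = File.Create(path))
            {
                content.CopyTo(file);
            }
            return extractFromPath(path);
        }
        finally
        {
            if (File.Exists(path))
            {
                File.Delete(path);
            }
        }
    }
}
```

On a role instance you would probably write to a configured local storage resource rather than the system temp directory, but the shape of the code stays the same.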

There is nothing scary about downloading blobs locally. If the storage account is in the same affinity group as the compute, the download will be extremely fast and the traffic will be free. Given you use a small instance size, you will have 165GB of local storage, which is plenty. You can optimize your process a bit by keeping track of what has been indexed and what has not. You can use Azure Table storage for that: it is another extremely fast and cheap storage solution, and it is perfect for storing key-value pairs such as file name and etag. Then, when you enumerate the blobs, first fetch the etag for a blob and check against the table whether that version is already indexed. Download the blob only if it is not indexed, then add a new record to the table to mark the file as indexed.
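The bookkeeping described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a finished implementation: `BlobInfo`, `IIndexLog`, `InMemoryIndexLog`, and `IncrementalIndexer` are hypothetical names, the in-memory dictionary stands in for the Azure Table, and `downloadAndIndex` is a placeholder for your existing download plus Lucene.NET indexing step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical: name and etag of a blob, as obtained from a blob listing.
public sealed class BlobInfo
{
    public BlobInfo(string name, string etag) { Name = name; ETag = etag; }
    public string Name { get; }
    public string ETag { get; }
}

// Hypothetical abstraction over the tracking table (file name -> etag).
public interface IIndexLog
{
    bool TryGetETag(string blobName, out string etag);
    void SetETag(string blobName, string etag);
}

// In-memory stand-in; a real version would read and write Azure Table storage.
public sealed class InMemoryIndexLog : IIndexLog
{
    private readonly Dictionary<string, string> _etags =
        new Dictionary<string, string>();

    public bool TryGetETag(string blobName, out string etag)
    {
        return _etags.TryGetValue(blobName, out etag);
    }

    public void SetETag(string blobName, string etag)
    {
        _etags[blobName] = etag;
    }
}

public static class IncrementalIndexer
{
    // Indexes only blobs whose etag is missing from the log or has changed.
    // Returns the number of blobs that were downloaded and indexed.
    public static int Run(
        IEnumerable<BlobInfo> blobs,
        IIndexLog log,
        Action<BlobInfo> downloadAndIndex)
    {
        int indexed = 0;
        foreach (var blob in blobs)
        {
            string known;
            if (log.TryGetETag(blob.Name, out known) && known == blob.ETag)
            {
                continue; // same version already indexed, skip the download
            }

            downloadAndIndex(blob);            // download, then index
            log.SetETag(blob.Name, blob.ETag); // record only after success
            indexed++;
        }
        return indexed;
    }
}
```

Recording the etag only after indexing succeeds means a failed run simply retries that blob next time. Comparing etags, rather than just checking that the name exists, also re-indexes a blob whose content has changed.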

Or... don't use IFilter at all. I don't see any benefit of using IFilter on Azure. IFilters are only registered when the application that provides them is installed. For instance, if you want to process Office documents with IFilter, you have to install Microsoft Office on the VM (which you currently can't do, even if you have a license, because of license mobility restrictions for MS Office). If you want the IFilter for PDF, you have to install Adobe Acrobat Reader (which you can do via a startup task). And so on: some applications you can install, some you can't. Your Windows Azure VM instance is plain Windows with no IFilters at all. Imagine a basic installation of Windows Server 2008 R2 with no roles and no features added; that is your instance.