Question

I am using Lucene.Net with a custom crawler and IFilter so that I can index the data stored inside blobs.

// Crawler: downloads each blob to local storage on the Azure instance,
// indexes it, then deletes the local copy.
foreach (var item in containerList)
{
    CloudBlobContainer container = BlobClient.GetContainerReference(item.Name);
    if (container.Name != "indexes") // skip the "indexes" container
    {
        IEnumerable<IListBlobItem> blobs = container.ListBlobs();
        foreach (CloudBlob blob in blobs)
        {
            blob.DownloadToFile(path + blob.Name);
            indexer.IndexBlobData(path, blob);
            System.IO.File.Delete(path + blob.Name);
        }
    }
}

The code below is the indexer function, which uses IFilter:

public bool IndexBlobData(string path, CloudBlob blob)
{
    Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
    try
    {
        // FilterReader needs a physical file path; 'using' closes it even if indexing throws.
        using (TextReader reader = new FilterReader(path + blob.Name))
        {
            doc.Add(new Lucene.Net.Documents.Field("url", blob.Uri.ToString(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
            doc.Add(new Lucene.Net.Documents.Field("content", reader.ReadToEnd(), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED));
            indexWriter.AddDocument(doc);
        }
        return true;
    }
    catch (Exception)
    {
        // Text extraction or indexing failed; report the failure to the caller.
        return false;
    }
}

My problem is that I don't want to download the file to instance storage; I want to pass the file to FilterReader directly. However, FilterReader takes a physical path, and passing an HTTP address doesn't work. Can anybody suggest a workaround? I don't want to download the file from the blob to disk and then index it. I would prefer to download it, keep it in main memory, and run the IFilter on it directly.

I am using IFilter from here


Solution

It is not very clear what you mean by "I don't want to download same file again from blob and then index it, instead i will prefer download and keep it in main memory and directly use index filter". What is that main memory: Azure Blob storage, or local instance memory?

The issue you are facing, however, cannot be worked around, because of the nature of the IFilter interface. If you look a bit deeper into the source you are using from here, you will discover that under the covers it uses the IPersistFile COM interface. Unfortunately, this interface only works with local files and does not accept streams.

What I would have suggested is to open a Stream on the blob and pass it to the reader instead of a physical path. However, as already said, IFilter uses COM interfaces that work only with physical paths. So with your current approach there is no way to skip downloading the blob.
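Since the file has to exist on disk anyway, the download step can at least be made tidy. The following is only a minimal sketch, not part of the original code: it assumes the blob content is available as a Stream (for example from a call such as blob.OpenRead(), if your storage client version offers it), and the names TempFileIndexing, IndexViaTempFile and indexFromPath are hypothetical. It copies the stream to a uniquely named temporary file, hands the path to a path-only indexer such as FilterReader, and deletes the file even when indexing throws.

```csharp
using System;
using System.IO;

static class TempFileIndexing
{
    // Hypothetical helper: copies 'source' to a uniquely named temp file,
    // passes that path to 'indexFromPath', and always deletes the file afterwards.
    public static bool IndexViaTempFile(Stream source, string extension, Func<string, bool> indexFromPath)
    {
        // Keep the original extension (e.g. ".pdf"): the assumption here is that
        // the IFilter for a file is selected based on its extension.
        string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
        try
        {
            using (FileStream target = File.Create(tempPath))
            {
                source.CopyTo(target);
            }
            return indexFromPath(tempPath);
        }
        finally
        {
            // File.Delete does not throw if the file was never created.
            File.Delete(tempPath);
        }
    }
}
```

The unique file name avoids collisions when two blobs in different containers share a name, and the finally block keeps local storage from filling up with leftovers after a failed extraction.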

There is nothing scary about downloading blobs locally. If the storage account is in the same affinity group as the compute instance, the download will be extremely fast and the traffic will be free. Given that you use a small instance size, you will have 165GB of local storage, which is plenty.

You can optimize your process a bit by keeping track of what has been indexed and what has not. You can use Azure Table storage for that: it is another extremely fast and cheap storage option, and it is perfect for storing key-value pairs such as file name and ETag. Then, when you enumerate the blobs, first fetch the ETag for a blob and check against the table whether that version has already been indexed. Download the blob only if it has not been indexed, then add a new record to the table to mark the file as indexed.
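The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: IndexedBlobTracker is a hypothetical class, and an in-memory Dictionary stands in for the Azure Table so that the logic is self-contained. In a real deployment the two methods would read from and write to the table instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an Azure Table:
// maps a blob key to the ETag of the version that was last indexed.
class IndexedBlobTracker
{
    private readonly Dictionary<string, string> indexedETags =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // True when the blob is new, or its ETag differs from the one recorded at last indexing.
    public bool NeedsIndexing(string blobKey, string currentETag)
    {
        string recorded;
        return !indexedETags.TryGetValue(blobKey, out recorded)
            || !string.Equals(recorded, currentETag, StringComparison.Ordinal);
    }

    // Call only after the blob has been downloaded and indexed successfully.
    public void MarkIndexed(string blobKey, string currentETag)
    {
        indexedETags[blobKey] = currentETag;
    }
}
```

In the crawl loop, the idea would be to call NeedsIndexing before DownloadToFile and MarkIndexed after IndexBlobData returns true. Blob names are only unique within a container, so the blob URI is probably a safer key than the bare name.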

Or don't use IFilter at all. I don't see any benefit in using IFilter on Azure. An IFilter is only registered when the application that provides it is installed. For instance, if you want to process Office documents with IFilter, you have to install Microsoft Office on the VM (which you currently can't do, even if you have a license, because of license mobility restrictions for MS Office). If you want the IFilter for PDF, you have to install Adobe Acrobat Reader (which you can do via a startup task). And so on: some applications you can install, some you can't. Your Windows Azure VM instance is plain Windows with no IFilters at all. Imagine a basic installation of Windows Server 2008 R2 with no roles and no features added: that is your instance.

Licensed under: CC-BY-SA with attribution