Question

I have a huge amount of PDF/Word/Excel/etc. files to index (40GB now, but maybe up to 1000GB in some months), and I was considering using Solr with a DataImportHandler and Tika. I have read a lot of topics on this subject, but there is one problem for which I still have not found a solution: if I index all the files (full or delta import), remove a file from the filesystem, and index again (with a delta import), then the document corresponding to that file will not be removed from the index.

Here are some possibilities:

  • Do a full import. But I want to avoid this as much as possible, since I think it could be very time-consuming (several days, which is not a big issue in itself) and bandwidth-consuming (the main issue, since the files are on a shared network drive).
  • Implement a script which would verify, for each document in the index, whether the corresponding file still exists (much less bandwidth-consuming); see the sketch after this list. But I do not know whether I should do this inside or outside of Solr, and how.
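
A minimal sketch of that second option, assuming the documents' unique key field ("id") stores the absolute file path and using SolrJ (here the 4.x-era HttpSolrServer); the Solr URL, core name, and field name are placeholders to adapt:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class OrphanCleaner {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            int rows = 1000;
            int start = 0;
            List<String> toDelete = new ArrayList<String>();

            while (true) {
                // Page through every document, fetching only the id (file path)
                SolrQuery q = new SolrQuery("*:*");
                q.setFields("id");
                q.setStart(start);
                q.setRows(rows);
                QueryResponse rsp = solr.query(q);

                for (SolrDocument doc : rsp.getResults()) {
                    String path = (String) doc.getFieldValue("id");
                    if (!new File(path).exists()) {
                        toDelete.add(path);   // file is gone, mark document for deletion
                    }
                }
                start += rows;
                if (start >= rsp.getResults().getNumFound()) {
                    break;
                }
            }

            if (!toDelete.isEmpty()) {
                solr.deleteById(toDelete);
                solr.commit();
            }
            solr.shutdown();
        }
    }

Checking File.exists() only touches directory metadata, so this avoids pulling file contents over the network drive.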

Do you have any other idea, or a way to implement the second solution? Thanks in advance.

Some details :

  • I will use the "newerThan" option of the FileListEntityProcessor to do the delta import.
  • If I store the date when each document was indexed, it does not help me, because if I haven't indexed a document in the last import it can be because it has been removed OR because it has not changed (delta import).
  • I have both stored and unstored fields, so I don't think the new Solr 4.0 feature of updating only one field in a document can be a solution.

Solution

Have you thought about using a file system monitor to catch deletions and update the index?

I think Apache Commons IO supports that.
Check out the org.apache.commons.io.monitor package, specifically the FileAlterationObserver and FileAlterationMonitor classes.
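
For example, a minimal sketch of a watcher that removes the corresponding Solr document when a file disappears. It assumes the Solr unique key is the absolute file path, and the watched directory, Solr URL, and polling interval are placeholders:

    import java.io.File;

    import org.apache.commons.io.monitor.FileAlterationListenerAdaptor;
    import org.apache.commons.io.monitor.FileAlterationMonitor;
    import org.apache.commons.io.monitor.FileAlterationObserver;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DeletionWatcher {
        public static void main(String[] args) throws Exception {
            final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            File watchedDir = new File("/mnt/shared/docs");

            FileAlterationObserver observer = new FileAlterationObserver(watchedDir);
            observer.addListener(new FileAlterationListenerAdaptor() {
                @Override
                public void onFileDelete(File file) {
                    try {
                        // Assumes the Solr unique key is the absolute file path
                        solr.deleteById(file.getAbsolutePath());
                        solr.commit();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });

            // Poll the directory every 10 seconds
            FileAlterationMonitor monitor = new FileAlterationMonitor(10000, observer);
            monitor.start();
        }
    }

Since FileAlterationObserver works by polling directory listings rather than OS-level events, it should also work on a shared network drive, at the cost of periodic directory scans.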

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow