Question

I have a huge amount of PDF/Word/Excel/etc. files to index (40GB now, but maybe up to 1000GB in some months), and I was considering using Solr with a DataImportHandler and Tika. I have read a lot of topics on this subject, but there is one problem for which I still have not found a solution: if I index all the files (full or delta import), remove a file from the filesystem, and index again (with a delta import), then the document corresponding to that file will not be removed from the index.

Here are some possibilities:

  • Do a full import. But I want to avoid this as much as possible, since I think it could be very time-consuming (several days, which is not a big issue in itself) and bandwidth-consuming (the main issue, since the files are on a shared network drive).
  • Implement a script which would verify, for each document in the index, whether the corresponding file still exists (much less bandwidth-consuming); see the sketch after this list. But I do not know whether I should do this inside or outside of Solr, and how.
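
A minimal sketch of that second option, assuming the documents' unique key field ("id") stores the absolute file path and using SolrJ (here the 4.x-era HttpSolrServer); the Solr URL, core name, and field name are placeholders to adapt:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class OrphanCleaner {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            int rows = 1000;
            int start = 0;
            List<String> toDelete = new ArrayList<String>();

            while (true) {
                // Page through every document, fetching only the id (file path)
                SolrQuery q = new SolrQuery("*:*");
                q.setFields("id");
                q.setStart(start);
                q.setRows(rows);
                QueryResponse rsp = solr.query(q);

                for (SolrDocument doc : rsp.getResults()) {
                    String path = (String) doc.getFieldValue("id");
                    if (!new File(path).exists()) {
                        toDelete.add(path);   // file is gone, mark document for deletion
                    }
                }
                start += rows;
                if (start >= rsp.getResults().getNumFound()) {
                    break;
                }
            }

            if (!toDelete.isEmpty()) {
                solr.deleteById(toDelete);
                solr.commit();
            }
            solr.shutdown();
        }
    }

Checking File.exists() only touches directory metadata, so this avoids pulling file contents over the network drive.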

Do you have any other idea, or a way to implement the second solution? Thanks in advance.

Some details :

  • I will use the "newerThan" option of the FileListEntityProcessor to do the delta import.
  • If I store the date when each document was indexed, it does not help me, because if I haven't indexed a document in the last import it can be because it has been removed OR because it has not changed (delta import).
  • I have both stored and unstored fields, so I don't think the new Solr 4.0 feature of updating only one field in a document can be a solution.

Solution

Have you thought about using a file system monitor to catch deletions and update the index?

I think Apache Commons IO supports that.
Check out the org.apache.commons.io.monitor package, specifically the FileAlterationObserver and FileAlterationMonitor classes.
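
For example, a minimal sketch of a watcher that removes the corresponding Solr document when a file disappears. It assumes the Solr unique key is the absolute file path, and the watched directory, Solr URL, and polling interval are placeholders:

    import java.io.File;

    import org.apache.commons.io.monitor.FileAlterationListenerAdaptor;
    import org.apache.commons.io.monitor.FileAlterationMonitor;
    import org.apache.commons.io.monitor.FileAlterationObserver;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DeletionWatcher {
        public static void main(String[] args) throws Exception {
            final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            File watchedDir = new File("/mnt/shared/docs");

            FileAlterationObserver observer = new FileAlterationObserver(watchedDir);
            observer.addListener(new FileAlterationListenerAdaptor() {
                @Override
                public void onFileDelete(File file) {
                    try {
                        // Assumes the Solr unique key is the absolute file path
                        solr.deleteById(file.getAbsolutePath());
                        solr.commit();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });

            // Poll the directory every 10 seconds
            FileAlterationMonitor monitor = new FileAlterationMonitor(10000, observer);
            monitor.start();
        }
    }

Since FileAlterationObserver works by polling directory listings rather than OS-level events, it should also work on a shared network drive, at the cost of periodic directory scans.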

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow