I think, the best suggestion is to use Lucene ....
you can save your documents as they are with some unique path name/file_name , and use that as identifier when you index the documents ... I am sure you can find a lot of similar examples if you search Lucene ..