Consider using the Apache Lucene API It provides functionality for searching, stemming, tokenization, indexing, document similarity scoring. Its an open source implementation of basic best practices in Information Retrieval
The functionality that seems most useful to you from Lucene is their moreLikeThis algorithm which uses Latent Semantic Analysis to locate similar documents.