I don't know how the Java algorithm works, but a lot of databases include a soundex()
function. It converts a string into a short code representing how the string sounds, so that similar-sounding strings get the same code.
You can then compare the resulting soundex codes instead of the original strings.
This should be much faster than your current approach, since each string only has to be encoded once before the codes are compared. You would have to test it to see whether it returns acceptable results.
Actually, I just looked at the Java code. You can take the same approach there: go through the file and calculate the soundex for each entry, then do the comparison afterwards, perhaps by sorting the entries by their soundex code and looking for duplicates.
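Here is a rough sketch of that approach in Java, not a drop-in replacement for your code. It assumes the entries are already in a `List<String>` (for a file with one entry per line, `java.nio.file.Files.readAllLines` could supply that). The class and method names (`SoundexDemo`, `soundex`, `groupBySoundex`) are made up for illustration. The encoder follows the common American Soundex rules, so a database's soundex() may differ in edge cases. Instead of literally sorting the file, it groups entries in a `TreeMap` keyed by code, which amounts to the same thing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class SoundexDemo {

    // Maps an uppercase letter to its Soundex digit.
    // '-' marks H and W (skipped without separating equal codes),
    // '0' marks vowels and Y (skipped, but they do separate equal codes).
    private static char code(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            case 'H': case 'W': return '-';
            default: return '0';
        }
    }

    // Returns the 4-character Soundex code, or "" if the input has no A-Z letters.
    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (char raw : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') continue;      // ignore non-letters
            char c = code(raw);
            if (out.length() == 0) {
                out.append(raw);                       // first letter is kept as-is
                prev = c;
            } else if (c != '-') {
                if (c != '0' && c != prev) out.append(c);
                prev = c;
            }
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');      // pad short codes with zeros
        return out.toString();
    }

    // Groups entries by Soundex code; the TreeMap keeps the codes in sorted order.
    static Map<String, List<String>> groupBySoundex(List<String> entries) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String entry : entries) {
            groups.computeIfAbsent(soundex(entry), k -> new ArrayList<>()).add(entry);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample data only; replace with the entries read from your file.
        List<String> entries = Arrays.asList(
                "Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft", "Tymczak");
        for (Map.Entry<String, List<String>> g : groupBySoundex(entries).entrySet()) {
            if (g.getValue().size() > 1) {             // only report likely duplicates
                System.out.println(g.getKey() + " -> " + g.getValue());
            }
        }
    }
}
```

With the sample data, Robert and Rupert land in one group and Ashcraft and Ashcroft in another, while Rubin and Tymczak are left alone. Each entry is encoded once and then placed by its code, so you avoid comparing every entry against every other entry.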