Question

I have been using the Soundex algorithm, for which I found a ready-made Java implementation at http://introcs.cs.princeton.edu/java/31datatype/Soundex.java.html . The program reads a .csv file, saves its entries into arrays, and then uses this algorithm to check one of these arrays for phonetic similarities. (More about the Soundex algorithm: http://en.wikipedia.org/wiki/Soundex).

My .csv file has roughly 200,000 entries, and it takes 5 hours to check 30,000 entries, which I consider quite slow. [My algorithm checks every entry of the array against all the other entries, except the ones that have already been checked, so I don't think the problem lies there.]

So, my question is: Is there a way to reduce this time?

I have been thinking about connecting my database directly to the program with the help of SQL, but I don't know whether there is another, faster way to do that.

Any suggestion would be very helpful.

Solution

I don't know how the Java algorithm works. A lot of databases include a soundex() function, which converts a string into another string representing its sound.

You can then do the comparison between the resulting soundex strings.

This should go much, much faster than your current approach. You would have to test it to see if it returns acceptable results.

Actually, I just looked at the Java code. You can take the same approach there. Go through the file and calculate the soundex for each entry once. Then do the comparison afterwards, perhaps by sorting the entries by their soundex code and looking for duplicates.
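To make that concrete, here is a minimal sketch of the "compute once, then look for duplicates" idea. It is not the Princeton code from the question: the class name SoundexGrouping and the method groupByCode are made up for illustration, and the soundex method is my own compact version of standard American Soundex, so its output may differ from your implementation on edge cases. Instead of sorting it uses a HashMap keyed by the code, which finds the same duplicates in a single pass.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class SoundexGrouping {

    // Soundex digit for each letter A..Z; '0' marks letters that get no digit
    // (vowels plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    // Returns the 4-character Soundex code, or "" if the input has no letters A-Z.
    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(s.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // skip digits, spaces, punctuation
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as is
                prev = code;
            } else {
                if (code != '0' && code != prev) {
                    out.append(code);
                }
                // H and W do not separate two letters with the same digit
                if (c != 'H' && c != 'W') {
                    prev = code;
                }
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    // Computes each entry's code exactly once and buckets the entries by code.
    static Map<String, List<String>> groupByCode(List<String> entries) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String entry : entries) {
            groups.computeIfAbsent(soundex(entry), k -> new ArrayList<>()).add(entry);
        }
        return groups;
    }
}
```

Every list in the returned map that holds more than one entry is a set of phonetically similar entries. With 200,000 entries this means about 200,000 soundex computations and map lookups, rather than the roughly 20 billion pairwise checks that comparing every entry against every other entry requires.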

OTHER TIPS

Just use the soundex implementation in your database. Most large, popular databases have it built in, e.g. PostgreSQL, MySQL or even Microsoft's T-SQL. It'll be easier to set up and likely a lot faster than whatever Java library you're using.

Licensed under: CC-BY-SA with attribution