Java: how to find the most probable string in a list of strings?

https://stackoverflow.com//questions/12650409

11-12-2019
|

Question

I have a list of strings in Java containing first name of a person with dissimilar spellings (not entirely different). For example, John may be spelled as Jon, Jawn, Jaun etc. How should I retrieve the most appropriate string in this list. If anyone can suggest a method how to use Soundex in this case, it shall be of great help.

Solution

You have use approximate string matching algorithm , There are several strategies to implement this . Blur is a Trie-based Java implementation of approximate string matching based on the Levenshtein word distance. you can find the implementation at github here

There is another strategy to implement its called boyer-moore approximate string matching algorithm . Here is the Java code for that

The usual approach to solve these problem using this algorithm and Levenshtein word distance is to compare the input to the possible outputs and choose the one with the smallest distance to the desired output.

OTHER TIPS

There is one jar file for matching approximate string..

go through link and download frej.jar

http://sourceforge.net/projects/frej/files/

there is one method inside this jar file

Fuzzy.equals("jon","john");

it will return true in this type of approximate string.

Solr can do this, if you use the phonetic filter factory while indexing the text.

It is solr's speciality to search. And search for similar sounding words. However if you just want this, and not want other features offered by solr, then you can use the source available here.

There are lots of theories and methods to estimate the match of 2 strings

Giving a blunt true/false result seems strange since "jon" really doesn't equal "john", it's close but doesn't match

One great academic work implementing quite a few estimation methods is called "SecondString.jar" - site link

Most implemented methods give some score to the match, this score depends on the method used

Example: Lets define "Edit Distance" as the number of char changes required in str1 to get to str2 in this case "jon" --> "john" requires 1 char addition naturally for this method a lower score is better

This article provides a detailed explanation and complete code on a Trie-based Java implementation of approximate string matching: Fast and Easy Levenshtein distance using a Trie.

The search function returns a list of all words that are less than the given

maximum distance from the target word

def search( word, maxCost ):

# build first row
currentRow = range( len(word) + 1 )

results = []

# recursively search each branch of the trie
for letter in trie.children:
    searchRecursive( trie.children[letter], letter, word, currentRow, 
        results, maxCost )

return results

This recursive helper is used by the search function above. It assumes that

the previousRow has been filled in already.

def searchRecursive( node, letter, word, previousRow, results, maxCost ):

columns = len( word ) + 1
currentRow = [ previousRow[0] + 1 ]

# Build one row for the letter, with a column for each letter in the target
# word, plus one for the empty string at column 0
for column in xrange( 1, columns ):

    insertCost = currentRow[column - 1] + 1
    deleteCost = previousRow[column] + 1

    if word[column - 1] != letter:
        replaceCost = previousRow[ column - 1 ] + 1
    else:                
        replaceCost = previousRow[ column - 1 ]

    currentRow.append( min( insertCost, deleteCost, replaceCost ) )

# if the last entry in the row indicates the optimal cost is less than the
# maximum cost, and there is a word in this trie node, then add it.
if currentRow[-1] <= maxCost and node.word != None:
    results.append( (node.word, currentRow[-1] ) )

# if any entries in the row are less than the maximum cost, then 
# recursively search each branch of the trie
if min( currentRow ) <= maxCost:
    for letter in node.children:
        searchRecursive( node.children[letter], letter, word, currentRow, 
            results, maxCost )

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow