Frage

I'm looking for some good metric (cosine, chapman, jaccard, jaro, dice etc) to perform fuzzy matching of strings without considering words order. I am open for using combination of some metrics as well.

For example:

'john rambo' == 'jovn rambo'
'john rambo' == 'rambo jovn'
'john rambo' == 'john rambo x'
'john rambo the vietnam veteran' == 'john rambo the vietnam us veteran'

but

'john kerry' != 'john rambo'

I'm aiming at detection of similar strings when we have a typo, single letter or word added (for the last one, the strings being compared should have reasonable lengths to say that they are similar with additional word placed in one of them).

War es hilfreich?

Lösung

Condition of Equality: If the long string contains a sub-set that is equivalent to the set of words as the short string with each equivalent string have similarity > 75%.

I will be working in Java:

String str1 = "john rambo the vietnam veteran";
String str2 = "jovn rabbo the vittnam us vetteran";

The first step is to transform the text into word-bag:

ArrayList<String> a = new ArrayList<String> (Arrays.asList(str1.split(" ")));
ArrayList<String> b = new ArrayList<String> (Arrays.asList(str2.split(" ")));

Next is to start from the list that has the least number of words, and cross-check them in the other bag:

boolean are_equal = true;
boolean word_found = false;

if (a.size() < b.size())
{
    for (String a_word : a)
    {
        word_found = false;

        for (int i=0; i<b.size(); i++)
        {
            String b_word = b.get(i);

            if (is_similar_enough (a_word, b_word))
            {
                word_found = true;
                b.remove(i);
                break;
            }
        }

        if (!word_found)
        {
            are_equal = false;
            break;
        }
    }
}
else
{
   // ..
}

Condition of Word Similarity: Let's say that two words are similar if they both have the same number of characters, and the number of equivalent (equivalent means symmetrically similar) characters make of 75% or more.

but before that we have to define is_similar_enough function:

public boolean is_similar_enough (String a, String b)
{
    int equivlantChars = 0;

    if (a.length() != b.length())
        return false;

    for (int i=0; i<a.length(); i++)
        if (a.toCharArray[i] == b.toCharArray[i])
            equivlantChars ++;

    return ((((double)equivlantChars) / ((double)a.length())) >= 0.75);
}
Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top