Question

I have a very simple code taken from this example, where I am using the Lin, Path and Wu-Palmer similarity measures to compute the similarity between two words. My code is as follows:

import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;
import edu.cmu.lti.ws4j.impl.Path;
import edu.cmu.lti.ws4j.impl.WuPalmer;

public class Test {
    private static ILexicalDatabase db = new NictWordNet();
    private static RelatednessCalculator lin = new Lin(db);
    private static RelatednessCalculator wup = new WuPalmer(db);
    private static RelatednessCalculator path = new Path(db);

    public static void main(String[] args) {
        String w1 = "walk";
        String w2 = "trot";
        System.out.println(lin.calcRelatednessOfWords(w1, w2));
        System.out.println(wup.calcRelatednessOfWords(w1, w2));
        System.out.println(path.calcRelatednessOfWords(w1, w2));
    }
}

And the scores are as expected EXCEPT when both words are identical. If both words are the same (e.g. w1 = "walk"; w2 = "walk";), the three measures I have should each return 1.0. But instead, they are returning 1.7976931348623157E308.

I have used ws4j before (the same version, in fact), but I have never seen this behavior. Searching online has not yielded any clues. What could possibly be going wrong here?

P.S. The fact that the Lin, Wu-Palmer and Path measures should return 1 can also be verified with the online demo provided by ws4j


Solution 2

I had raised this issue at the googlecode site for ws4j, and it turns out that indeed it was a bug. The reply I received is as follows:

This looks like it is due to attempting to override a protected static field (this can't be done in Java). The attached patch fixes the issue by moving the definitions of the min and max fields to non-static final members in RelatednessCalculator and adding getters. Implementations then provide their min/max values through super constructor calls.

Patch can be applied with patch -p1 < 0001-Cannot-override-static-members-replacing-fields-with.patch

And here is the (now resolved) issue on their site.
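
The static-field problem described in the reply can be reproduced in isolation. The sketch below is a simplified illustration, not ws4j's actual code: every class name in it is hypothetical, and it only assumes what the reply states (a static `max` field that subclasses tried to override, replaced by a final instance field set through the super constructor).

```java
// Simplified illustration; all class names here are hypothetical, not ws4j's own.
class FieldHidingDemo {

    // Before the patch (assumed shape): the base class reads its own static field.
    static class BrokenCalculator {
        protected static double max = Double.MAX_VALUE;

        double scoreForIdenticalSynsets() {
            // Static fields are bound at compile time, so this is BrokenCalculator.max.
            return max;
        }
    }

    static class BrokenLin extends BrokenCalculator {
        // This declares a second field that HIDES the parent's; it does not override it.
        protected static double max = 1.0;
    }

    // After the patch (assumed shape): a final instance field, set via the constructor.
    static abstract class FixedCalculator {
        private final double max;

        protected FixedCalculator(double max) {
            this.max = max;
        }

        double getMax() {
            return max;
        }

        double scoreForIdenticalSynsets() {
            return getMax();
        }
    }

    static class FixedLin extends FixedCalculator {
        FixedLin() {
            super(1.0); // the measure supplies its own maximum
        }
    }

    public static void main(String[] args) {
        System.out.println(new BrokenLin().scoreForIdenticalSynsets()); // 1.7976931348623157E308
        System.out.println(new FixedLin().scoreForIdenticalSynsets());  // 1.0
    }
}
```

The first call shows the symptom from the question: code in the base class never sees the subclass's hidden field, so the base class default leaks through.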

OTHER TIPS

I had a similar problem, and here's what's going on. I hope that other people who run into this problem will find my response helpful.

If you have noticed, the online demo allows you to choose word sense by specifying word in the following format: word#pos_tag#word_sense. For example, a noun gender with the first word sense would be gender#n#1.
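
To make the word#pos_tag#word_sense format concrete, here is a minimal sketch that splits such a key into its three parts. `SenseKey` is a hypothetical helper written for this answer, not a ws4j class, and it only handles the full three-part form.

```java
// Hypothetical helper, not part of ws4j: splits a "word#pos#sense" key into its parts.
final class SenseKey {
    final String word;
    final String pos;
    final int sense;

    private SenseKey(String word, String pos, int sense) {
        this.word = word;
        this.pos = pos;
        this.sense = sense;
    }

    static SenseKey parse(String key) {
        String[] parts = key.split("#");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected word#pos#sense, got: " + key);
        }
        return new SenseKey(parts[0], parts[1], Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        SenseKey k = SenseKey.parse("gender#n#1");
        System.out.println(k.word + " / " + k.pos + " / " + k.sense); // gender / n / 1
    }
}
```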

Your code snippet uses the first word sense by default. When I calculate the WuPalmer similarity between "gender" and "sex", it returns 0.26. If I use the online demo, it returns 1.0. But if we use "gender#n#1" and "sex#n#1", the online demo returns 0.26, so there is no discrepancy. The online demo calculates the max over all pos tag / word sense pairs. Here's a corresponding snippet of code that should do the trick:

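import java.util.List;

import edu.cmu.lti.jawjaw.pobj.POS;
import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.lexical_db.data.Concept;
import edu.cmu.lti.ws4j.Relatedness;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;
import edu.cmu.lti.ws4j.util.WS4JConfiguration;

// The statements below belong inside a method (e.g. main).
ILexicalDatabase db = new NictWordNet();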
WS4JConfiguration.getInstance().setMFS(true);
RelatednessCalculator rc = new Lin(db);
String word1 = "gender";
String word2 = "sex";
List<POS[]> posPairs = rc.getPOSPairs();
double maxScore = -1D;

for(POS[] posPair: posPairs) {
    List<Concept> synsets1 = (List<Concept>)db.getAllConcepts(word1, posPair[0].toString());
    List<Concept> synsets2 = (List<Concept>)db.getAllConcepts(word2, posPair[1].toString());

    for(Concept synset1: synsets1) {
        for (Concept synset2: synsets2) {
            Relatedness relatedness = rc.calcRelatednessOfSynset(synset1, synset2);
            double score = relatedness.getScore();
            if (score > maxScore) { 
                maxScore = score;
            }
        }
    }
}

if (maxScore == -1D) {
    maxScore = 0.0;
}

System.out.println("sim('" + word1 + "', '" + word2 + "') =  " + maxScore);

Also, this will give you 0.0 similarity on non-stemmed word forms, e.g. 'genders' and 'sex'. You can use the Porter stemmer included in ws4j to stem words beforehand if needed.

Hope this helps!

Here is why -

In jcn we have...

sim(c1, c2) = 1 / distance(c1, c2)

distance(c1, c2) = ic(c1) + ic(c2) - (2 * ic(lcs(c1, c2)))

where c1, c2 are the two concepts, ic is the information content of the concept. lcs(c1, c2) is the least common subsumer of c1 and c2.

Now, we don't want distance to be 0 (=> similarity will become undefined).

distance can be 0 in 2 cases...

(1) ic(c1) = ic(c2) = ic(lcs(c1, c2)) = 0

ic(lcs(c1, c2)) can be 0 if the lcs turns out to be the root node (information content of the root node is zero). But since c1 and c2 can never be the root node, ic(c1) and ic(c2) would be 0 only if the 2 concepts have a 0 frequency count, in which case, for lack of data, we return a relatedness of 0 (similar to the lin case).

Note that the root node ACTUALLY has an information content of zero. Technically, none of the other concepts can have an information content value of zero. We assign concepts zero values when in reality their information content is undefined (due to zero frequency counts). To see why, look at the formula for information content, ic(c) = -log(freq(c)/freq(ROOT)): a zero frequency count gives log(0), which is undefined, while the root node gives log(1) = 0.

(2) The second case that distance turns out to be zero is when...

ic(c1) + ic(c2) = 2 * ic(lcs(c1, c2))

(which could have a more likely special case ic(c1) = ic(c2) = ic(lcs(c1, c2)) if all three turn out to be the same concept.)
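
The information content formula quoted in case (1) can be checked with a few lines of code. This is a minimal sketch with made-up frequency counts; the class and method names are hypothetical, and the "assign zero when the count is zero" rule is taken from the explanation above rather than from any library source.

```java
// Hypothetical names and toy counts; only the formula itself comes from the text above.
final class InformationContentDemo {

    // ic(c) = -log(freq(c) / freq(ROOT)), written as log(freq(ROOT) / freq(c))
    // so that the root yields 0.0 rather than -0.0.
    static double rawIc(long freq, long rootFreq) {
        return Math.log((double) rootFreq / freq);
    }

    // Convention described above: undefined IC (zero count) is reported as 0.
    static double ic(long freq, long rootFreq) {
        return freq <= 0 ? 0.0 : rawIc(freq, rootFreq);
    }

    public static void main(String[] args) {
        long rootFreq = 1000;
        System.out.println(rawIc(rootFreq, rootFreq)); // 0.0 (the root: log(1))
        System.out.println(rawIc(0, rootFreq));        // Infinity (zero count: undefined IC)
        System.out.println(ic(0, rootFreq));           // 0.0 (assigned by convention)
        System.out.println(ic(10, rootFreq) > 0);      // true
    }
}
```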

How should one handle this?

Intuitively this is the case of maximum relatedness (zero distance). For jcn this relatedness would be infinity... But we can't return infinity. And simply returning a 0 wouldn't work... since here we have found a pair of concepts with maximum relatedness, and returning a 0 would be like saying that they aren't related at all.
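
The two zero-distance cases can be sketched as follows. The names are hypothetical, and the choice of `Double.MAX_VALUE` as the stand-in for "infinity" is an assumption made for illustration; the explanation above does not say which finite value is actually returned.

```java
// Hypothetical sketch of the jcn cases discussed above; not a library implementation.
final class JcnDemo {

    // distance(c1, c2) = ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))
    static double distance(double ic1, double ic2, double icLcs) {
        return ic1 + ic2 - 2 * icLcs;
    }

    static double similarity(double ic1, double ic2, double icLcs) {
        // Case (1): both concepts have zero frequency counts, so there is no data.
        if (ic1 == 0.0 && ic2 == 0.0) {
            return 0.0;
        }
        double d = distance(ic1, ic2, icLcs);
        // Case (2): zero distance means maximum relatedness.
        // ASSUMPTION: a large finite sentinel is returned instead of infinity.
        if (d == 0.0) {
            return Double.MAX_VALUE;
        }
        return 1.0 / d;
    }

    public static void main(String[] args) {
        System.out.println(similarity(0.0, 0.0, 0.0)); // 0.0
        System.out.println(similarity(3.0, 3.0, 3.0)); // 1.7976931348623157E308
        System.out.println(similarity(3.0, 2.0, 1.5)); // 0.5
    }
}
```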

1.7976931348623157E308 is the value of Double.MAX_VALUE, but some similarity/relatedness measures (Lin, WuPalmer and Path) produce scores between 0 and 1. For identical synsets, then, the maximum value they can return is 1. I fixed this and other bugs in the version in my repo (https://github.com/DonatoMeoli/WS4J).
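
If you are stuck on an unpatched build, one possible caller-side workaround is to cap the score at the measure's known maximum. This is only a sketch: `ScoreCap` is a hypothetical helper, it is not how the repo above fixes the bug, and it is appropriate only for measures bounded by 1 (Lin, WuPalmer, Path).

```java
// Hypothetical caller-side workaround; not the fix applied in the patched library.
final class ScoreCap {

    // Returns the raw score, limited to the maximum the measure can legitimately produce.
    static double cap(double rawScore, double measureMax) {
        return Math.min(rawScore, measureMax);
    }

    public static void main(String[] args) {
        System.out.println(Double.MAX_VALUE);           // 1.7976931348623157E308
        System.out.println(cap(Double.MAX_VALUE, 1.0)); // 1.0
        System.out.println(cap(0.75, 1.0));             // 0.75
    }
}
```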

Now, for two identical words, the values returned are:

HirstStOnge      16.0
LeacockChodorow  1.7976931348623157E308
Lesk             1.7976931348623157E308
WuPalmer         1.0
Resnik           1.7976931348623157E308
JiangConrath     1.7976931348623157E308
Lin              1.0
Path             1.0
Done in 67 msec.

Process finished with exit code 0
Licensed under: CC-BY-SA with attribution