Google Guava MultiSet returning incorrect value

https://stackoverflow.com/questions/21463478

05-10-2022

Question

I am using Google Guava APIs to calculate word count.

    public static void main(String[] args)
    {
        String txt = "Lemurs of Madagascar is a reference work and field guide giving descriptions and biogeographic data for all the known lemur species in Madagascar (ring-tailed lemur pictured). It also provides general information about lemurs and their history and helps travelers identify species they may encounter. The primary contributor is Russell Mittermeier, president of Conservation International. The first edition in 1994 received favorable reviews for its meticulous coverage, numerous high-quality illustrations, and engaging discussion of lemur topics, including conservation, evolution, and the recently extinct subfossil lemurs. The American Journal of Primatology praised the second edition's updates and enhancements. Lemur News appreciated the expanded content of the third edition (2010), but was concerned that it was not as portable as before. The first edition identified 50 lemur species and subspecies, compared to 71 in the second edition and 101 in the third. The taxonomy promoted by these books has been questioned by some researchers who view these growing numbers of lemur species as insufficiently justified inflation of species numbers.";

        Iterable<String> result = Splitter.on(" ").trimResults(CharMatcher.DIGIT)
                   .omitEmptyStrings().split(txt);
        Multiset<String> words = HashMultiset.create(result);

        for(Multiset.Entry<String> entry : words.entrySet())
        {
            String word = entry.getElement();
            int count = entry.getCount(); // same value as words.count(word), without a second lookup
            System.out.printf("%S %d", word, count);
            System.out.println();
        }
    }

The output should be

Lemurs 3

However, this is what I am getting:

Lemurs 1
Lemurs 1
Lemurs 1

What am I doing wrong?

Solution

Multiset works fine. Take a close look at your results: switching the printf format to e.g. "|%s| %d" (lowercase s, because %S converts the word to upper case) will help:

|lemurs.| 1
|lemurs| 1
|Lemurs| 1

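It is immediately apparent that these are three different strings. trimResults(CharMatcher.DIGIT) strips only leading and trailing digits, so punctuation and capitalization are left intact. The solution in this case is simply to strip all non-alphabetic characters and lowercase all words before counting.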
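As a rough sketch of that normalization step, the snippet below lowercases the text, removes every non-letter character from each token, and then counts. To stay dependency-free it tallies with a plain java.util.Map rather than Guava's Multiset. The class name NormalizedWordCount, the helper countWords, and the shortened sample sentence are made up for illustration, and the ASCII-only pattern [^a-z] is an assumption that fits this English text.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class NormalizedWordCount {

    // Hypothetical helper: lowercases the text, splits on whitespace,
    // drops every character outside a-z, and tallies the remaining words.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Assumption: ASCII letters are enough for this text.
            String word = token.replaceAll("[^a-z]", "");
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample sentence, not the full text from the question.
        String sample = "Lemurs of Madagascar covers lemurs and subfossil lemurs.";
        Map<String, Integer> counts = countWords(sample);
        System.out.println("lemurs " + counts.get("lemurs")); // prints "lemurs 3"
    }
}
```

The same cleaned-up words could presumably be added to a HashMultiset instead of the map. Note that this normalization also turns tokens such as "edition's" into "editions", which may or may not be what you want.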

OTHER TIPS

Using printf("%S %d", word, count) with a capital S hides the fact that the different spellings of the word "lemurs" are being counted separately, because %S prints every word in upper case. When I run that program, I see:

one occurrence of "lemurs." with a trailing period that is not trimmed
one occurrence of "lemurs" all lowercase
one occurrence of "Lemurs" with the first letter capitalized
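The effect of the capital S can be checked in isolation. The small sketch below formats the three tokens once with %S and once with a delimited lowercase %s. The class FormatCaseDemo and the helper render are hypothetical names used only for this demonstration.

```java
class FormatCaseDemo {

    // Hypothetical helper: formats one word/count pair with the given format string.
    static String render(String format, String word, int count) {
        return String.format(format, word, count);
    }

    public static void main(String[] args) {
        String[] tokens = {"lemurs.", "lemurs", "Lemurs"};
        for (String token : tokens) {
            // %S converts the argument to upper case; %s prints it unchanged.
            System.out.println(render("%S %d", token, 1) + "   vs   " + render("|%s| %d", token, 1));
        }
    }
}
```

With %S, "lemurs" and "Lemurs" render identically, so only the lowercase %s column shows that the multiset holds distinct strings.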

Licensed under: CC-BY-SA with attribution
