Question

I have a TreeSet<String> containing some keywords. I need to test some Strings to see if they contain any of these keywords.

I currently have:

String tweet = object.getText();
for (String keyword : keywords_set)
{
    if(tweet.contains(keyword))
    {
        return true;
    }
}

Is there a more elegant and efficient way of doing this for a stream of Strings?

Was it helpful?

Solution 2

Aho-Corasick multiple string matching algorithm: https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

OTHER TIPS

You won't get any more efficient than what you have with JDK classes and methods. You need to go through each String in the Set and check if your String contains it.

However, you can make it cleaner if you are willing to use a 3rd party library, Guava.

With Guava, you can use Iterables.any(Iterable, Predicate) which

Returns true if any element in iterable satisfies the predicate.

Use it like so

Set<String> keywords_set = ...
final String tweet = ...

return Iterables.any(keywords_set, new Predicate<String>() {
    @Override
    public boolean apply(String input) {
        return tweet.contains(input);
    }           
});

With Java 8, it will be even cleaner thanks to lambda expressions and aggregate operations.

Here's an explicit example of how to use AhoCorasick.

I created a branch of Robert Bor's java Aho-Corasick implementation that adds a "matches" method which returns true as soon as the first match is found.

Building the search trie is non-trivial. I've included an in-efficient method implementation that matches the code sample you provided. But you really want to amortize the cost of the trie construction across a large number of searches. To do that you really want to change your code that calls the example you included.

I included an example of how to construct the search trie once and them use it for multiple searches.

public boolean doesTweetMatchSlow(String tweet, Set<String> keywords_set)
{
        Trie searchTrie = new Trie();
        for (String keyword : keywords_set) {
            searchTrie.addKeyword(keyword);
        }

        return searchTrie.matches(tweet);
}

public Collection<String> findMatchingTweetsFaster(Iterable<String> tweets, Set<String> keywords_set)
{
    List<String> matching = null;

    if (tweets != null) {
        matching = new ArrayList<>();

        if (keywords_set != null && !keywords_set.isEmpty()) {

            // build trie once.
            Trie searchTrie = new Trie();
            for (String keyword : keywords_set) {
                searchTrie.addKeyword(keyword);
            }

            for (String tweet : tweets) {
                // Re-use trie for every tweet.
                if (searchTrie.matches(tweet)) {
                    matching.add(tweet);
                }
            }
        }
    }
    return matching;
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top