Question

I have a String that I have to parse for different keywords. For example, I have the String:

"I will come and meet you at the 123woods"

And my keywords are

'123woods' 'woods'

I should report whenever I have a match and where. Multiple occurrences should also be accounted for. However, for this one, I should get a match only on 123woods, not on woods. This eliminates using String.contains() method. Also, I should be able to have a list/set of keywords and check at the same time for their occurrence. In this example, if I have '123woods' and 'come', I should get two occurrences. Method execution should be somewhat fast on large texts.

My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?

Was it helpful?

Solution

The example below is based on your comments. It uses a List of keywords, which will be searched in a given String using word boundaries. It uses StringUtils from Apache Commons Lang to build the regular expression and print the matched groups.

String text = "I will come and meet you at the woods 123woods and all the woods";

List<String> tokens = new ArrayList<String>();
tokens.add("123woods");
tokens.add("woods");

String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

If you are looking for more performance, you could have a look at StringSearch: high-performance pattern matching algorithms in Java.

OTHER TIPS

Use regex + word boundaries as others answered.

"I will come and meet you at the 123woods".matches(".*\\b123woods\\b.*");

will be true.

"I will come and meet you at the 123woods".matches(".*\\bwoods\\b.*");

will be false.

Hope this works for you:

String string = "I will come and meet you at the 123woods";
String keyword = "123woods";

Boolean found = Arrays.asList(string.split(" ")).contains(keyword);
if(found){
      System.out.println("Keyword matched the string");
}

http://codigounico.blogspot.com/

How about something like Arrays.asList(String.split(" ")).contains("xx")?

See String.split() and How can I test if an array contains a certain value.

Got a way to match Exact word from String in Android:

String full = "Hello World. How are you ?";

String one = "Hell";
String two = "Hello";
String three = "are";
String four = "ar";


boolean is1 = isContainExactWord(full, one);
boolean is2 = isContainExactWord(full, two);
boolean is3 = isContainExactWord(full, three);
boolean is4 = isContainExactWord(full, four);

Log.i("Contains Result", is1+"-"+is2+"-"+is3+"-"+is4);

Result: false-true-true-false

Function for match word:

private boolean isContainExactWord(String fullString, String partWord){
    String pattern = "\\b"+partWord+"\\b";
    Pattern p=Pattern.compile(pattern);
    Matcher m=p.matcher(fullString);
    return m.find();
}

Done

Try to match using regular expressions. Match for "\b123wood\b", \b is a word break.

A much simpler way to do this is to use split():

String match = "123woods";
String text = "I will come and meet you at the 123woods";

String[] sentence = text.split();
for(String word: sentence)
{
    if(word.equals(match))
        return true;
}
return false;

This is a simpler, less elegant way to do the same thing without using tokens, etc.

The solution seems to be long accepted, but the solution could be improved, so if someone has a similar problem:

This is a classical application for multi-pattern-search-algorithms.

Java Pattern Search (with Matcher.find) is not qualified for doing that. Searching for exactly one keyword is optimized in java, searching for an or-expression uses the regex non deterministic automaton which is backtracking on mismatches. In worse case each character of the text will be processed l times (where l is the sum of the pattern lengths).

Single pattern search is better, but not qualified, too. One will have to start the whole search for every keyword pattern. In worse case each character of the text will be processed p times where p is the number of patterns.

Multi pattern search will process each character of the text exactly once. Algorithms suitable for such a search would be Aho-Corasick, Wu-Manber, or Set Backwards Oracle Matching. These could be found in libraries like Stringsearchalgorithms or byteseek.

// example with StringSearchAlgorithms

AhoCorasick stringSearch = new AhoCorasick(asList("123woods", "woods"));

CharProvider text = new StringCharProvider("I will come and meet you at the woods 123woods and all the woods", 0);

StringFinder finder = stringSearch.createFinder(text);

List<StringMatch> all = finder.findAll();
public class FindTextInLine {
    String match = "123woods";
    String text = "I will come and meet you at the 123woods";

    public void findText () {
        if (text.contains(match)) {
            System.out.println("Keyword matched the string" );
        }
    }
}

You can use regular expressions. Use Matcher and Pattern methods to get the desired output

You can also use regex matching with the \b flag (whole word boundary).

To Match "123woods" instead of "woods" , use atomic grouping in regular expresssion. One thing to be noted is that, in a string to match "123woods" alone , it will match the first "123woods" and exits instead of searching the same string further.

\b(?>123woods|woods)\b

it searches 123woods as primary search, once it got matched it exits the search.

Looking back at the original question, we need to find some given keywords in a given sentence, count the number of occurrences and know something about where. I don't quite understand what "where" means (is it an index in the sentence?), so I'll pass that one... I'm still learning java, one step at a time, so I'll see to that one in due time :-)

It must be noticed that common sentences (as the one in the original question) can have repeated keywords, therefore the search cannot just ask if a given keyword "exists or not" and count it as 1 if it does exist. There can be more then one of the same. For example:

// Base sentence (added punctuation, to make it more interesting):
String sentence = "Say that 123 of us will come by and meet you, "
                + "say, at the woods of 123woods.";

// Split it (punctuation taken in consideration, as well):
java.util.List<String> strings = 
                       java.util.Arrays.asList(sentence.split(" |,|\\."));

// My keywords:
java.util.ArrayList<String> keywords = new java.util.ArrayList<>();
keywords.add("123woods");
keywords.add("come");
keywords.add("you");
keywords.add("say");

By looking at it, the expected result would be 5 for "Say" + "come" + "you" + "say" + "123woods", counting "say" twice if we go lowercase. If we don't, then the count should be 4, "Say" being excluded and "say" included. Fine. My suggestion is:

// Set... ready...?
int counter = 0;

// Go!
for(String s : strings)
{
    // Asking if the sentence exists in the keywords, not the other
    // around, to find repeated keywords in the sentence.
    Boolean found = keywords.contains(s.toLowerCase());
    if(found)
    {
        counter ++;
        System.out.println("Found: " + s);
    }
}

// Statistics:
if (counter > 0)
{
    System.out.println("In sentence: " + sentence + "\n"
                     + "Count: " + counter);
}

And the results are:

Found: Say
Found: come
Found: you
Found: say
Found: 123woods
In sentence: Say that 123 of us will come by and meet you, say, at the woods of 123woods.
Count: 5

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top