Question

I am developing an application that processes text files containing emails. I need to extract all tokens from the text, where a token is defined as follows:

  1. Alphanumeric
  2. Case-sensitive (case to be preserved)
  3. '!' and '$' are to be considered constituent characters, e.g. FREE!! and $50 are tokens
  4. '.' (dot) and ',' (comma) are to be considered constituent characters if they occur between digits. For example:

    192.168.1.1, $24,500

    are tokens.

and so on..

Please suggest some open-source tokenizers for Java that are easy to customize to suit my needs. Would simply using StringTokenizer and regex be enough? I also have to perform stopping (stop-word removal), which is why I was looking for an open-source tokenizer that also handles extras such as stopping and stemming.

Solution

A few comments up front:

  • From the StringTokenizer Javadoc: "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."
  • Always use Google first. The first result as of now is JTopas. I have not used it, but it looks like it could work for this.
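To illustrate the split method recommended above, here is a minimal sketch of plain whitespace splitting. The class and method names are made up for this example. Splitting on whitespace keeps FREE!! and $24,500 intact, but it also keeps the stray "---" and leaves trailing punctuation attached to ordinary words (for instance "Hello," stays one piece), so matching on the shape of a token with a regex is probably the better fit here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of any library.
final class SplitDemo {
  // Splits on runs of whitespace; trim() avoids an empty first element
  // when the text starts with whitespace.
  static List<String> splitOnWhitespace(String text) {
    return Arrays.asList(text.trim().split("\\s+"));
  }

  public static void main(String[] args) {
    for (String piece : splitOnWhitespace("--- FREE!! $50 192.168.1.1 $24,500")) {
      System.out.println("piece: " + piece);
    }
  }
}
```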

As for regex, it really depends on your requirements. Given the above, this might work:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

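public class Mkt {
  public static void main(String[] args) {
    // First alternative: runs of '$', digits, '.' and ',' (numbers, amounts, IPs).
    // Second alternative: runs of word characters plus '!' and '$'.
    Pattern p = Pattern.compile("([$\\d.,]+)|([\\w\\d!$]+)");
    String str = "--- FREE!! $50 192.168.1.1 $24,500";
    System.out.println("input: " + str);

    // find() walks the input and returns each successive match as a token.
    Matcher m = p.matcher(str);
    while (m.find()) {
      System.out.println("token: " + m.group());
    }
  }
}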

Here's a sample run:

$ javac Mkt.java && java Mkt
input: --- FREE!! $50 192.168.1.1 $24,500
token: FREE!!
token: $50
token: 192.168.1.1
token: $24,500
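The question also mentions stopping (stop-word removal). One simple option, sketched here under assumptions, is to filter the matcher's tokens against a set. The class name, method name, and the tiny stop-word list are placeholders, and comparing in lower case while keeping the token's original case is my assumption about what is wanted. The pattern is the same one used in Mkt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class StopFilter {
  // Pattern taken from the Mkt example.
  private static final Pattern TOKEN = Pattern.compile("([$\\d.,]+)|([\\w\\d!$]+)");

  // Placeholder stop-word list; a real one would be much longer.
  private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("a", "an", "the", "to", "of", "and"));

  static List<String> tokenizeWithoutStopWords(String text) {
    List<String> kept = new ArrayList<>();
    Matcher m = TOKEN.matcher(text);
    while (m.find()) {
      String token = m.group();
      // Compare in lower case, but keep the token's original case in the result.
      if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
        kept.add(token);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    System.out.println(tokenizeWithoutStopWords("Get the FREE!! offer and win $50"));
  }
}
```

Stemming would be a further step applied to each kept token and is not covered by this sketch.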

Now, you might need to tweak the regex, for example:

  • You gave $24,500 as an example. Should this work for $24,500abc or $24,500EUR?
  • You mentioned 192.168.1.1 should be included. Should it also include 192,168.1,1 (given . and , are to be included)?

and I guess there are other things to consider.
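One such tweak, as a sketch: the first alternative of the pattern above, [$\d.,]+, also matches a lone "," or "." in ordinary prose, so "Hello, world." would produce "," and "." as tokens. The variant below accepts '.' and ',' only when a digit sits on both sides, using a lookbehind and a lookahead. The class name is made up, and restricting constituents to ASCII letters and digits (\p{Alnum}) rather than \w is my reading of "alphanumeric" in the question. Note that it treats $24,500EUR as a single token, which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch, not a drop-in library.
final class StrictTokenizer {
  // A run of ASCII letters, digits, '!' or '$' ...
  private static final String RUN = "[\\p{Alnum}!$]+";

  // ... optionally continued by '.' or ',' plus another run,
  // but only when the separator has a digit directly before and after it.
  private static final Pattern TOKEN =
      Pattern.compile(RUN + "(?:(?<=\\d)[.,](?=\\d)" + RUN + ")*");

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    Matcher m = TOKEN.matcher(text);
    while (m.find()) {
      tokens.add(m.group());
    }
    return tokens;
  }

  public static void main(String[] args) {
    String str = "Hello, world. FREE!! $50 192.168.1.1 $24,500 end.";
    for (String token : tokenize(str)) {
      System.out.println("token: " + token);
    }
  }
}
```

For the sample string in main, this should yield the tokens Hello, world, FREE!!, $50, 192.168.1.1, $24,500 and end, with the sentence punctuation dropped.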

Hope this helps to get you started.

Licensed under: CC-BY-SA with attribution