Question

I am trying to use Java regex to tokenize a source file written in any language. The list I want back should contain:

  • words ([a-z_A-Z0-9])
  • spaces
  • any of [()*.,+-/=&:] as a single character
  • and quoted items left in quotes.

Here is the code I have so far:

Pattern pattern = Pattern.compile("[\"(\\w)\"]+|[\\s\\(\\)\\*\\+\\.,-/=&:]");

Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();

while(matcher.find()) {
    matchlist.add(matcher.group(0));
}

For example,

"I" am_the 2nd "best".

returns: list, size 8

("I", ,am_the, ,2nd, ,"best", .)

which is what I want. However, if the whole sentence is quoted, except for the period:

"I am_the 2nd best".

returns: list, size 8

("I, ,am_the, ,2nd, ,best", .)

and I want it to be able to return: list, size 2

("I am_the 2nd best", .)

I believe the pattern works for everything I need except string literals, which I want returned as a single token with the quotes kept. What am I missing from the pattern that would allow me to achieve this?

And by all means, if there is an easier pattern that I am not seeing, please help me out. The pattern shown above is the result of a lot of trial and error. Thank you very much in advance for any help.


Solution

First, you'll need to separate the word-matching code from the string-literal-matching code. For word matching, use:

\w+

Next there's whitespace.

\s+

To match strings as one token, you need to allow more characters than just \w. That class only matches alphanumeric characters and _, which means whitespace and symbols are excluded. You also need to move the opening and closing quotes outside of the square brackets.

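And don't forget backslash escapes: you want to allow \" inside of strings. The \\. alternative below consumes a backslash together with whatever character follows it, so an escaped quote does not end the string.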

"(\\.|[^"])+"

Finally, there are the symbols. You could list all the symbols, or you could just treat any non-word, non-whitespace, non-quote character as a symbol. I recommend the latter so you don't choke on other symbols like @ or |. So for symbols:

[^\s\w"]
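
The two trickier pieces above can be tried in isolation before combining them. The following is a minimal sketch, not a definitive implementation: the class name PieceDemo and the helper findAll are made up for illustration, and the patterns are the ones from this answer, escaped for Java source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PieceDemo {
    // String-literal pattern from above: "(\\.|[^"])+"
    static final Pattern STRING_LITERAL = Pattern.compile("\"(\\\\.|[^\"])+\"");

    // Symbol pattern from above: [^\s\w"]
    static final Pattern SYMBOL = Pattern.compile("[^\\s\\w\"]");

    // Collects every match of the given pattern, in order of appearance.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Input text is: say "hi \"there\"" now "bye"
        String text = "say \"hi \\\"there\\\"\" now \"bye\"";
        // The escaped quotes stay inside the first literal.
        System.out.println(findAll(STRING_LITERAL, text));

        // Symbols outside the question's original list, such as @ and |, are still caught.
        System.out.println(findAll(SYMBOL, "a@b | c"));
    }
}
```

One thing to be aware of: because the string-literal pattern uses +, an empty literal ("") is not matched by it. If empty strings should count as tokens, replacing + with * is one possible adjustment.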

Putting the pieces together, we get this combined regex:

\w+|\s+|"(\\.|[^"])+"|[^\s\w"]

Or, escaping everything properly so it can be put into source code:

Pattern pattern = Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");
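
As a quick check, here is a small self-contained sketch that runs the combined pattern against both inputs from the question. The class name TokenizerDemo and the method tokenize are hypothetical names chosen for this example; the loop is the same find() loop the question already uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the combined pattern from this answer.
class TokenizerDemo {
    // Alternatives in order: word | whitespace run | quoted string | single symbol
    static final Pattern TOKEN =
            Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");

    // Splits the source text into tokens, keeping quoted strings whole.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(source);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input text: "I" am_the 2nd "best".   (8 tokens)
        System.out.println(tokenize("\"I\" am_the 2nd \"best\".").size());

        // Input text: "I am_the 2nd best".     (2 tokens)
        System.out.println(tokenize("\"I am_the 2nd best\".").size());
    }
}
```

Note that \s+ groups a run of consecutive whitespace into one token, whereas the pattern in the question matched whitespace one character at a time. If each space should be its own token, \s can be used instead.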

OTHER TIPS

The process you're describing is usually called "lexical analysis", and the function that performs it is called a 'lexer': it breaks an input stream into identifiable tokens like words, numbers, spaces, periods, etc.

The output of a lexer is consumed by a 'parser' which does "syntactic analysis" by identifying groups of tokens which belong together, like [double-quote] [word] [double-quote].

I would recommend you follow the same two-pass strategy, since it's been proven time and again in many, many parsers.

So, your first step might be to use this regular expression as your lexer:

\W|\w+

which will break your input text into either single non-word characters (like spaces, double and single quotation marks, commas, periods, etc.) or sequences of one or more word characters where \w is really just a shortcut for [a-zA-Z_0-9].

So, using your example above:

String str = "\"I\" am_the 2nd \"best\".";

String p = "\\W|\\w+";

Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();

while(matcher.find()) {
    matchlist.add(matcher.group(0));
}

produces:

['"', 'I', '"', ' ', 'am_the', ' ', '2nd', ' ', '"', 'best', '"', '.']

which you can then decide how to treat in your code.
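
One way to treat those tokens is a small second pass that glues everything between a pair of double quotes back into a single token. The sketch below is only an illustration under some assumptions: the names TwoPassDemo, lex, and mergeQuoted are made up for this example, backslash-escaped quotes are not handled, and an unterminated quote simply keeps whatever was collected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-pass demo: a regex lexer followed by a tiny merging pass.
class TwoPassDemo {
    static final Pattern LEXER = Pattern.compile("\\W|\\w+");

    // First pass: single non-word characters or runs of word characters.
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = LEXER.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Second pass: merge [double-quote] ... [double-quote] into one token.
    static List<String> mergeQuoted(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        StringBuilder literal = null; // non-null while inside a quoted literal
        for (String token : tokens) {
            if (literal == null) {
                if (token.equals("\"")) {
                    literal = new StringBuilder(token); // opening quote
                } else {
                    merged.add(token);
                }
            } else {
                literal.append(token);
                if (token.equals("\"")) { // closing quote ends the literal
                    merged.add(literal.toString());
                    literal = null;
                }
            }
        }
        if (literal != null) {
            merged.add(literal.toString()); // unterminated literal: keep what we have
        }
        return merged;
    }

    public static void main(String[] args) {
        // Input text: "I am_the 2nd best".
        System.out.println(mergeQuoted(lex("\"I am_the 2nd best\".")));
    }
}
```

Keeping the merging logic in ordinary code rather than in the regular expression makes it easier to add rules later, for example escape handling or single-quoted strings.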

Admittedly, this doesn't give you a single one-size-fits-all regular expression that matches both of the cases you list above. In my experience, though, regular expressions aren't really the best tool for the kind of syntactic analysis you require: either they lack the expressiveness needed to cover all possible cases or, far more likely, they quickly become too complex for anyone but a true regex maven to fully comprehend.

Licensed under: CC-BY-SA with attribution