Returning Java regex (words, spaces, special characters, double quotes)

Question 1

First, you'll need to separate the word-matching code from the string-literal-matching code. For word matching, use:

\w+

Next there's whitespace.

\s+

To match a string as one token, you need to allow more characters than just \w. That class only matches alphanumeric characters and _, so whitespace and symbols inside the quotes are rejected. You also need to move the opening and closing quotes outside of the square brackets.

And don't forget backslash escapes. You want to allow \" inside of strings, so try "a backslash followed by any character" first and fall back to "any non-quote character".

"(\\.|[^"])+"

Finally, there are the symbols. You could list all the symbols, or you could just treat any non-word, non-whitespace, non-quote character as a symbol. I recommend the latter so you don't choke on other symbols like @ or |. So for symbols:

[^\s\w"]
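To see what each piece accepts on its own, here is a minimal sketch that classifies a single candidate token. The class name TokenKinds and the method kind are made up for illustration, and the patterns are the ones above, escaped for Java string literals. String.matches only succeeds when the whole string matches.

```java
class TokenKinds {
    // Returns which of the four pieces matches the entire token, or "unknown".
    static String kind(String token) {
        if (token.matches("\\w+")) return "word";
        if (token.matches("\\s+")) return "whitespace";
        if (token.matches("\"(\\\\.|[^\"])+\"")) return "string";
        if (token.matches("[^\\s\\w\"]")) return "symbol";
        return "unknown";
    }

    public static void main(String[] args) {
        // The third argument is the Java literal for: "a \" b"
        for (String t : new String[] {"am_the", "  ", "\"a \\\" b\"", "@"}) {
            System.out.println(t + " -> " + kind(t));
        }
    }
}
```

A lone double quote falls through to "unknown", because the symbol class deliberately excludes it.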

Putting the pieces together, we get this combined regex:

\w+|\s+|"(\\.|[^"])+"|[^\s\w"]

Or, escaping everything properly so it can be put into source code:

Pattern pattern = Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");
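As a rough sketch of how that pattern might be used, the loop below collects every match into a list. The class name Tokenizer, the method tokenize, and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // The combined regex: word | whitespace | quoted string | single symbol.
    static final Pattern TOKEN =
        Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");

    // Returns the matched tokens in the order they appear in the input.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input: "I" am_the 2nd "best".
        for (String t : tokenize("\"I\" am_the 2nd \"best\".")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

One thing to watch for: find() silently skips input that no alternative matches, such as an unterminated quote or an empty string "". If that matters, you would need to check that each match starts where the previous one ended.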

Question 2

Typically, when parsing text, the process you're describing is called "lexical analysis", and the function that does it is called a "lexer". It breaks an input stream into identifiable tokens such as words, numbers, spaces, and periods.

The output of a lexer is consumed by a "parser", which does "syntactic analysis" by identifying groups of tokens that belong together, like [double-quote] [word] [double-quote].

I would recommend you follow the same two-pass strategy, since it has proven itself time and again in a great many parsers.

So, your first step might be to use this regular expression as your lexer:

\W|\w+

which will break your input text into either single non-word characters (like spaces, double and single quotation marks, commas, periods, etc.) or sequences of one or more word characters, where \w is really just shorthand for [a-zA-Z_0-9].

So, using your example above:

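import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The sample text, written as a Java string literal with escaped quotes.
String str = "\"I\" am_the 2nd \"best\".";

String p = "\\W|\\w+";

Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<>();

// Each find() yields the next token: one non-word character or a run of word characters.
while (matcher.find()) {
    matchlist.add(matcher.group());
}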

leaves matchlist holding these tokens:

['"', 'I', '"', ' ', 'am_the', ' ', '2nd', ' ', '"', 'best', '"', '.']

which you can then decide how to treat in your code.
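For example, a second pass could merge everything between a pair of double-quote tokens back into a single string token. This is only a sketch: the class name QuoteGrouper and its methods are hypothetical, backslash escapes are not handled, and emitting an unterminated quote as-is is an assumption about what you would want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteGrouper {
    // First pass: the \W|\w+ lexer described above.
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\W|\\w+").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Second pass: merge the tokens between two '"' tokens into one token.
    static List<String> groupQuotes(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder quoted = null; // non-null while inside a quoted string
        for (String t : tokens) {
            if (t.equals("\"")) {
                if (quoted == null) {
                    quoted = new StringBuilder("\""); // opening quote
                } else {
                    out.add(quoted.append('"').toString()); // closing quote
                    quoted = null;
                }
            } else if (quoted != null) {
                quoted.append(t);
            } else {
                out.add(t);
            }
        }
        // Assumption: an unterminated quote is emitted as-is rather than rejected.
        if (quoted != null) {
            out.add(quoted.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input: "I" am_the 2nd "best".
        System.out.println(groupQuotes(lex("\"I\" am_the 2nd \"best\".")));
    }
}
```

Keeping the grouping logic in ordinary code rather than in the regex makes it easy to change how quotes, escapes, or errors are treated later.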

No, this doesn't give you a single one-size-fits-all regular expression that matches both of the cases you list. In my experience, though, regular expressions aren't really the best tool for the kind of syntactic analysis you require: either they lack the expressiveness needed to cover all possible cases or, far more likely, they quickly become too complex for all but the true regex maven to fully comprehend.