First, you'll need to separate the word-matching code from the string-literal-matching code. For word matching, use:
\w+
Next there's whitespace.
\s+
To match strings as one token, you need to allow more characters than just \w. That only allows alphanumeric characters and _, which means whitespace and symbols are not allowed. You also need to move the starting and ending quotes outside of the square brackets.
And don't forget backslashes to escape characters. You want to allow \" inside of strings.
"(\\.|[^"])+"
Finally, there are the symbols. You could list all the symbols, or you could just treat any non-word, non-whitespace, non-quote character as a symbol. I recommend the latter so you don't choke on other symbols like @ or |. So for symbols:
[^\s\w"]
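If you want to see what the symbol class picks up on its own, here's a rough sketch. The class name SymbolCheck and the helper symbols() are made up for illustration, and the sample input is just something I picked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SymbolCheck {
    // The symbol class from above, escaped for Java source.
    static final Pattern SYMBOL = Pattern.compile("[^\\s\\w\"]");

    // Hypothetical helper: returns every single-character symbol match, in order.
    static List<String> symbols(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = SYMBOL.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Letters, spaces and quotes are skipped; only the symbols are collected.
        System.out.println(symbols("a @ b|c == \"d\""));
    }
}
```

Note that the class matches one character at a time, so a two-character operator like == comes out as two separate = tokens. If you need multi-character operators as single tokens, you'd have to list those explicitly ahead of this alternative.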
Putting the pieces together, we get this combined regex:
\w+|\s+|"(\\.|[^"])+"|[^\s\w"]
Or, escaping everything properly so it can be put into source code:
Pattern pattern = Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");
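To show how that pattern might be used, here's a minimal sketch that pulls tokens out one by one with Matcher.find(). The class name TokenizerSketch, the tokenize() helper and the sample input are my own assumptions, not something from your code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    // Same alternation as above: word | whitespace | quoted string | single symbol.
    private static final Pattern TOKEN =
        Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");

    // Hypothetical helper: collects every match in the order it appears.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The actual text being tokenized is: say("hi \"there\"") + x1
        String input = "say(\"hi \\\"there\\\"\") + x1";
        for (String token : tokenize(input)) {
            // Brackets make the whitespace tokens visible.
            System.out.println("[" + token + "]");
        }
    }
}
```

The quoted string, escaped quotes included, comes back as a single token, and each run of whitespace is its own token that you can filter out afterwards if you don't need it. One thing to watch: because of the +, an empty literal "" isn't matched as a string. Changing the + to * would accept it.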