Question

I am trying to split a string on a certain set of delimiters.
My delimiters are: ,"():;.!? and single or multiple spaces. This is the code I'm currently using:

String[] arrayOfWords= inputString.split("[\\s{2,}\\,\"\\(\\)\\:\\;\\.\\!\\?-]+");

which works fine in most cases, but I have a problem when the first word is surrounded by quotation marks. For example,

String inputString = "\"Word\" some more text.";

is giving me this output:

arrayOfWords[0] = ""
arrayOfWords[1] = "Word"
arrayOfWords[2] = "some"
arrayOfWords[3] = "more"
arrayOfWords[4] = "text"

I want the output to give me an array with

arrayOfWords[0] = "Word"
arrayOfWords[1] = "some"
arrayOfWords[2] = "more"
arrayOfWords[3] = "text"

This code has been working fine when quotation marks are used in the middle of the sentence; I'm not sure what the trouble is when they are at the beginning.

EDIT: I just realized I have the same problem when any of the delimiters is the first character of the string.


Solution

Unfortunately, you won't be able to remove this empty first element using only split. You should probably first remove any leading characters that match your delimiters and then split the result. Also, your regex seems to be incorrect because

  • by adding {2,} inside [...] you are making each of the characters {, 2, comma and } a delimiter,
  • you don't need to escape the rest of your delimiters (note that - doesn't have to be escaped only because it is at the end of the character class [], so it can't be interpreted as a range operator).
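To illustrate the first point, here is a small sketch comparing the question's pattern with a version that drops {2,} and the unneeded escapes. The class name, constant names and sample input are made up for the demonstration.

```java
import java.util.Arrays;

public class CharClassDemo {
    // The question's pattern: inside [...] the text {2,} is read as the
    // literal characters '{', '2', ',' and '}', not as a quantifier.
    static final String ORIGINAL = "[\\s{2,}\\,\"\\(\\)\\:\\;\\.\\!\\?-]+";

    // Same delimiters without {2,} and without the unnecessary escapes.
    static final String SIMPLIFIED = "[\\s,\"():;.!?-]+";

    public static void main(String[] args) {
        String input = "room 2b {x}"; // hypothetical sample input

        // '2', '{' and '}' are swallowed as delimiters here.
        System.out.println(Arrays.toString(input.split(ORIGINAL)));   // [room, b, x]

        // Only whitespace and punctuation from the list split the text.
        System.out.println(Arrays.toString(input.split(SIMPLIFIED))); // [room, 2b, {x}]
    }
}
```

The trailing + after the character class already covers runs of several spaces, so no explicit count is needed.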

Maybe try it this way:

String regexDelimiters = "[\\s,\"():;.!?\\-]+";
String inputString = "\"Word\"  some more text.";
String[] arrayOfWords = inputString.replaceAll(
        "^" + regexDelimiters,"").split(regexDelimiters);

for (String s : arrayOfWords)
    System.out.println("'" + s + "'");

output:

'Word'
'some'
'more'
'text'
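One edge case the snippet above does not cover: if the input consists only of delimiters, stripping the leading run leaves an empty string, and splitting an empty string returns an array holding one empty element. A possible way to guard against that is sketched below; the WordSplitter class and splitWords method are hypothetical names, not part of the original answer.

```java
import java.util.Arrays;

public class WordSplitter {
    private static final String DELIMITERS = "[\\s,\"():;.!?\\-]+";

    // Strips a leading run of delimiters, then splits on the remaining runs.
    // Returns an empty array when nothing but delimiters was present.
    static String[] splitWords(String input) {
        String trimmed = input.replaceFirst("^" + DELIMITERS, "");
        if (trimmed.isEmpty()) {
            return new String[0]; // "".split(...) would give [""] instead
        }
        return trimmed.split(DELIMITERS);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("\"Word\"  some more text."))); // [Word, some, more, text]
        System.out.println(Arrays.toString(splitWords("...!?")));                     // []
    }
}
```

Whether an empty array is the right result for such input depends on the caller, so treat the guard as optional.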

OTHER TIPS

A delimiter is interpreted as separating the strings on either side of it, thus the empty string on its left is added to the result as well as the string to its right ("Word"). To prevent this, you should first strip any leading delimiters, as described here:

How to prevent java.lang.String.split() from creating a leading empty string?

So in short form you would have:

String delim = "[\\s,\"():;.!?\\-]+";
String[] arrayOfWords = inputString.replaceFirst("^" + delim, "").split(delim);
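As a quick check of the explanation above, the following sketch compares a plain split with the strip-then-split form on the question's input. The class and method names are made up for the demonstration.

```java
import java.util.Arrays;

public class LeadingEmptyDemo {
    static final String DELIM = "[\\s,\"():;.!?\\-]+";

    // Plain split: a delimiter at index 0 produces a leading empty string.
    static String[] plainSplit(String input) {
        return input.split(DELIM);
    }

    // Strip the leading delimiter run first, then split.
    static String[] stripThenSplit(String input) {
        return input.replaceFirst("^" + DELIM, "").split(DELIM);
    }

    public static void main(String[] args) {
        String input = "\"Word\" some more text.";

        // Leading empty string is kept; the trailing one (after '.') is dropped.
        System.out.println(Arrays.toString(plainSplit(input)));     // [, Word, some, more, text]
        System.out.println(Arrays.toString(stripThenSplit(input))); // [Word, some, more, text]
    }
}
```

Note that the final "." does not cause the same problem, because split discards trailing empty strings by default.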

Edit: Looking at Pshemo's answer, I realize he is correct regarding your regex. Inside the brackets it's unnecessary to specify the number of space characters, as they will be caught by the + operator.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow