Question

I want to split a string on commas, excluding commas that are inside double quotes. Adjacent commas should also be treated as separators, so that they produce separate (empty) tokens.

I am able to split on commas outside double quotes using the regex [^,\"']+|\"([^\"]*)\"

but it fails to tokenize properly when there are adjacent commas. For example, for the string

one,two,three,four,"five1,five2", six ,seven,"eight1,eight2","nine",,eleven

output should be

one
two
three
four
five1,five2
six
seven
eight1,eight2
nine

eleven

please help


Solution

If all of your quotes are matched, every comma you want to split at will be followed by an even number of " characters. So you can use a lookahead and pass this pattern to myString.split(pattern, -1):

,(?=(?:(?:[^\"]*\"){2})*[^\"]*$)

This will only match if there is an even number of " between the comma in question and the end of the string.

Note that the -1 argument for split is important; otherwise trailing empty strings will be omitted.
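
As a minimal sketch of how this could be wired up, the following applies the pattern to the example string from the question. The class name QuotedCsvSplit and the helper splitLine are made up for illustration. The split itself leaves the surrounding quotes and whitespace on each token, so the trimming and quote-stripping step is an added assumption based on the output the question asks for.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvSplit {
    // A comma followed by an even number of double quotes up to the end of the input.
    private static final String SPLIT_PATTERN = ",(?=(?:(?:[^\"]*\"){2})*[^\"]*$)";

    // Hypothetical helper: splits one line and tidies each token.
    static List<String> splitLine(String line) {
        List<String> tokens = new ArrayList<>();
        // The limit of -1 keeps empty tokens, including trailing ones.
        for (String raw : line.split(SPLIT_PATTERN, -1)) {
            // Assumption: surrounding whitespace and one pair of enclosing quotes are unwanted.
            String token = raw.trim();
            if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
                token = token.substring(1, token.length() - 1);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "one,two,three,four,\"five1,five2\", six ,seven,"
                + "\"eight1,eight2\",\"nine\",,eleven";
        // Brackets make the empty token between the adjacent commas visible.
        for (String token : splitLine(input)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

For the example string this yields eleven tokens, with an empty string in the tenth position for the adjacent commas.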

Side note: I don't know how well the Java regex engine optimizes this, so the lookahead might be quite inefficient when it fails, because the engine backtracks unnecessarily before giving up. If you experience performance issues, try making the quantifiers possessive:

,(?=(?:(?:[^\"]*+\"){2})*+[^\"]*+$)

Possessive quantifiers never give back what they have matched, so this will stop the engine from backtracking inside the lookahead.
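
A small sketch for checking that the possessive variant splits the same way as the original pattern. The class name PossessiveSplit and the sample input are made up for illustration, and it says nothing about actual performance, which would need to be measured on real data.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PossessiveSplit {
    // Original lookahead with ordinary (greedy) quantifiers.
    static final Pattern GREEDY =
            Pattern.compile(",(?=(?:(?:[^\"]*\"){2})*[^\"]*$)");
    // Same lookahead with possessive quantifiers (*+).
    static final Pattern POSSESSIVE =
            Pattern.compile(",(?=(?:(?:[^\"]*+\"){2})*+[^\"]*+$)");

    // Compiling once and reusing the Pattern avoids recompiling on every call.
    static String[] split(Pattern pattern, String line) {
        return pattern.split(line, -1); // -1 keeps trailing empty strings
    }

    public static void main(String[] args) {
        String input = "a,\"b,c\",,d"; // made-up sample input
        System.out.println(Arrays.toString(split(GREEDY, input)));
        System.out.println(Arrays.toString(split(POSSESSIVE, input)));
    }
}
```

Both calls should produce the same four tokens for this input, with the quoted field kept intact and an empty token between the adjacent commas.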

Licensed under: CC-BY-SA with attribution