Question

I'm trying to find a good way to get a Scanner to use a given delimiter as a token. For example, I'd like to split up a piece of text into digit and non-digit chunks, so ideally I'd just set the delimiter to \D and set some flag like useDelimiterAsToken, but after briefly looking through the API I'm not coming up with anything. Right now I've had to resort to using combined lookaheads/lookbehinds for the delimiter, which is somewhat painful:

scanner.useDelimiter("((?<=\\d)(?=\\D)|(?<=\\D)(?=\\d))");

This looks for any transition from a digit to a non-digit or vice-versa. Is there a more sane way to do this?


Solution

EDIT: The edited question is so different that my original answer no longer applies. For the record, what you're doing is the ideal way to solve your problem, in my opinion. Your delimiter is the zero-width boundary between a digit and a non-digit, and there's no more succinct way to express that than what you posted.
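
For anyone who wants to try the delimiter from the question, here is a minimal sketch. The class name DigitChunks, the helper chunks, and the sample input are made up for illustration; only the regex comes from the question (with the redundant outer parentheses dropped).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DigitChunks {
    // Zero-width boundary between a digit and a non-digit, as in the question.
    static final String BOUNDARY = "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)";

    // Hypothetical helper: collects every token the Scanner yields for the text.
    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(BOUNDARY);
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chunks("abc123def45")); // prints [abc, 123, def, 45]
    }
}
```

Because the delimiter matches no characters, nothing is consumed between tokens, so the digit runs and non-digit runs come back intact and in order.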

EDIT2: (In response to the question asked in the comment.) You originally asked for an alternative to this regex:

"((?<=\\w)(?=[^\\w])|(?<=[^\\w])(?=\\w))"

That's almost exactly how \b, the word-boundary construct, works:

"(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)"

That is, a position that's either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. Unlike your regex, \b can also match at the very beginning and end of the input. You obviously didn't want that, so I added lookarounds to exclude those conditions:

"(?!^)\\b(?!$)"

It's just a more concise way to do what your regex did. But then you changed the requirement to matching digit/non-digit boundaries, and there's no shorthand for that like \b for word/non-word boundaries.
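
To see the difference concretely, the sketch below lists the positions where each pattern matches. The class name BoundaryPositions, the helper positions, and the sample input are illustrative assumptions. The input is plain ASCII on purpose, since \b and \w may not agree on non-ASCII text, depending on the Java version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Hypothetical helper: returns the start index of every match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "ab, cd";
        // Plain \b also matches at the start and end of the input.
        System.out.println(positions("\\b", input)); // prints [0, 2, 4, 6]
        // The original lookaround regex only matches interior boundaries.
        System.out.println(positions("(?<=\\w)(?=[^\\w])|(?<=[^\\w])(?=\\w)", input)); // prints [2, 4]
        // \b with the two extra lookaheads matches the same interior positions.
        System.out.println(positions("(?!^)\\b(?!$)", input)); // prints [2, 4]
    }
}
```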

Licensed under: CC-BY-SA with attribution