Question

If I have a String which is delimited by a character, let's say this:

a-b-c

and I want to keep the delimiters, I can use look-behind and look-ahead to split around them instead of consuming them, like:

string.split("((?<=-)|(?=-))");

which results in

  • a
  • -
  • b
  • -
  • c
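
For reference, a minimal runnable sketch of the split above. The class and method names (KeepDelimiters, splitKeepingDashes) are made up for illustration and are not part of the original question.

```java
import java.util.Arrays;

public class KeepDelimiters {
    // Splits at every zero-width position that is directly after
    // or directly before a '-', so the dashes survive as tokens.
    static String[] splitKeepingDashes(String input) {
        return input.split("((?<=-)|(?=-))");
    }

    public static void main(String[] args) {
        // prints [a, -, b, -, c]
        System.out.println(Arrays.toString(splitKeepingDashes("a-b-c")));
    }
}
```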

Now, if one of the delimiters is escaped, like this:

a-b\-c

and I want to honor the escape, I worked out a regex like this:

((?<=-(?!(?<=\\-)))|(?=-(?!(?<=\\-))))

ergo

string.split("((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))");

Now, this works and results in:

  • a
  • -
  • b\-c

(I'd remove the backslash afterwards with string.replace("\\", ""); I haven't found a way to do that in the regex itself.)

My problem is one of understanding.
The way I understood it, the regex would be, in words,

split ((if '-' is before (unless ('\-' is before))) or (if '-' is after (unless ('\-' is before))))

Why shouldn't the last part be "unless \ is before"? If '-' is after, that means we're between '\' and '-', so only \ should be before, not \\-, but it doesn't work if I change the regex to reflect that like this:

((?<=-(?!(?<=\\-)))|(?=-(?!(?<=\\))))

Result: a, -, b\, -c

What is the reason for this? Where is my error in reasoning?


Solution 2

Why shouldn't the last part be "unless \ is before"?

In

(?=-(?!(?<=\\-)))
    ^here

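the cursor is already after the -, because the lookahead has matched it by the time the nested lookbehind runs. The character directly before the current position is therefore always -, never \, so a check for "\ is before" can never succeed there. The "unless" never triggers and the escaped - gets split like any other. To see the backslash from that position you have to look back across both characters, which is why \\- is needed.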
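
A small sketch that runs both variants from the question side by side may make this easier to see. The class and constant names are hypothetical; the two patterns are the ones quoted in the question, written as Java string literals.

```java
import java.util.Arrays;

public class NestedLookbehindDemo {
    // Original: inside the lookahead the '-' is matched first, so the nested
    // lookbehind has to look back across "\-" to find the backslash.
    static final String ORIGINAL = "((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))";

    // Modified: the nested lookbehind inspects only the one character before
    // the current position, which is always the '-' that was just matched.
    static final String MODIFIED = "((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\))))";

    public static void main(String[] args) {
        String input = "a-b\\-c"; // the text a-b\-c
        // prints [a, -, b\-c]
        System.out.println(Arrays.toString(input.split(ORIGINAL)));
        // prints [a, -, b\, -c]
        System.out.println(Arrays.toString(input.split(MODIFIED)));
    }
}
```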


Maybe an easier regex would be

(?<=(?<!\\\\)-)|(?=(?<!\\\\)-)

  • (?<=(?<!\\\\)-) checks whether we are after a - that has no \ before it.
  • (?=(?<!\\\\)-) checks whether we are before a - that has no \ before it.
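
As a rough sketch of how this could be used end to end, the following splits with the regex above and then strips the backslashes with the replace call mentioned in the question. EscapedSplit, UNESCAPED_DASH and tokenize are illustrative names only, and removing every backslash is a simplification that assumes backslashes occur only as escapes for -.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapedSplit {
    // Zero-width positions next to a '-' that is not preceded by a backslash.
    static final String UNESCAPED_DASH = "(?<=(?<!\\\\)-)|(?=(?<!\\\\)-)";

    // Splits around unescaped dashes, keeps them as tokens,
    // then drops the backslashes from each token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : input.split(UNESCAPED_DASH)) {
            tokens.add(part.replace("\\", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [a, -, b-c]
        System.out.println(tokenize("a-b\\-c"));
    }
}
```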

OTHER TIPS

While this does not really answer the question, this explains how lookarounds work.

Lookarounds are anchors: they do not consume text, but find a position in the input text. Your regex can be written much more simply:

(?<=-)(?<!\\-)|(?=-)(?<!\\)

You have three of the four kinds of lookaround here: positive lookbehind, negative lookbehind and positive lookahead. Only negative lookahead, (?!...), is not used.

The full regex reads:

(?<=-)            # Find a position where what precedes is a dash
(?<!\\-)          # Find a position where what precedes is not \-
|                 # Or
(?=-)             # Find a position where what follows is a dash
(?<!\\)           # Find a position where what precedes is not a \

Note the term "position". Note that an anchor will not advance in the text at all.

Now, if we try and match that regex against a-b\-c:

# Step 1
# Input:    | a-b\-c|
# Position: |^      |
# Regex:    | (?<=-)(?<!\\-)|(?=-)(?<!\\)|
# Position: |^                           |
# No match, try other alternative
# Input:    | a-b\-c|
# Position: |^      |
# Regex:    |(?<=-)(?<!\\-)| (?=-)(?<!\\)|
# Position: |               ^            |
# No match, regex fails
# Advance one position in the input text and try again

# Step 2
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    | (?<=-)(?<!\\-)|(?=-)(?<!\\)|
# Position: |^                           |
# No match, try other alternative
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    |(?<=-)(?<!\\-)| (?=-)(?<!\\)|
# Position: |               ^            |
# Match: a "-" follows
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    |(?<=-)(?<!\\-)|(?=-) (?<!\\)|
# Position: |                    ^       |
# Match: what precedes is not a \
# Input:    |a -b\-c|
# Position: | ^     |
# Regex:    |(?<=-)(?<!\\-)|(?=-)(?<!\\) |
# Position: |                           ^|
# Regex is satisfied
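
The "position" idea can also be checked in code. The sketch below, with the hypothetical names LookaroundPositions and matchPositions, lists every index at which the regex is satisfied and verifies that each match is empty, i.e. that nothing was consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundPositions {
    // Returns the index of every position where the regex is satisfied.
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // Lookarounds consume nothing, so start() and end() coincide.
            if (m.start() != m.end()) {
                throw new IllegalStateException("match consumed text");
            }
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String regex = "(?<=-)(?<!\\\\-)|(?=-)(?<!\\\\)";
        // prints [1, 2]: just before and just after the unescaped dash
        System.out.println(matchPositions(regex, "a-b\\-c"));
    }
}
```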

Here is an alternative which uses neither split nor lookarounds:

[a-z]+(\\-[a-z]+)*|-

You can use this regex in a Pattern and use a Matcher:

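import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main
{
    public static void main(final String... args)
    {
        // Letters, optionally continued by escaped dashes, or a lone dash
        final Pattern pattern
            = Pattern.compile("[a-z]+(\\\\-[a-z]+)*|-");

        final Matcher m = pattern.matcher("a-b\\-c");
        while (m.find())
            System.out.println(m.group());
    }
}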
Licensed under: CC-BY-SA with attribution