Question

I have a string like this (from wiki markup) that I need to parse in Java:

this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]

I'd like to use a regex to extract the text inside the [[ ]], but not if it is inside parentheses. For the example above, it should return:

notInParen

But ignore:

inParen and this

... since they are inside parentheses. I can find the parentheses and the brackets separately no problem:

.*\(.*?\).* and .*?\[\[(.*?)\]\].*

...but can't figure out how to find the [[ ]], look around for parentheses, and ignore. Thanks!


Solution

This regex works:

\(.*?\)|\[\[(.*?)]]

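Your desired match will be in group 1. The first alternative consumes each parenthesized run whole, so any [[ ]] inside it is never offered to the second alternative; only links outside parentheses populate group 1.

FYI, to make it perform better you can minimize backtracking by replacing each lazy match with a negated character class, e.g. [^)]* in place of .*? inside the parentheses.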

In Java this becomes

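// Requires: java.util.regex.Matcher, java.util.regex.Pattern,
//           java.util.regex.PatternSyntaxException
String resultString = null;
try {
    Pattern regex = Pattern.compile("\\(.*?\\)|\\[\\[(.*?)\\]\\]", Pattern.DOTALL);
    Matcher regexMatcher = regex.matcher(subjectString);
    // Keep scanning: a match of the parenthesized alternative leaves group 1 null,
    // so a single find() would stop at "(is [[ inParen ]] and)" and return nothing.
    while (regexMatcher.find()) {
        if (regexMatcher.group(1) != null) {
            resultString = regexMatcher.group(1); // first link outside parentheses
            break;
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}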

Note that group 1 will be null (not an empty string) for matches where the first part of the alternation matched, so check for null before using it.
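
The negated-character-class tip can be sketched as follows. This is only an illustration, not the original answer's code: the class name NegatedClassLinks and the method extractLinks are made up for the example, and the pattern assumes parentheses are never nested and that a link never contains a single ']'. It also collects every link rather than only the first, and trims the surrounding spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassLinks {
    // Alternative 1 consumes a whole parenthesized run; alternative 2 captures a link.
    // [^)]* and [^\]]* stand in for the lazy .*? of the original pattern.
    // Assumptions: no nested parentheses, no single ']' inside a link.
    private static final Pattern LINK =
            Pattern.compile("\\([^)]*\\)|\\[\\[([^\\]]*)\\]\\]");

    // Hypothetical helper: returns the trimmed text of every [[ ]] outside parentheses.
    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            // group(1) is null when the parenthesized alternative matched.
            if (m.group(1) != null) {
                links.add(m.group(1).trim());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(
                "this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]"));
    }
}
```

With the string from the question, the only captured link should be notInParen.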

OTHER TIPS

Does it need to be done in one go? You can do:

  • Parse the string and remove all substrings contained in parentheses.
  • Parse the result again and take all the desired Wikipedia links with [[ and ]].

This solves the problem and keeps each step simple.

After step 1 you have: this link one is [[ notInParen ]].

After step 2 you have: notInParen.
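
A minimal sketch of the two steps above, assuming parentheses are never nested (a regex like the one used in step 1 cannot handle nesting). The class name TwoPassLinks and the method extractLinks are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPassLinks {
    private static final Pattern LINK = Pattern.compile("\\[\\[(.*?)\\]\\]");

    // Hypothetical helper implementing the two-pass idea.
    static List<String> extractLinks(String text) {
        // Step 1: remove every parenthesized substring (assumes no nesting).
        String withoutParens = text.replaceAll("\\([^)]*\\)", "");

        // Step 2: collect the text of each remaining [[ ... ]] link.
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(withoutParens);
        while (m.find()) {
            links.add(m.group(1).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(
                "this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]"));
    }
}
```

Two passes cost an extra scan of the string, but each pattern stays simple enough to read at a glance.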

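You can also do it without a regex, by scanning the string one character at a time and tracking whether you are inside parentheses: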

String data = "this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]" +
        " this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]";

boolean insideParentheses = false;
int start = 0, end = 0;
for (int i = 0; i < data.length() - 1; i++) {
    if (data.charAt(i) == '(')
        insideParentheses = true;
    if (data.charAt(i) == ')')
        insideParentheses = false;
    // -> [[ and ]] inside Parentheses are not important
    if (!insideParentheses && 
            data.charAt(i) == '[' && data.charAt(i + 1) == '[') {
        start = i;
    }
    if (!insideParentheses && 
            data.charAt(i) == ']' && data.charAt(i + 1) == ']') {
        end = i;
        System.out.println(data.substring(start, end + 2));
    }
}

Output:

[[ notInParen ]]
[[ notInParen ]]
Licensed under: CC-BY-SA with attribution