Find text in square brackets but not in parentheses
12-06-2021
Question
If I have a string like this (from a Wiki-markup) that I need to parse in Java:
this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]
I'd like to use a regex to extract the text inside the [[ ]], but only when it is not inside parentheses. For the example above, it should return:
notInParen
But ignore:
inParen and this
... since they are inside parentheses. I can find the parentheses and the brackets separately no problem:
.*\(.*?\).* and .*?\[\[(.*?)\]\].*
...but can't figure out how to find the [[ ]], look around for parentheses, and ignore. Thanks!
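Solution
This regex does the job:
\(.*?\)|\[\[(.*?)]]
The first alternative matches and consumes anything in parentheses, so a [[ ]] inside them never reaches the second alternative. Your desired match will be in group 1.
FYI, to make it perform better, you can reduce backtracking by replacing the lazy match .*? with a negated character class such as [^)]*.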
In Java this becomes
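import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

String resultString = null;
try {
    Pattern regex = Pattern.compile("\\(.*?\\)|\\[\\[(.*?)\\]\\]", Pattern.DOTALL | Pattern.MULTILINE);
    Matcher regexMatcher = regex.matcher(subjectString);
    // Keep scanning: a match of the parenthesis branch leaves group 1 null.
    while (regexMatcher.find()) {
        if (regexMatcher.group(1) != null) {
            resultString = regexMatcher.group(1);
            break; // first [[ ]] found outside parentheses
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}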
Note that group 1 will be null (not an empty string) whenever the first part of the alternation, the parenthesis branch, is the one that matched, so check for null before using it.
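The negated-character-class suggestion above can be sketched roughly as follows. This is only an illustration: the class name BracketLinks and the method extract are made up here, parentheses are assumed not to nest, and [^\]]* assumes the link text itself never contains a ] character. Trimming the captured text is also an assumption about what the caller wants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketLinks {
    // Branch 1 consumes a parenthesized run; branch 2 captures the text inside [[ ]].
    // [^)]* and [^\]]* stand in for the lazy .*? of the original pattern.
    private static final Pattern LINK =
            Pattern.compile("\\([^)]*\\)|\\[\\[([^\\]]*)\\]\\]");

    // Returns every [[ ]] body that is not inside parentheses, trimmed.
    static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            String inner = m.group(1); // null when the parenthesis branch matched
            if (inner != null) {
                links.add(inner.trim());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String s = "this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]";
        System.out.println(extract(s)); // prints [notInParen]
    }
}
```

Looping with while instead of a single find() matters here, because the first match in the sample string is a parenthesized run with no captured group.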
Other tips
Does it need to be done in one go? You can do:
- Parse the string and remove all substrings contained in parentheses.
- Parse the result again and take all the desired Wikipedia links delimited by [[ and ]].
This solves the problem and keeps each step simple.
After step 1 you have: this link one is [[ notInParen ]]
After step 2 you have: notInParen
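The two steps above might look like this in Java. It is a minimal sketch, not a definitive implementation: the class name TwoPassLinks and the method extract are hypothetical, and the first pattern assumes parentheses are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassLinks {
    private static final Pattern LINK = Pattern.compile("\\[\\[(.*?)\\]\\]");

    static List<String> extract(String text) {
        // Step 1: remove every (non-nested) parenthesized substring.
        String stripped = text.replaceAll("\\([^)]*\\)", "");

        // Step 2: collect the text between [[ and ]] in what is left.
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(stripped);
        while (m.find()) {
            links.add(m.group(1).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String s = "this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]";
        System.out.println(extract(s)); // prints [notInParen]
    }
}
```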
You can also do it this way:
String data = "this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]" +
        " this link (is [[ inParen ]] and) (this) one is [[ notInParen ]]";
// Tracks whether the scan is currently inside "( ... )".
// A single flag does not handle nested parentheses.
boolean insideParentheses = false;
int start = 0, end = 0;
// Stop one character early because each test also looks at i + 1.
for (int i = 0; i < data.length() - 1; i++) {
    if (data.charAt(i) == '(') {
        insideParentheses = true;
    }
    if (data.charAt(i) == ')') {
        insideParentheses = false;
    }
    // [[ and ]] inside parentheses are ignored.
    if (!insideParentheses &&
            data.charAt(i) == '[' && data.charAt(i + 1) == '[') {
        start = i; // remember where the link opens
    }
    if (!insideParentheses &&
            data.charAt(i) == ']' && data.charAt(i + 1) == ']') {
        end = i;
        // Print the link including its [[ and ]] delimiters.
        System.out.println(data.substring(start, end + 2));
    }
}
Output:
[[ notInParen ]]
[[ notInParen ]]
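The boolean flag in the loop above resets on the first closing parenthesis, so nested input such as ((a) [[ x ]]) would be misread. One possible way around that, sketched here with a depth counter in place of the flag, is shown below. The names DepthScanner and extract are made up for illustration, and treating an unmatched ) as harmless is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class DepthScanner {
    // Returns each [[ ... ]] (delimiters included) found at parenthesis depth 0.
    static List<String> extract(String data) {
        List<String> links = new ArrayList<>();
        int depth = 0;  // current parenthesis nesting level
        int start = -1; // index of the pending "[[", or -1 if none
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth > 0) {
                    depth--; // ignore an unmatched ')'
                }
            } else if (depth == 0 && data.startsWith("[[", i)) {
                start = i;
                i++; // skip the second '['
            } else if (depth == 0 && start >= 0 && data.startsWith("]]", i)) {
                links.add(data.substring(start, i + 2));
                start = -1;
                i++; // skip the second ']'
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extract("a ((x) [[ in ]]) [[ out ]]")); // prints [[[ out ]]]
    }
}
```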