What am I doing wrong with my regex?
02-10-2019
Question
I am trying to capture "Rio Grande Do Leste" from:
...
<h1>Rio Grande Do Leste<br />
...
using
var myregexp = /<h1>()<br/;
var nomeAldeiaDoAtaque = myregexp.exec(document);
What am I doing wrong?
update:
2 questions remain:
1) Searching (document) didn't produce any result, but changing it to (document.body.innerHTML) worked. Why is that?
2) I had to change it to: myregexp.exec(document.body.innerHTML)[1]; to get what I want, otherwise it would give me a result that includes <h1>. Why is that?
3) (answered) Why do I need to use ".*"? I thought it would collect anything between ()?
Solution
Try /<h1>(.*?)<br/.
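As a rough sketch of how the suggested pattern behaves with exec, the snippet below runs it against a string. The html value is an assumed stand-in for document.body.innerHTML (the real page markup is not shown in the question), and notMarkup is a hypothetical object used only to show that exec searches the string form of whatever it is given.

```javascript
// Assumed stand-in for document.body.innerHTML; not the asker's real page.
const html = '<div><h1>Rio Grande Do Leste<br /></h1></div>';

const match = /<h1>(.*?)<br/.exec(html);

// match[0] is the whole matched text, match[1] is capturing group 1.
const wholeMatch = match[0];  // "<h1>Rio Grande Do Leste<br"
const villageName = match[1]; // "Rio Grande Do Leste"

// exec() converts a non-string argument to a string before searching,
// so an object is searched by its string form, not by any markup it holds.
// Hypothetical stand-in for passing `document` directly:
const notMarkup = { toString() { return '[object Stub]'; } };
const missed = /<h1>(.*?)<br/.exec(notMarkup); // null

console.log(wholeMatch, '|', villageName, '|', missed);
```

This is likely what sits behind the two remaining questions: index 0 of the result always holds the full match including <h1>, and a DOM object passed to exec is matched against its string representation rather than its HTML.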
OTHER TIPS
On capturing groups
A capturing group attempts to capture what it matches. This has some important consequences:
- A group that matches nothing can never capture anything.
- A group that only matches an empty string can only capture an empty string.
- A group that captures repeatedly in a match attempt only gets to keep the last capture.
  - Generally true for most flavors, but .NET regex is an exception (see related question)
Here's a simple pattern that contains 2 capturing groups:
(\d+) (cats|dogs)
\___/ \_________/
  1        2
Given the input i have 16 cats, 20 dogs, and 13 turtles, there are 2 matches (as seen on rubular.com):
- 16 cats is a match: group 1 captures 16, group 2 captures cats
- 20 dogs is a match: group 1 captures 20, group 2 captures dogs
Now consider this slight modification on the pattern:
(\d)+ (cats|dogs)
\__/  \_________/
 1         2
Now group 1 matches \d, i.e. a single digit. In most flavors, a group that matches repeatedly (thanks to the + in this case) only gets to keep the last match. Thus, in most flavors, only the last digit that was matched is captured by group 1 (as seen on rubular.com):
- 16 cats is a match: group 1 captures 6, group 2 captures cats
- 20 dogs is a match: group 1 captures 0, group 2 captures dogs
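The two patterns can be compared side by side in JavaScript, which is one of the flavors that keeps only the last capture of a repeated group. This is a minimal sketch; the captures helper is a name made up for this illustration.

```javascript
const input = 'i have 16 cats, 20 dogs, and 13 turtles';

// Hypothetical helper: collect [group 1, group 2] for every match
// of a pattern that carries the global flag.
function captures(pattern, text) {
  return Array.from(text.matchAll(pattern), (m) => [m[1], m[2]]);
}

// Quantifier inside the group: the group matches the whole number.
const quantifierInside = captures(/(\d+) (cats|dogs)/g, input);
console.log(quantifierInside); // [ [ '16', 'cats' ], [ '20', 'dogs' ] ]

// Quantifier outside the group: the group matches one digit at a time
// and only the last repetition is kept.
const quantifierOutside = captures(/(\d)+ (cats|dogs)/g, input);
console.log(quantifierOutside); // [ [ '6', 'cats' ], [ '0', 'dogs' ] ]
```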
References
- regular-expressions.info/Use Round Brackets for Capturing
- Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
- .NET regex keeps intermediate captures!
On greedy vs reluctant vs negated character class
Now let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.
We use the following as input:
eeAiiZooAuuZZeeeZZfff
We use 3 different patterns:
- A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
  - This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
- A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
  - This is the reluctant variant; group 1 matched and captured iiZooAuu
- A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
  - This is the negated character class variant; group 1 matched and captured uu
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
See related question for a more in-depth treatment on the difference between these 3 techniques.
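The three variants can be checked against the same input with a short JavaScript sketch; the variable names are only for illustration.

```javascript
const sample = 'eeAiiZooAuuZZeeeZZfff';

const greedy = /A(.*)ZZ/.exec(sample);       // .* takes as much as it can
const reluctant = /A(.*?)ZZ/.exec(sample);   // .*? takes as little as it can
const negated = /A([^Z]*)ZZ/.exec(sample);   // [^Z]* can never cross a Z

// Index 0 is the whole match, index 1 is group 1.
console.log(greedy[0], greedy[1]);       // AiiZooAuuZZeeeZZ iiZooAuuZZeee
console.log(reluctant[0], reluctant[1]); // AiiZooAuuZZ iiZooAuu
console.log(negated[0], negated[1]);     // AuuZZ uu
```

Note that the negated variant skips the first A entirely: from there, [^Z]* stops at the single Z, which is not followed by a second Z, so the match can only start at the later A.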
Related questions
- Difference between .*? and .* for regex
  - Greedy vs reluctant vs negated character class, detailed explanation with illustrative examples
Going back to the question
So let's go back to the question and see what's wrong with the pattern:
<h1>()<br
    \/
    1
Group 1 matches the empty string, therefore the whole pattern overall can only match <h1><br, and group 1 can only capture the empty string.
One can try to "fix" this in many different ways. The 3 obvious ones to try are:
- <h1>(.*)<br ; greedy
- <h1>(.*?)<br ; reluctant
- <h1>([^<]*)<br ; negated character class
You will find that none of the above "work" all the time; there will be problems with some HTML. This is to be expected: regex is the "wrong" tool for the job. You can try to make the pattern more and more complicated, to get it "right" more often and "wrong" less often. More than likely you'll end up with a horrible mess that no one can understand and/or maintain, and it still probably won't work "right" 100% of the time.
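To make the "none of them work all the time" point concrete, here is a small sketch where each of the three fixes trips over a different piece of HTML. The three input strings are invented for illustration and do not come from the asker's page; capture is a hypothetical helper.

```javascript
const fixes = {
  greedy: /<h1>(.*)<br/,
  reluctant: /<h1>(.*?)<br/,
  negated: /<h1>([^<]*)<br/,
};

// Invented inputs, each a plausible variation on the heading.
const twoBreaks = '<h1>Rio Grande Do Leste<br />(K45)<br /></h1>';
const nestedTag = '<h1><b>Rio Grande</b> Do Leste<br /></h1>';
const lineBreak = '<h1>Rio Grande\nDo Leste<br /></h1>';

// Hypothetical helper: return group 1, or null when there is no match.
function capture(pattern, text) {
  const m = pattern.exec(text);
  return m === null ? null : m[1];
}

// Greedy runs on to the last <br and drags extra markup along.
console.log(capture(fixes.greedy, twoBreaks));    // Rio Grande Do Leste<br />(K45)

// The negated class cannot cross the < of the nested <b> tag.
console.log(capture(fixes.negated, nestedTag));   // null

// Without the s flag, . does not match a newline.
console.log(capture(fixes.reluctant, lineBreak)); // null
```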