What am I doing wrong with my regex?

https://stackoverflow.com/questions/3175598

02-10-2019

Question

I am trying to capture "Rio Grande Do Leste" from:

...
<h1>Rio Grande Do Leste<br />
...

using

var myregexp = /<h1>()<br/;

var nomeAldeiaDoAtaque = myregexp.exec(document);

What am I doing wrong?

Update:

2 questions remain:

1) Searching (document) didn't produce any result, but changing it to (document.body.innerHTML) worked. Why is that?

2) I had to change it to myregexp.exec(document.body.innerHTML)[1] to get what I want; otherwise it gives me a result that includes <h1>. Why is that?

3) (answered) Why do I need to use ".*"? I thought it would collect anything between ()?

Solution

Try /<h1>(.*?)<br/.
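To make the fix concrete, here is a minimal sketch run against a plain string rather than a live page. The html string and the fakeDocument object are stand-ins assumed for illustration; in a browser, document.body.innerHTML would supply the markup. It also bears on the two remaining questions: exec converts a non-string argument to a string before searching, so passing an object searches its string form rather than the markup, and the returned array holds the whole match at index 0 and group 1 at index 1.

```javascript
// Hypothetical stand-in for document.body.innerHTML.
const html = "<body>\n<h1>Rio Grande Do Leste<br />\n</body>";

const myregexp = /<h1>(.*?)<br/;
const result = myregexp.exec(html);

// result[0] is the entire match; result[1] is what group 1 captured.
const wholeMatch = result[0];   // "<h1>Rio Grande Do Leste<br"
const villageName = result[1];  // "Rio Grande Do Leste"

// exec() converts its argument to a string before searching.
// A plain object stands in for `document` here (an assumption for illustration).
const fakeDocument = {};
const searchedText = String(fakeDocument);          // "[object Object]"
const objectResult = myregexp.exec(fakeDocument);   // null: no <h1> in that text

console.log(wholeMatch, "|", villageName, "|", searchedText, "|", objectResult);
```

In a browser the string form of document is similarly a short "[object ...]" label, which would explain why searching it finds nothing.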

OTHER TIPS

On capturing group

A capturing group attempts to capture what it matches. This has some important consequences:

A group that matches nothing can never capture anything.
A group that only matches an empty string can only capture an empty string.
A group that captures repeatedly in a match attempt only gets to keep the last capture.
- Generally true for most flavors, but .NET regex is an exception (see related question)

Here's a simple pattern that contains 2 capturing groups:

(\d+) (cats|dogs)
\___/ \_________/
  1        2

Given i have 16 cats, 20 dogs, and 13 turtles, there are 2 matches (as seen on rubular.com):

16 cats is a match: group 1 captures 16, group 2 captures cats
20 dogs is a match: group 1 captures 20, group 2 captures dogs
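The same two matches can be reproduced in JavaScript. This is only a sketch; the variable names are mine, and matchAll requires the g flag on the pattern.

```javascript
const sentence = "i have 16 cats, 20 dogs, and 13 turtles";
const twoGroups = /(\d+) (cats|dogs)/g;

// Each entry is [whole match, group 1, group 2].
const found = [...sentence.matchAll(twoGroups)].map(m => [m[0], m[1], m[2]]);

console.log(found);
// [ [ '16 cats', '16', 'cats' ], [ '20 dogs', '20', 'dogs' ] ]
```

"13 turtles" produces no match because neither alternative in group 2 matches "turtles".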

Now consider this slight modification on the pattern:

(\d)+ (cats|dogs)
\__/  \_________/
 1         2

Now group 1 matches \d, i.e. a single digit. In most flavors, a group that matches repeatedly (thanks to the + in this case) only gets to keep the last match. Thus, in most flavors, only the last digit that was matched is captured by group 1 (as seen on rubular.com):

16 cats is a match: group 1 captures 6, group 2 captures cats
20 dogs is a match: group 1 captures 0, group 2 captures dogs
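JavaScript is one of the flavors that keeps only the last capture, so the effect is easy to see there. A small sketch, with variable names of my own choosing:

```javascript
const petSentence = "i have 16 cats, 20 dogs, and 13 turtles";
const repeatedGroup = /(\d)+ (cats|dogs)/g;

// The whole match still covers both digits, but group 1 keeps only the last one.
const kept = [...petSentence.matchAll(repeatedGroup)].map(m => ({
  whole: m[0],
  group1: m[1],
  group2: m[2],
}));

console.log(kept);
// [ { whole: '16 cats', group1: '6', group2: 'cats' },
//   { whole: '20 dogs', group1: '0', group2: 'dogs' } ]
```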

References

regular-expressions.info/Use Round Brackets for Capturing
Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
- .NET regex keeps intermediate captures!

On greedy vs reluctant vs negated character class

Now let's consider the problem of matching "everything between A and ZZ". As it turns out, this specification is ambiguous: we will come up with 3 patterns that do this, and they will yield different matches. Which one is "correct" depends on the expectation, which is not properly conveyed in the original statement.

We use the following as input:

eeAiiZooAuuZZeeeZZfff

We use 3 different patterns:

A(.*)ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
- This is the greedy variant; group 1 matched and captured iiZooAuuZZeee
A(.*?)ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
- This is the reluctant variant; group 1 matched and captured iiZooAuu
A([^Z]*)ZZ yields 1 match: AuuZZ (as seen on ideone.com)
- This is the negated character class variant; group 1 matched and captured uu

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
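The three results above can be checked with a short JavaScript sketch. The variants object and its key names are just labels I picked for illustration.

```javascript
const sample = "eeAiiZooAuuZZeeeZZfff";

const variants = {
  greedy: /A(.*)ZZ/,        // takes as much as possible, then backtracks
  reluctant: /A(.*?)ZZ/,    // takes as little as possible, then expands
  negated: /A([^Z]*)ZZ/,    // group 1 may not contain any Z at all
};

const outcome = {};
for (const [name, re] of Object.entries(variants)) {
  const m = re.exec(sample);
  outcome[name] = { match: m[0], group1: m[1] };
}

console.log(outcome);
// greedy:    match "AiiZooAuuZZeeeZZ", group1 "iiZooAuuZZeee"
// reluctant: match "AiiZooAuuZZ",      group1 "iiZooAuu"
// negated:   match "AuuZZ",            group1 "uu"
```

Note that the negated variant skips the first A entirely: starting there, the group would have to cross a single Z before reaching ZZ, which the character class forbids.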

See related question for a more in-depth treatment on the difference between these 3 techniques.

Going back to the question

So let's go back to the question and see what's wrong with pattern:

<h1>()<br
    \/
     1

Group 1 matches only the empty string, so the whole pattern can only match <h1><br (the two tags directly adjacent), and group 1 can only capture the empty string.
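A quick way to see this, sketched in JavaScript with two sample strings I made up for illustration:

```javascript
const emptyGroup = /<h1>()<br/;

// With text between the tags, the pattern cannot match at all.
const onRealMarkup = emptyGroup.exec("<h1>Rio Grande Do Leste<br />");

// It only matches when <br immediately follows <h1>,
// and even then group 1 captures nothing but the empty string.
const onAdjacentTags = emptyGroup.exec("<h1><br />");

console.log(onRealMarkup);        // null
console.log(onAdjacentTags[0]);   // "<h1><br"
console.log(onAdjacentTags[1]);   // ""
```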

One can try to "fix" this in many different ways. The 3 obvious ones to try are:

<h1>(.*)<br; greedy
<h1>(.*?)<br; reluctant
<h1>([^<]*)<br; negated character class

You will find that none of the above "work" all the time; there will be problems with some HTML. This is to be expected: regex is the "wrong" tool for the job. You can try to make the pattern more and more complicated, to get it "right" more often and "wrong" less often. More than likely you'll end up with a horrible mess that no one can understand and/or maintain, and it still probably won't work "right" 100% of the time.
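To illustrate, here is one input per variant that trips it up. The three HTML snippets are hypothetical examples of my own, not taken from the page in the question, and other inputs will break the patterns in other ways.

```javascript
const greedyFix = /<h1>(.*)<br/;
const reluctantFix = /<h1>(.*?)<br/;
const negatedFix = /<h1>([^<]*)<br/;

// Greedy overshoots when a later <br appears on the same line.
const greedyResult = greedyFix.exec("<h1>Title<br />text<br />more");

// "." does not match a newline, so a heading split across lines is missed.
const reluctantResult = reluctantFix.exec("<h1>Rio Grande\nDo Leste<br />");

// [^<]* stops at the first "<", so a nested tag such as <b> defeats it.
const negatedResult = negatedFix.exec("<h1>Rio <b>Grande</b> Do Leste<br />");

console.log(greedyResult[1]);   // "Title<br />text"
console.log(reluctantResult);   // null
console.log(negatedResult);     // null
```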

^(<h1>)(.)+(<br />)

You can test it at gskinner.com.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
