Why doesn't the RegExp “greedy” mode work?
22-04-2021
Question
I don't understand this behavior. I have the following example, where I need to match an HTML comment.
var str = '.. <!--My -- comment test--> ';
var regex1 = /<!--[.]*-->/g;
var regex2 = /<!--.*-->/g;
alert(str.match(regex1)); // null
alert(str.match(regex2)); // <!--My -- comment test-->
The second regex, regex2, works fine and outputs exactly what's needed. The first shows null, and I don't understand the difference. The regular expressions <!--[.]*--> and <!--.*--> mean the same: "after <!--, take ANY character except newline, from 0 to as many as possible, and finish with -->". But the second works and the first does not. Why?
Update: I've read the comments and have more to add.
var str3 = '.. <!--Mycommenttest--> ';
var str4 = '.. <!--My comment test--> ';
var regex3 = /<!--[\w]*-->/g;
var regex4 = /<!--[\s\S]*-->/g;
alert(str3.match(regex3)); // <!--Mycommenttest-->
alert(str4.match(regex4)); // <!--My comment test-->
So a character class can be used to match anything. Which is the right way to write the regex, then: with [] or without? I can't see the difference, since both give the right output.
Solution
Character class shorthands like \w, \d and \s mean exactly the same inside character classes as outside, but metacharacters like . typically lose their special meaning inside character classes. That's why /<!--[.]*-->/ didn't work as you expected: [.] matches only a literal dot.
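The difference between the dot inside and outside a character class can be checked directly. This is a minimal sketch; the variable names are made up for illustration and are not part of the original question.

```javascript
// Shorthand classes keep their meaning inside [...]; the dot does not.
const dotInClass = /^[.]$/;   // matches only a literal "."
const bareDot = /^.$/;        // matches any single character except a line terminator
const wordInClass = /^[\w]$/; // behaves the same as /^\w$/

const classResults = {
  dotInClassOnLetter: dotInClass.test('a'),   // false: "a" is not a literal dot
  dotInClassOnDot: dotInClass.test('.'),      // true
  bareDotOnLetter: bareDot.test('a'),         // true
  wordInClassOnLetter: wordInClass.test('a'), // true
};
console.log(classResults);
```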
But /<!--.*-->/ doesn't really work either, since . doesn't match newlines. In most regex flavors you would use single-line mode to let the dot match all characters including newlines, like this: /<!--.*-->/s or this: (?s)<!--.*-->. JavaScript historically didn't support that feature (the s flag, called dotAll, was only added in ES2018), so most people use [\s\S] instead, meaning "any whitespace character or any character that's not whitespace", in other words, any character.
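A short sketch of the newline problem, using a made-up two-line comment as input. The last line assumes an engine that implements ES2018 or later, where the s (dotAll) flag is available; on older engines only the [\s\S] form works.

```javascript
// A comment that spans two lines (hypothetical sample input).
const multiLine = '<!-- line one\nline two -->';

// The dot stops at "\n", so no match is found.
const dotMatch = multiLine.match(/<!--.*-->/);

// [\s\S] matches any character, including line terminators.
const anyCharMatch = multiLine.match(/<!--[\s\S]*-->/);

// Assumption: ES2018+ engine. The s flag lets the dot match "\n" too.
const dotAllMatch = multiLine.match(/<!--.*-->/s);

console.log(dotMatch, anyCharMatch && anyCharMatch[0], dotAllMatch && dotAllMatch[0]);
```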
But that's not right either, since (as Jason pointed out in his comment) it will greedily match everything from the first <!-- to the last -->, which could encompass several individual comments and all the non-comment material between them. To make it truly correct is probably not worth the effort. When using regexes to match HTML, you have to make many simplifying assumptions anyway; if you can't assume a certain level of well-formedness, you might as well give up. In this case, it should suffice to make the quantifier non-greedy:
var regex5 = /<!--[\s\S]*?-->/g;
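To see the effect of the non-greedy quantifier, here is a small sketch on an input with two comments. The sample string and variable names are illustrative only.

```javascript
// Two separate comments with ordinary markup between them (hypothetical input).
const html = '<!-- a --> <p>text</p> <!-- b -->';

// Greedy: runs from the first "<!--" to the last "-->", swallowing the <p>.
const greedy = html.match(/<!--[\s\S]*-->/g);

// Non-greedy: stops at the nearest "-->", so each comment is matched separately.
const lazy = html.match(/<!--[\s\S]*?-->/g);

console.log(greedy); // [ '<!-- a --> <p>text</p> <!-- b -->' ]
console.log(lazy);   // [ '<!-- a -->', '<!-- b -->' ]
```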
Other tips
The dot (.) does not mean "anything" inside a character class. Why would you need a character class to match anything?
"RegExpressions <!--[.]*--> and <!--.*--> mean the same"
This is not correct.
The brackets [] indicate a character class, where any character in the class may be matched. [.] is the character class which contains only the '.' character. Contrast this with ., which is a predefined character class taken to mean "any character" (except for line terminators).
So what you're matching with <!--[.]*--> is either an empty comment or a comment consisting wholly of '.' characters. And what you're matching with <!--.*--> is either an empty comment or a comment filled with any characters except line breaks.
The first doesn't work because it doesn't mean the same thing. It matches only the literal period character: the period isn't a wildcard when put inside a [] set. (If you think about it, this makes sense: why would you want a wildcard inside a set meant to limit what can match?)