Why doesn't the RegExp “greedy” mode work?

https://stackoverflow.com/questions/9133091

22-04-2021

Question

I don't understand this behavior. In the following example, I need to match an HTML comment.

var str = '.. <!--My -- comment test--> ';

var regex1 = /<!--[.]*-->/g;
var regex2 = /<!--.*-->/g;

alert(str.match(regex1));      // null
alert(str.match(regex2));      // <!--My -- comment test-->

The second regex, regex2, works fine and outputs exactly what's needed. The first returns null, and I don't understand the difference. To me, the RegExpressions /<!--[.]*-->/ and /<!--.*-->/ mean the same thing: after <!--, match any run of characters up to -->. But the second works and the first does not. Why?

UPD. I've read comments and have an update.

var str3 = '.. <!--Mycommenttest--> ';
var str4 = '.. <!--My comment test--> ';

var regex3 = /<!--[\w]*-->/g;
var regex4 = /<!--[\s\S]*-->/g;

alert(str3.match(regex3));        // <!--Mycommenttest-->
alert(str4.match(regex4));        // <!--My comment test-->

So it's possible to use a limited set of matching characters to match anything. Which is the right way to write such RegExps, with [] or without? I can't see the difference; both give the right output.

Solution

Character class shorthands like \w, \d and \s mean exactly the same inside character classes as outside, but metacharacters like . typically lose their special meaning inside character classes. That's why /<!--[.]*-->/ didn't work as you expected: [.] matches only a literal dot.
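To make that concrete, here is a minimal sketch that relies only on standard JavaScript RegExp behavior. The variable names are invented for illustration.

```javascript
// Shorthands such as \w keep their meaning inside a character class.
const wordOutside = /^\w+$/.test('abc_123');   // true
const wordInside = /^[\w]+$/.test('abc_123');  // true

// The dot is different: outside a class it is a wildcard,
// inside a class it is just a literal period.
const dotOutside = /^.$/.test('x');    // true: any character except a line terminator
const dotInside = /^[.]$/.test('x');   // false: 'x' is not a period
const dotLiteral = /^[.]$/.test('.');  // true: a period matches itself

console.log(wordOutside, wordInside, dotOutside, dotInside, dotLiteral);
```

So wrapping \w in brackets changes nothing, while wrapping . in brackets turns the wildcard into an ordinary character.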

But /<!--.*-->/ doesn't really work either, since . doesn't match newlines. In most regex flavors you would use single-line mode to let the dot match all characters including newlines, like this: /<!--.*-->/s or this: (?s)<!--.*-->. JavaScript did not support that feature until ES2018 added the s (dotAll) flag, so most people use [\s\S] instead, meaning "any whitespace character or any character that's not whitespace"; in other words, any character.
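The newline problem is easy to reproduce. The sketch below uses a made-up two-line comment; the last line assumes an engine with ES2018 support for the s flag, so treat it as optional.

```javascript
// A hypothetical comment that spans two lines.
const multiLine = '<!-- first line\nsecond line -->';

// The plain dot stops at the line break, so the closing --> is never reached.
const withDot = multiLine.match(/<!--.*-->/);

// [\s\S] means "whitespace or non-whitespace", so it also matches '\n'.
const withClass = multiLine.match(/<!--[\s\S]*-->/);

// Assumption: an ES2018+ engine, where the s (dotAll) flag lets . match '\n'.
const withDotAll = multiLine.match(/<!--.*-->/s);

console.log(withDot);        // null
console.log(withClass[0]);   // the whole two-line comment
console.log(withDotAll[0]);  // the whole two-line comment
```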

But that's not right either, since (as Jason pointed out in his comment) it will greedily match everything from the first <!-- to the last -->, which could encompass several individual comments and all the non-comment material between them. To make it truly correct is probably not worth the effort. When using regexes to match HTML, you have to make many simplifying assumptions anyway; if you can't assume a certain level of well-formedness, you might as well give up. In this case, it should suffice to make the quantifier non-greedy:

var regex5 = /<!--[\s\S]*?-->/g;
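As a quick illustration of the difference, assuming an input with two separate comments (the sample string is made up):

```javascript
// Hypothetical input: two comments with ordinary text between them.
const html = '<!--a--> text <!--b-->';

// Greedy: [\s\S]* runs to the end of the string, then backtracks to the LAST -->.
const greedy = html.match(/<!--[\s\S]*-->/g);

// Non-greedy: [\s\S]*? stops at the FIRST --> after each <!--.
const lazy = html.match(/<!--[\s\S]*?-->/g);

console.log(greedy);  // [ '<!--a--> text <!--b-->' ]
console.log(lazy);    // [ '<!--a-->', '<!--b-->' ]
```

The non-greedy version still matches the string from the question in full, because the "-- " inside that comment is not followed by ">".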

Other tips

The dot (.) does not mean "anything" inside a character class. Why would you need a character class to match anything?

Regarding "/<!--[.]*-->/ and /<!--.*-->/ mean the same":

This is not correct.

The brackets [] indicate a character class, where any character in the class may be matched. [.] is the character class which contains the '.' character. Contrast this with ., which is a pre-defined character class taken to mean "any character" (except for line-terminators).

So what you're matching with /<!--[.]*-->/ is either an empty comment or a comment consisting wholly of '.' characters. And what you're matching with /<!--.*-->/ is either an empty comment or a comment filled with any characters except line breaks.
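A short check of that claim, using three sample comments chosen for illustration (the names dotsOnly, anyChars and samples are not from the question):

```javascript
const dotsOnly = /<!--[.]*-->/;  // the question's regex1, without the g flag
const anyChars = /<!--.*-->/;    // the question's regex2, without the g flag

const samples = ['<!---->', '<!--...-->', '<!--My -- comment test-->'];

// Record, for each sample, whether each pattern matches it.
const results = samples.map((s) => [s, dotsOnly.test(s), anyChars.test(s)]);

for (const [sample, byDotsOnly, byAnyChars] of results) {
  console.log(sample, byDotsOnly, byAnyChars);
}
// <!----> true true
// <!--...--> true true
// <!--My -- comment test--> false true
```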

The first doesn't work because it doesn't mean the same thing. It matches only the literal period character. The period isn't a generic wildcard when put inside a [] set. (And if you think about it, this makes sense: why would you want to match anything inside a set of limited matching variables?)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow