Question

I don't understand this behavior. In the following example I need to match an HTML comment.

var str = '.. <!--My -- comment test--> ';

var regex1 = /<!--[.]*-->/g;
var regex2 = /<!--.*-->/g;

alert(str.match(regex1));      // null
alert(str.match(regex2));      // <!--My -- comment test--> 

The second regex, regex2, works fine and outputs exactly what's needed. The first returns null, and I don't understand the difference. As I understand it, the regular expressions <!--[.]*--> and <!--.*--> mean the same: "after <!--, take ANY character except a newline, from zero to as many as possible, and finish with -->". But the second works and the first does not. Why?

UPD: I've read the comments and have an update.

var str3 = '.. <!--Mycommenttest--> ';
var str4 = '.. <!--My comment test--> ';

var regex3 = /<!--[\w]*-->/g;
var regex4 = /<!--[\s\S]*-->/g;

alert(str3.match(regex3));        // <!--Mycommenttest-->
alert(str4.match(regex4));        // <!--My comment test-->

So it is possible to use limited matching variables (a character class) to match anything. Which is the right way to write such regexes, then: with [] or without? I can't see the difference, since both give the right output.


Solution

Character class shorthands like \w, \d and \s mean exactly the same inside character classes as outside them, but metacharacters like . typically lose their special meaning inside character classes. That's why /<!--[.]*-->/ didn't work as you expected: [.] matches a literal dot.

But /<!--.*-->/ doesn't really work either, since . doesn't match newlines. In most regex flavors you would use single-line mode to let the dot match all characters including newlines, like this: /<!--.*-->/s or this: (?s)<!--.*-->. JavaScript did not support that feature when this answer was written (the s flag was only added in ES2018), so most people use [\s\S] instead, meaning "any whitespace character or any character that's not whitespace"; in other words, any character.
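To see the newline problem in isolation, here is a minimal sketch. The sample string multiLine and the variable names are made up for illustration, and console.log is used in place of alert so the snippet also runs outside a browser.

```javascript
// Hypothetical sample: one comment that spans two lines.
var multiLine = 'a <!-- first line\nsecond line --> b';

var dotRegex = /<!--.*-->/g;       // . stops at the line break
var anyRegex = /<!--[\s\S]*-->/g;  // [\s\S] also matches the line break

var dotResult = multiLine.match(dotRegex);
var anyResult = multiLine.match(anyRegex);

console.log(dotResult); // null
console.log(anyResult); // [ '<!-- first line\nsecond line -->' ]
```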

But that's not right either, since (as Jason pointed out in his comment) it will greedily match everything from the first <!-- to the last -->, which could encompass several individual comments and all the non-comment material between them. To make it truly correct is probably not worth the effort. When using regexes to match HTML, you have to make many simplifying assumptions anyway; if you can't assume a certain level of well-formedness, you might as well give up. In this case, it should suffice to make the quantifier non-greedy:

var regex5 = /<!--[\s\S]*?-->/g;
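The difference between the greedy and the non-greedy quantifier shows up as soon as the input contains more than one comment. A small sketch, assuming a made-up sample string twoComments:

```javascript
// Hypothetical sample: two comments with ordinary text between them.
var twoComments = '<!--one--> text <!--two-->';

// Greedy: runs from the first <!-- to the last -->.
var greedy = twoComments.match(/<!--[\s\S]*-->/g);

// Non-greedy: stops at the first --> after each <!--.
var lazy = twoComments.match(/<!--[\s\S]*?-->/g);

console.log(greedy); // [ '<!--one--> text <!--two-->' ]
console.log(lazy);   // [ '<!--one-->', '<!--two-->' ]
```

This still relies on the simplifying assumptions mentioned above; it is not a full HTML comment parser.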

Other answers

The dot (.) does not mean "anything" inside a character class. Why would you need a character class to match anything?

RegExpressions <!--[.]*--> and <!--.*--> mean the same

This is not correct.

The brackets [] indicate a character class, where any character in the class may be matched. [.] is the character class which contains the '.' character. Contrast this with ., which is a pre-defined character class taken to mean "any character" (except for line-terminators).

So what you're matching with <!--[.]*--> is either an empty comment or a comment consisting wholly of '.' characters. And what you're matching with <!--.*--> is either an empty comment or a comment filled with any character except linebreaks.
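A quick way to check this is to test both patterns against a few inputs. The sample strings below are made up for illustration; the g flag is left off so that repeated test() calls do not depend on lastIndex.

```javascript
var literalDot = /<!--[.]*-->/;  // zero or more literal '.' characters
var anyChar = /<!--.*-->/;       // zero or more of any character except line breaks

// Hypothetical sample comments.
var empty = '<!---->';
var dots = '<!--...-->';
var words = '<!--My -- comment test-->';

var results = {
  literalDotEmpty: literalDot.test(empty), // true
  literalDotDots: literalDot.test(dots),   // true
  literalDotWords: literalDot.test(words), // false
  anyCharEmpty: anyChar.test(empty),       // true
  anyCharDots: anyChar.test(dots),         // true
  anyCharWords: anyChar.test(words)        // true
};

console.log(results);
```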

The first doesn't work because the two don't mean the same thing. The first matches only the period character: a period isn't a generic match when put inside a [] set. (If you think about it, this makes sense: why would you want to match anything inside a set of limited matching variables?)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow