I need a regex that matches CDATA elements in HTML

https://stackoverflow.com/questions/21681861

09-10-2022

Question

I'm trying to write a regular expression to match CDATA elements in HTML in a web crawler class in C#.

What I have used in the past is \<\!\[CDATA\[(?<text>[^\]]*)\]\]\> , but it breaks when the CDATA section contains JavaScript with array brackets ([]). The negated character class is necessary because, when there are multiple CDATA sections, I want to match each of them separately.

If I modify the regex to stop at the closing '>' character instead, I have the same problem: any JavaScript that uses the > operator breaks my regex.

So I think I need a negative look-ahead within this regex so that the text group stops only at ']]>'. How would I write this?

Here's some test data for a quick setup of the problem:

        // Requires: using System.Text.RegularExpressions;
        // Matches a CDATA section only if its body contains no ']' character
        string pattern = @"\<\!\[CDATA\[(?<text>[^\]]*)\]\]\>";
        var rx = new Regex(pattern, RegexOptions.Singleline);

        // Test data: several CDATA sections, the last one containing JavaScript
        string eg = @"<![CDATA[TesteyMcTest//]]><![CDATA[TesteyMcTest2//]]><![CDATA[TesteyMcTest//]]><!             [CDATA[TesteyMcTest2//]]>
         <![CDATA[Thisisal3ongarbi4trarys6testwithnumbers//]]><![CDATA             [thisisalo4ngarbitrarytest6withumbers123456//]]><![CDATA[ this.exec = (function(){ var x =              this.GetFakeArray(); var y = x[0]; return y > 3;});//]]> ";

        var mz = rx.Matches(eg);

This pattern matches every CDATA section in the example except the last one, which contains JavaScript with ']' and '>' characters.

Thanks in advance,

Solution

The problem is that your <text> subpattern is wrong. You don't need to avoid every ]; you only need to avoid a ] that is followed by ]>. You can use this subpattern instead:

(?<text>(?>[^]]+|](?!]>))*)

The whole pattern (note that many of the characters don't need to be escaped):

@"<!\s*\[CDATA\s*\[(?<text>(?>[^]]+|](?!]>))*)]]>"

I added two \s* to match all your example strings, but if you want to disallow these optional spaces, you can remove the \s*.
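As a rough illustration of how the pattern above might be used from C#, here is a minimal sketch. The class name CdataRegexSketch and the method ExtractBodies are hypothetical names chosen for this example, and every ] is written escaped here purely for readability; the matching logic is intended to be the same as in the pattern above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class CdataRegexSketch
{
    // (?>...) is an atomic group. Inside it, [^\]]+ consumes runs of
    // non-bracket characters, and \](?!\]>) accepts a single ']' only
    // when it is not immediately followed by "]>".
    private static readonly Regex Cdata = new Regex(
        @"<!\s*\[CDATA\s*\[(?<text>(?>[^\]]+|\](?!\]>))*)\]\]>");

    // Returns the body (the "text" group) of every CDATA section found.
    public static List<string> ExtractBodies(string input)
    {
        var bodies = new List<string>();
        foreach (Match m in Cdata.Matches(input))
        {
            bodies.Add(m.Groups["text"].Value);
        }
        return bodies;
    }
}
```

Because a negated character class also matches line breaks, this sketch does not rely on RegexOptions.Singleline. Calling CdataRegexSketch.ExtractBodies on the test string from the question should yield one entry per CDATA section, including the one that contains JavaScript.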

Other tips

Does the following work for you? http://regex101.com/r/cT0pT0

\[CDATA\[(.*?)\]\]>

It seems to match what you are asking for. The key is that .*? (a non-greedy match) consumes as little as possible, so the match stops at the first ]]> it encounters.
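A minimal C# sketch of the non-greedy approach follows. The names LazyCdataSketch and ExtractBodies are hypothetical, and the use of RegexOptions.Singleline is an assumption carried over from the question's code, added so that . also matches line breaks inside a CDATA section.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class LazyCdataSketch
{
    // (.*?) expands one character at a time until "]]>" can match, so
    // each match ends at the first closing sequence after its opener.
    // Singleline (an assumption here) lets '.' match newlines as well.
    private static readonly Regex Cdata = new Regex(
        @"\[CDATA\[(.*?)\]\]>", RegexOptions.Singleline);

    // Returns the first capture group of every match.
    public static List<string> ExtractBodies(string input)
    {
        var bodies = new List<string>();
        foreach (Match m in Cdata.Matches(input))
        {
            bodies.Add(m.Groups[1].Value);
        }
        return bodies;
    }
}
```

Unlike the pattern with \s* shown earlier, this one does not tolerate whitespace between the opening [ and CDATA[, so it would skip openers that are split by spaces.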

NOTE - it is usually a REALLY BAD IDEA to use regex for parsing HTML. There are plenty of good libraries available to do the job far more robustly.

See, for example, "What is the best way to parse html in C#?"

License: CC-BY-SA with attribution

Not affiliated with StackOverflow