Question

I am trying to write a regex that matches the entire contents of a tag, minus any leading or trailing whitespace. Here is a boiled-down example of the input:

<tag> text </tag>

I want only the following to be matched (note how the whitespace before and after the match has been trimmed):

"text"

I am currently trying to use this regex in .NET (PowerShell):

(?<=<tag>(\s)*).*?(?=(\s)*</tag>)

However, this regex matches "text" plus the leading whitespace inside of the tag, which is undesired. How can I fix my regex to work as expected?


Solution

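Drop the lookarounds; they just make the job more complicated than it needs to be. Your lookbehind is already satisfied right after <tag>, because (\s)* is allowed to match zero characters, so the match starts before the leading whitespace. Instead, use a capturing group to pick out the part you want: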

<tag>\s*(.*?)\s*</tag>

After a successful -match, the part you want is available as $matches[1].
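
As a minimal sketch of how that looks in PowerShell (the variable names $sample, $pattern and $inner are made up for illustration, and the input is assumed to sit on a single line):

```powershell
# Hypothetical sample input; the variable names are illustrative only.
$sample = '<tag> text </tag>'

# The \s* parts outside the group consume the surrounding whitespace,
# so the lazy group (.*?) captures only the trimmed content.
$pattern = '<tag>\s*(.*?)\s*</tag>'

if ($sample -match $pattern) {
    # -match fills the automatic $Matches table; index 1 is the first group.
    $inner = $Matches[1]
    "[$inner]"   # prints [text]
}
```

If the tag contents can span several lines, the dot will not cross the line breaks by default; prefixing the pattern with (?s) should take care of that.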

OTHER TIPS

You should not use regex to parse HTML.

Use a parser instead.

Also: Regex to remove body tag attributes (C#)

Also also: RegEx match open tags except XHTML self-contained tags

If all that doesn't convince you, then don't use the dot in the middle of your expression, because the dot also consumes whitespace. Use the word-character escape \w instead, which matches letters, digits and underscores but not whitespace.
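
A rough sketch of that idea, keeping the lookarounds from the question and only swapping the dot for \w (the variable names are hypothetical):

```powershell
# Hypothetical sample input; the variable names are illustrative only.
$sample = '<tag> text </tag>'

# \w cannot match whitespace, so the match can neither start nor end on a space.
$pattern = '(?<=<tag>\s*)\w+(?=\s*</tag>)'

if ($sample -match $pattern) {
    $word = $Matches[0]
    "[$word]"   # prints [text]
}
```

Note the limitation: this only works when the contents are a single word. Something like <tag> some text </tag> would not match at all, because \w does not cover the space in the middle.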

Use these regular expressions to strip leading and trailing whitespace: /^\s+/ and /\s+$/
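
For example, in PowerShell this could look roughly like the following. The slashes are delimiters from other regex flavors and are dropped here, and the $raw value is a made-up stand-in for whatever was extracted from the tag:

```powershell
# Hypothetical value that still carries its surrounding whitespace.
$raw = '     test    '

# First strip the leading whitespace, then the trailing whitespace.
$trimmed = $raw -replace '^\s+', '' -replace '\s+$', ''
"[$trimmed]"   # prints [test]
```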

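        // Requires: using System; using System.Text.RegularExpressions;
        string test = "<tag>     test    </tag>";
        // Capture everything between the tags, then trim the surrounding whitespace.
        string pattern3 = @"<tag>(.*?)</tag>";
        Console.WriteLine("{0}", Regex.Match(test, pattern3).Groups[1].Value.Trim());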
Licensed under: CC-BY-SA with attribution