Question

I want to match mail addresses in a string. That part is no problem, but for some reason I fail at excluding certain HTML tags and attributes.

My mail regex:

[!#\$%&'\*\+\-\/0-9=\?a-z\^_`\{\}\|~]*(?:\\[\x00-\x7F][!#\$%&'\*\+\-\/0-9=\?a-z\^_`\{\}\|~]*)*(?:\.[!#\$%&'\*\+\-\/0-9=\?a-z\^_`\{\}\|~]*(?:\\[\x00-\x7F][!#\$%&'\*\+\-\/0-9=\?a-z\^_`\{\}\|~]*)*)*@[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*\.[a-z]{2,}

Now, I don't want a match if the mail address is inside an input field:

<input type="xxx" value="foo@bar.tld">

I also don't want a match if it is in the title tag:

<title>foo@bar.tld

nor if it is contained in a <style> or <script> element.

I tried using lookaheads, but I either produce invalid regular expressions or the pattern just doesn't work.


Solution

A single regular expression is not going to be able to match addresses and, at the same time, exclude them based on the surrounding tags and attributes in the way you want.

If your target document is well-formed XML then you could use one or more regular expressions to find and replace tags with the empty string, then use your working regex to find mail addresses in whatever text is left.
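A minimal sketch of that replace-then-search idea in Python might look like the following. Note the assumptions: `MAIL_RE` is a simplified stand-in for the mail regex from the question, and the names `SKIP_ELEMENTS_RE`, `ANY_TAG_RE` and `find_mails_by_stripping` are made up for illustration. It also assumes that `<title>`, `<style>` and `<script>` always have a closing tag.

```python
import re

# Simplified stand-in for the mail regex from the question (assumption).
MAIL_RE = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}", re.I
)

# Elements whose text content should be dropped together with the tags.
# Assumes each opening tag has a matching closing tag.
SKIP_ELEMENTS_RE = re.compile(
    r"<(title|style|script)\b[^>]*>.*?</\1\s*>", re.I | re.S
)

# Any remaining tag, including its attributes (e.g. <input value="...">).
ANY_TAG_RE = re.compile(r"<[^>]+>")


def find_mails_by_stripping(markup):
    """Remove unwanted markup first, then search the text that is left."""
    text = SKIP_ELEMENTS_RE.sub(" ", markup)
    text = ANY_TAG_RE.sub(" ", text)
    return MAIL_RE.findall(text)
```

Because every tag is removed before the search runs, an address inside an attribute such as `value="..."` never reaches the mail regex. Comments and CDATA sections are not handled here.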

However, I have to agree with Bohemian that an XML parser is the best way to go, if your target is an XML file. XML is complex and flexible, and there's always the risk that you'll hit a file with features you forgot about when designing your replace-with-empty-string regex, such as CDATA and comment blocks. Best to stick with a parser which is designed and tested for running through XML and extracting the document part by part.
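For a well-formed XML (or XHTML) document, the parser route could be sketched like this with Python's standard `xml.etree.ElementTree`. Again, `MAIL_RE` is a simplified stand-in for the regex from the question, and `SKIP_TAGS` and `find_mails_in_xml` are hypothetical names chosen for this example.

```python
import re
import xml.etree.ElementTree as ET

# Simplified stand-in for the mail regex from the question (assumption).
MAIL_RE = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}", re.I
)

# Elements whose content should never be searched.
SKIP_TAGS = {"title", "style", "script"}


def find_mails_in_xml(source):
    """Search only text nodes, skipping unwanted elements and all attributes."""
    root = ET.fromstring(source)
    found = []

    def walk(element):
        # Tag names may carry a namespace prefix such as "{http://...}title".
        local_name = element.tag.rsplit("}", 1)[-1].lower()
        if local_name in SKIP_TAGS:
            return  # ignore the whole subtree
        if element.text:
            found.extend(MAIL_RE.findall(element.text))
        for child in element:
            walk(child)
            # Text that follows a child still belongs to the current element.
            if child.tail:
                found.extend(MAIL_RE.findall(child.tail))

    walk(root)
    return found
```

Attribute values are never inspected, so an address in `<input value="..."/>` is skipped without any special handling. `ET.fromstring` raises `ParseError` on input that is not well-formed.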

If your target document is unruly HTML which an XML parser can't read, then you may have to try the replace-then-search method.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow