Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

StackOverflow https://stackoverflow.com/questions/94528

  •  01-07-2019

Question

In other words, may one use the regex /<tag[^>]*>.*?<\/tag>/ to match a tag HTML element that does not contain nested tag elements?

For example (lt.html):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>greater than sign in attribute value</title>
  </head>
  <body>
    <div>1</div>
    <div title=">">2</div>
  </body>
</html>

Regex:

$ perl -nE"say $1 if m~<div[^>]*>(.*?)</div>~" lt.html

And screen-scraper:

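#!/usr/bin/env python
# Python 2 script using BeautifulSoup 3.x (the `BeautifulSoup` module).
import sys
import BeautifulSoup

# Parse the HTML from stdin and print the text of every <div>.
soup = BeautifulSoup.BeautifulSoup(sys.stdin)
for div in soup.findAll('div'):
    print div.string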


$ python lt.py <lt.html

Both give the same output:

1
">2

Expected output:

1
2

The W3C says:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.


Solution

Yes, it is allowed (the W3C Validator accepts it and only issues a warning).

Unescaped < and > are also allowed inside comments, so such a simple regexp can be fooled.

If BeautifulSoup doesn't handle this, it could be a bug or perhaps a conscious design decision to make it more resilient to missing closing quotes in attributes.
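As a rough illustration of the two points above, here is a minimal Python 3 sketch that compares the naive regex with the standard-library html.parser module. The names HTML, COMMENTED, DivText, naive_divs and parsed_divs are made up for this example, and the parser class only handles non-nested div elements; it is not a replacement for the BeautifulSoup script in the question.

```python
import re
from html.parser import HTMLParser

# Sample inputs (hypothetical): the two divs from the question, and a
# comment that happens to contain div markup.
HTML = '<div>1</div>\n<div title=">">2</div>\n'
COMMENTED = '<!-- <div>old</div> --><div>new</div>'

# The naive pattern from the question, with "div" as the tag name.
NAIVE = re.compile(r'<div[^>]*>(.*?)</div>')


class DivText(HTMLParser):
    """Collect the text content of each non-nested <div>."""

    def __init__(self):
        super().__init__()
        self.in_div = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            self.in_div = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_div = False

    def handle_data(self, data):
        # Text inside comments never reaches this method.
        if self.in_div:
            self.texts[-1] += data


def naive_divs(html):
    """Return what the naive regex captures as div content."""
    return NAIVE.findall(html)


def parsed_divs(html):
    """Return div content as seen by a real HTML tokenizer."""
    parser = DivText()
    parser.feed(html)
    parser.close()
    return parser.texts


print(naive_divs(HTML))        # ['1', '">2']
print(parsed_divs(HTML))       # ['1', '2']
print(naive_divs(COMMENTED))   # ['old', 'new']
print(parsed_divs(COMMENTED))  # ['new']
```

The regex stops at the first > it sees, whether that character sits in an attribute value or in a comment, while the tokenizer tracks quoting and comment boundaries.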

OTHER TIPS

I believe that's valid, and the W3C validator agrees, but the authoritative source for this information is the ISO 8879:1986 standard, which costs ~150EUR/210USD. Regardless, it is not wrong to encode them, so if in doubt, encode. Additionally, if you are using an XML-based document type, you need to encode greater-than signs in the sequence ]]>.

Literal > is legal everywhere in html content, both inside attribute values and as text within an element.

After reading the following:

http://www.w3.org/International/questions/qa-escapes

it looks like entity escapes are suggested everywhere (including in attributes) for <, > and &.

If you insist on using regular expressions (which are appropriate for basic string operations), try <tag((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>.*?<\/tag>. It should match quoted and unquoted attribute values correctly and therefore allow you to access the inner content (although you need to put the .*? in a capture group).
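A minimal Python sketch of that pattern, assuming div as the tag name and adding a named capture group (inner, a name chosen here) around the content. ATTR_AWARE and inner_texts are hypothetical names for this example.

```python
import re

# The attribute-aware pattern from the answer, with "div" substituted for
# "tag" and a named group around the inner content (an addition for this
# sketch).
ATTR_AWARE = re.compile(
    r'''<div((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>'''
    r'''(?P<inner>.*?)</div>'''
)


def inner_texts(html):
    """Return the inner content of every div the pattern matches."""
    return [m.group('inner') for m in ATTR_AWARE.finditer(html)]


sample = '<div>1</div>\n<div title=">">2</div>\n'
print(inner_texts(sample))  # ['1', '2']

# Limitation: \w+ does not cover hyphenated attribute names.
print(inner_texts('<div data-id="x">3</div>'))  # []
```

Two caveats apply to this sketch: without re.DOTALL the dot does not match newlines, so multi-line content is skipped, and attribute names containing a hyphen are not matched at all.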

You may also use the Html Agility Pack for parsing HTML, which I would recommend if you are going to do a lot of parsing. Maintaining large regular expressions can easily become a headache, although they can also be much more efficient if you are able to maintain them.

Yes, except that the in-between part must stay non-greedy. With a greedy .* instead of .*?, the regex would not match a single element but everything from the first start tag to the last end tag for a given tag.

See if you get the same result using &gt; instead of >.
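If you control how the markup is generated, a short Python sketch of that suggestion could look like this. It uses html.escape from the standard library; div_with_title is a hypothetical helper written for this example.

```python
import html
import re

# The naive pattern from the question, with "div" as the tag name.
NAIVE = re.compile(r'<div[^>]*>(.*?)</div>')


def div_with_title(title, text):
    """Build a div whose attribute value and text are entity-escaped."""
    return '<div title="{}">{}</div>'.format(
        html.escape(title, quote=True),  # escapes &, <, >, " and '
        html.escape(text),
    )


markup = div_with_title('>', '2')
print(markup)                 # <div title="&gt;">2</div>
print(NAIVE.findall(markup))  # ['2']
```

Once the > in the attribute value is written as &gt;, even the naive regex returns the expected content, because no literal > appears before the end of the start tag.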

Licensed under: CC-BY-SA with attribution