Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?
Question
In other words, may one use the regex /<tag[^>]*>.*?<\/tag>/ to match a tag HTML element that does not contain nested tag elements?
For example (lt.html):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>greater than sign in attribute value</title>
</head>
<body>
<div>1</div>
<div title=">">2</div>
</body>
</html>
Regex:
$ perl -nE 'say $1 if m~<div[^>]*>(.*?)</div>~' lt.html
And screen-scraper:
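#!/usr/bin/env python
# lt.py: Python 2 script using BeautifulSoup 3 (the old `BeautifulSoup` module)
import sys
import BeautifulSoup
# Parse the HTML document read from standard input
soup = BeautifulSoup.BeautifulSoup(sys.stdin)
# Print the text content of every <div> element
for div in soup.findAll('div'):
    print div.string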
$ python lt.py <lt.html
Both give the same output:
1
">2
Expected output:
1
2
W3C says:
Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.
Solution
Yes, it is allowed (W3C Validator accepts it, only issues a warning).
Unescaped < and > are also allowed inside comments, so such a simple regexp can be fooled.
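To make the failure modes concrete, here is a minimal sketch in Python 3 (using the standard re module instead of Perl). The sample strings and the names NAIVE, samples and results are made up for illustration; the pattern is the one from the question, specialised to div.

```python
import re

# The simple pattern from the question, with a capture group for the content.
NAIVE = re.compile(r"<div[^>]*>(.*?)</div>")

samples = [
    '<div>1</div>',                          # plain element
    '<div title=">">2</div>',                # '>' inside a quoted attribute value
    '<!-- <div>old</div> --><div>3</div>',   # markup hidden inside a comment
]

# Take the first match in each sample, as the Perl one-liner does per line.
results = [NAIVE.search(s).group(1) for s in samples]
for sample, result in zip(samples, results):
    print(repr(sample), "->", repr(result))
```

The second sample stops the start tag at the > inside the attribute value, and the third matches the div inside the comment rather than the real one.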
If BeautifulSoup doesn't handle this, it could be a bug or perhaps a conscious design decision to make it more resilient to missing closing quotes in attributes.
OTHER TIPS
I believe that's valid, and the W3C validator agrees, but the authoritative source for this information is the ISO 8879:1986 standard, which costs ~150EUR/210USD. Regardless, it is not wrong to encode them, so if in doubt, encode. Additionally, if you are using an XML-based document type, you need to encode the greater-than sign in the sequence ]]>.
A literal > is legal everywhere in HTML content, both inside attribute values and as text within an element.
After reading the following:
http://www.w3.org/International/questions/qa-escapes
it looks like entity escapes are suggested everywhere (including in attributes) for <, > and &.
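If you generate the markup yourself, escaping is easy to automate. A minimal sketch, assuming Python 3 and its standard html module; the variable names are illustrative only.

```python
import html

raw_value = ">"   # the attribute value as the author means it

# html.escape replaces &, < and >; quote=True also escapes " and '.
escaped = html.escape(raw_value, quote=True)
tag = '<div title="{}">2</div>'.format(escaped)

print(escaped)
print(tag)

# Decoding the character reference gives back the original character.
assert html.unescape(escaped) == raw_value
```

With the value escaped this way, no literal > remains inside the start tag.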
If you insist on using regular expressions (which are appropriate for basic string operations), try <tag((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>.*?<\/tag>. It should match the attributes, including quoted values that contain >, and therefore allow you to access the inner content (although you need to put the .*? in a capture group).
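A quick way to try the attribute-aware pattern is the following sketch in Python 3. The pattern is the one given above, specialised to div; the named group inner and the variable names are additions for this example, not part of the original pattern.

```python
import re

# Attribute-aware pattern, specialised to <div>, with a named group
# (added for this sketch) around the inner content.
ATTR_AWARE = re.compile(
    r"""<div((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)>"""
    r"""(?P<inner>.*?)</div>"""
)

html_lines = ['<div>1</div>', '<div title=">">2</div>']

# Extract the content of the first <div> on each line.
inner = [ATTR_AWARE.search(line).group("inner") for line in html_lines]
print(inner)
```

Some limits remain: \w+ does not match attribute names that contain a hyphen (for example data-x), and the pattern is still unaware of comments and nested elements.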
You may also use the Html Agility Pack for parsing HTML, which I would recommend if you are going to do a lot of parsing. Maintaining large regular expressions can easily become a headache, although they can also be very effective if you are able to maintain them.
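The Html Agility Pack is a .NET library. As a rough analogue of the same parser-based approach, here is a sketch using Python 3's standard html.parser module. The class name DivText and the sample document are hypothetical, and the sketch assumes div elements are not nested, as in the question.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect the text of each non-nested <div> element."""

    def __init__(self):
        super().__init__()
        self.in_div = False
        self.current = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a <div> opens.
        if tag == "div":
            self.in_div = True
            self.current = []

    def handle_data(self, data):
        if self.in_div:
            self.current.append(data)

    def handle_endtag(self, tag):
        # Store the collected text when the <div> closes.
        if tag == "div" and self.in_div:
            self.texts.append("".join(self.current))
            self.in_div = False


doc = '<body>\n<div>1</div>\n<div title=">">2</div>\n<!-- <div>x</div> -->\n</body>'
parser = DivText()
parser.feed(doc)
parser.close()
print(parser.texts)
```

The tokenizer deals with quoted attribute values and comments, so the extraction code never has to reason about where a tag ends.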
Yeah, except watch the quantifier between the tags: with a greedy .*, the pattern /<tag[^>]*>.*<\/tag>/ will not match a single element; it will match from the first start tag to the last end tag of that name. The in-between part has to be non-greedy (.*?), as in the pattern shown in the question.
See if you get the same result using &gt; instead of >.
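That comparison can be sketched as follows, assuming Python 3's re module stands in for the Perl one-liner; the two sample lines differ only in how the attribute value is written.

```python
import re

# The simple pattern from the question, specialised to <div>.
NAIVE = re.compile(r"<div[^>]*>(.*?)</div>")

escaped_line = '<div title="&gt;">2</div>'   # character reference in the value
literal_line = '<div title=">">2</div>'      # literal '>' in the value

escaped_match = NAIVE.search(escaped_line).group(1)
literal_match = NAIVE.search(literal_line).group(1)

print(escaped_match)
print(literal_match)
```

Only the escaped form gives the expected content with the simple pattern.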