Question

Dependencies: BeautifulSoup==3.2.1

In: from BeautifulSoup import BeautifulSoup
In: BeautifulSoup('<p><p>123</p></p>')
Out: <p></p><p>123</p>

Why are the nested tags from the input turned into two adjacent tags in the output?

Solution

That is just BS3's parser fixing your broken HTML. As the HTML specification puts it:

The P element represents a paragraph. It cannot contain block-level elements (including P itself).

Other tips

This

<p><p>123</p></p>

is not valid HTML. <p> elements can't be nested, so BS tries to clean it up.

When BS encounters the second <p>, it assumes the first <p> is finished, so it inserts a closing </p>. The second </p> in your input then has no matching open <p>, so it is dropped.

This is because BeautifulSoup has a NESTABLE_TAGS setting. From the documentation:

When Beautiful Soup is parsing a document, it keeps a stack of open tags. Whenever it sees a new start tag, it tosses that tag on top of the stack. But before it does, it might close some of the open tags and remove them from the stack. Which tags it closes depends on the qualities of tag it just found, and the qualities of the tags in the stack.

So when Beautiful Soup encounters a <P> tag, it closes and pops all the tags up to and including the previously encountered tag of the same type. This is the default behavior, and this is how BeautifulStoneSoup treats every tag. It's what you get when a tag is not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also what you get when a tag shows up in RESET_NESTING_TAGS but has no entry in NESTABLE_TAGS, the way the <P> tag does.
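
The stack behavior described above can be sketched in a few lines of plain Python. This is only a toy model, not BS3's actual code: the `NESTABLE` set, the token format, and the `normalize` function are made up for illustration, and the real parser also handles RESET_NESTING_TAGS and per-tag parent lists, which are ignored here.

```python
# Toy model of the "pop up to and including the same tag" rule.
# NESTABLE is an illustrative subset, not BS3's real table.
NESTABLE = {"div", "span", "blockquote"}


def normalize(tokens):
    """Rebuild markup from (kind, value) tokens using a stack of open tags."""
    stack = []  # names of currently open tags, innermost last
    out = []    # pieces of the repaired markup

    def pop_through(name):
        # Close open tags until (and including) the nearest `name`.
        while stack:
            top = stack.pop()
            out.append("</%s>" % top)
            if top == name:
                break

    for kind, value in tokens:
        if kind == "start":
            # A non-nestable tag first closes its previous occurrence.
            if value not in NESTABLE and value in stack:
                pop_through(value)
            stack.append(value)
            out.append("<%s>" % value)
        elif kind == "end":
            # An end tag with no matching open tag is silently dropped.
            if value in stack:
                pop_through(value)
        else:
            out.append(value)

    # Close whatever is still open at the end of the input.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)


nested_p = [("start", "p"), ("start", "p"), ("text", "123"),
            ("end", "p"), ("end", "p")]
print(normalize(nested_p))  # <p></p><p>123</p>
```

Feeding the same token sequence with `div` in place of `p` leaves the nesting intact in this model, because `div` is in the nestable set.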

>>> pprint(BeautifulSoup.NESTABLE_TAGS)
{'bdo': [],
 'blockquote': [],
 'center': [],
 'dd': ['dl'],
 'del': [],
 'div': [],
 'dl': [],
 'dt': ['dl'],
 'fieldset': [],
 'font': [],
 'ins': [],
 'li': ['ul', 'ol'],
 'object': [],
 'ol': [],
 'q': [],
 'span': [],
 'sub': [],
 'sup': [],
 'table': [],
 'tbody': ['table'],
 'td': ['tr'],
 'tfoot': ['table'],
 'th': ['tr'],
 'thead': ['table'],
 'tr': ['table', 'tbody', 'tfoot', 'thead'],
 'ul': []}

As a workaround, you can allow a p tag to nest inside another p:

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup.NESTABLE_TAGS['p'] = ['p']
>>> BeautifulSoup('<p><p>123</p></p>')
<p><p>123</p></p>

Also, BeautifulSoup 3 is no longer maintained; you should switch to BeautifulSoup 4 (the bs4 package).

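When using BeautifulSoup4, the result depends on the underlying parser, which you can choose explicitly. With no parser argument, bs4 picks the best one installed; the first output below is what you get when that is lxml: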

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<p><p>123</p></p>')
<html><body><p></p><p>123</p></body></html>
>>> BeautifulSoup('<p><p>123</p></p>', 'html.parser')
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'html5lib')
<html><head></head><body><p></p><p>123</p><p></p></body></html>
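
To see why the 'html.parser' result keeps the nesting, it may help to look at what the standard library's parser reports on its own. The sketch below uses Python 3's `html.parser.HTMLParser` directly, without bs4; the `EventLogger` class is a name made up for this illustration. The parser just emits start, data, and end events in input order and does not repair the nesting itself, which is consistent with the output shown above.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the parse events the stdlib parser emits, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p><p>123</p></p>")
logger.close()
print(logger.events)
# [('start', 'p'), ('start', 'p'), ('data', '123'), ('end', 'p'), ('end', 'p')]
```

lxml and html5lib, by contrast, apply their own HTML repair rules while parsing, so the tree bs4 receives from them is already restructured.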
License: CC-BY-SA with attribution