BeautifulSoup fails to parse nested <p> elements

Question 1

That is just BS3's parser fixing your broken HTML.

The P element represents a paragraph. It cannot contain block-level elements (including P itself).

Question 2

This

<p><p>123</p></p>

is not valid HTML: p elements can't be nested, so BS tries to clean it up.

When BS encounters the second <p>, it thinks the first p is finished, so it inserts a closing </p>. The second </p> in your input then does not match an opening <p>, so it is removed.
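The cleanup described above can be imitated with a small toy repairer built on the standard library's html.parser. This is only a sketch of the idea, not BeautifulSoup's code: the class name ToyParagraphRepair and the helper repair are made up for illustration, attributes are discarded, and only the <p> rule is modelled.

```python
from html.parser import HTMLParser


class ToyParagraphRepair(HTMLParser):
    """Hypothetical toy model of the cleanup; it only knows the <p> rule."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.out = []        # pieces of the repaired markup

    def handle_starttag(self, tag, attrs):
        if tag == "p" and "p" in self.open_tags:
            # A new <p> means the previous one is finished: close it first,
            # together with anything still open inside it.
            while self.open_tags:
                closed = self.open_tags.pop()
                self.out.append("</%s>" % closed)
                if closed == "p":
                    break
        self.open_tags.append(tag)
        self.out.append("<%s>" % tag)  # attributes are dropped in this toy

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            # No matching start tag is open, so the stray end tag is removed.
            return
        while self.open_tags:
            closed = self.open_tags.pop()
            self.out.append("</%s>" % closed)
            if closed == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def repair(markup):
    """Feed markup through the toy repairer and return the cleaned string."""
    parser = ToyParagraphRepair()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(repair("<p><p>123</p></p>"))  # <p></p><p>123</p>
```

The first p is closed as soon as the second one opens, and the final </p> is dropped because nothing is left for it to close.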

Question 3

This is because of BeautifulSoup's NESTABLE_TAGS setting, which its documentation describes as follows:

When Beautiful Soup is parsing a document, it keeps a stack of open tags. Whenever it sees a new start tag, it tosses that tag on top of the stack. But before it does, it might close some of the open tags and remove them from the stack. Which tags it closes depends on the qualities of tag it just found, and the qualities of the tags in the stack.

So when Beautiful Soup encounters a <P> tag, it closes and pops all the tags up to and including the previously encountered tag of the same type. This is the default behavior, and this is how BeautifulStoneSoup treats every tag. It's what you get when a tag is not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also what you get when a tag shows up in RESET_NESTING_TAGS but has no entry in NESTABLE_TAGS, the way the <P> tag does.

>>> from pprint import pprint
>>> from BeautifulSoup import BeautifulSoup
>>> pprint(BeautifulSoup.NESTABLE_TAGS)
{'bdo': [],
 'blockquote': [],
 'center': [],
 'dd': ['dl'],
 'del': [],
 'div': [],
 'dl': [],
 'dt': ['dl'],
 'fieldset': [],
 'font': [],
 'ins': [],
 'li': ['ul', 'ol'],
 'object': [],
 'ol': [],
 'q': [],
 'span': [],
 'sub': [],
 'sup': [],
 'table': [],
 'tbody': ['table'],
 'td': ['tr'],
 'tfoot': ['table'],
 'th': ['tr'],
 'thead': ['table'],
 'tr': ['table', 'tbody', 'tfoot', 'thead'],
 'ul': []}

As a workaround, you can allow a p tag to be nested inside another p:

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup.NESTABLE_TAGS['p'] = ['p']
>>> BeautifulSoup('<p><p>123</p></p>')
<p><p>123</p></p>
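To see why that one entry changes the result, here is a toy model of the tag stack. It is a simplified sketch, not BeautifulSoup's implementation: the function name push_start_tag is made up, RESET_NESTING_TAGS is ignored, and any tag with an entry in the nestable mapping is assumed to nest freely.

```python
def push_start_tag(stack, name, nestable_tags):
    """Toy model of the default rule quoted above (not BeautifulSoup's code).

    Pushes `name` onto `stack` and returns the tags that were closed first.
    Simplification: any tag with an entry in `nestable_tags` may nest freely.
    """
    closed = []
    if name not in nestable_tags and name in stack:
        # Default rule: pop up to and including the previous tag of this name.
        while True:
            top = stack.pop()
            closed.append(top)
            if top == name:
                break
    stack.append(name)
    return closed


# Default: 'p' has no entry, so the open <p> (and anything inside it) closes.
stack = ["html", "p", "a", "span"]
print(push_start_tag(stack, "p", {}), stack)
# ['span', 'a', 'p'] ['html', 'p']

# With the workaround's entry, nothing is closed and the <p> tags nest.
stack = ["html", "p"]
print(push_start_tag(stack, "p", {"p": ["p"]}), stack)
# [] ['html', 'p', 'p']
```

In the first call the earlier p is popped before the new one is pushed, which is what turns the nested input into two sibling paragraphs. In the second call the entry for p suppresses the pop.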

Also, BeautifulSoup 3 is no longer maintained; you should switch to BeautifulSoup 4.

With BeautifulSoup 4, the result depends on the underlying parser, which you can choose with the second argument (when it is omitted, the best parser installed is used):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<p><p>123</p></p>')
<html><body><p></p><p>123</p></body></html>
>>> BeautifulSoup('<p><p>123</p></p>', 'html.parser')
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'html5lib')
<html><head></head><body><p></p><p>123</p><p></p></body></html>
