Beautiful Soup turns S&P into S&P; AT&T into AT&T; ?

Question

This appears to be a bug (or a feature) in the way BeautifulSoup4 handles unknown HTML entity references. As Ignacio says in the comment above, it would probably be better to pre-process the input and replace the bare '&' symbols with HTML entities ('&amp;').
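For reference, the pre-processing approach could look roughly like the sketch below. It is not taken from Ignacio's comment; the function name `escape_bare_ampersands` and the regular expression are my own assumptions, and it only treats numeric references and names listed in the standard library's entity table as "real" entities.

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# An ampersand, optionally followed by something shaped like an entity
# reference ("name;", "#123;" or "#x1F;"). The reference body is captured.
_AMPERSAND = re.compile(
    r'&(?:(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)?')

def escape_bare_ampersands(markup):
    """Hypothetical helper: turn '&' into '&amp;' unless it starts a
    numeric reference or a known named entity."""
    def fix(match):
        ref = match.group(1)
        if ref is not None and (ref.startswith('#') or ref in name2codepoint):
            return match.group(0)             # real entity, leave untouched
        return '&amp;' + match.group(0)[1:]   # bare or unknown, escape the '&'
    return _AMPERSAND.sub(fix, markup)

print(escape_bare_ampersands('AT&T announces new plans'))  # AT&amp;T announces new plans
print(escape_bare_ampersands('fish &amp; chips'))          # fish &amp; chips
```

The escaped string can then be handed to bs4.BeautifulSoup as usual, and the parser should decode '&amp;' back to a plain '&' in the text.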

But if you don't want to do that for some reason, the only way I could find to fix the problem was by "monkey-patching" the code. This script worked for me (Python 2.7.3 on Mac OS X):

import bs4

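def my_handle_entityref(self, name):
    # Look up the entity name (e.g. "amp") in bs4's table of known HTML entities.
    character = bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # The original code mishandles unknown entities by appending a ";"
        # (the following commented-out line):
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)

# Replace the method on the parser class used by the html.parser builder.
bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_entityref = my_handle_entityref
# The patch only affects the html.parser builder, so request it explicitly.
soup = bs4.BeautifulSoup('AT&T announces new plans', 'html.parser')
print(soup.text)
soup = bs4.BeautifulSoup('AT&TOP announces new plans', 'html.parser')
print(soup.text)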

It produces the output:

AT&T announces new plans
AT&TOP announces new plans

You can see the method with the problem here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L81

And the line with the problem here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L86