Question

I'm parsing some rather messy HTML documents using BeautifulSoup 4 (4.3.2) and am running into a problem where it turns company names like S&P (Standard and Poor's), M&S (Marks and Spencer) or AT&T into S&P;, M&S; and AT&T;. So it wants to complete the &[A-Z]+ pattern into an HTML entity, but doesn't actually use an HTML entity lookup table, since &P; is not an HTML entity.

How do I make it not do that, or do I just need to regex match the invalid entities and change them back?

>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&T announces new plans')
>>> soup.text
u'AT&T; announces new plans'

>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&TOP announces new plans')
>>> soup.text
u'AT&TOP; announces new plans'

I've tried the above on OS X 10.8.5 with Python 2.7.5 and on Scientific Linux 6 with Python 2.7.5.

Solution

This appears to be a bug (or feature) in the way BeautifulSoup 4 handles unknown HTML entity references. As Ignacio says in the comment above, it would probably be better to pre-process the input and replace the '&' symbols with HTML entities ('&amp;').
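A minimal sketch of that pre-processing step, using only the standard library. The function name `escape_bare_ampersands` and the regular expression are illustrative assumptions, not part of bs4; the pattern treats any '&' that is not followed by a name or number and a terminating ';' as a stray ampersand.

```python
import re

# An '&' that does NOT start a well-formed reference such as
# '&amp;', '&#38;' or '&#x26;' (name or number terminated by ';').
BARE_AMPERSAND = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)

def escape_bare_ampersands(markup):
    """Replace stray '&' characters with '&amp;' before parsing."""
    return BARE_AMPERSAND.sub('&amp;', markup)

cleaned = escape_bare_ampersands('AT&T announces new plans')
print(cleaned)  # AT&amp;T announces new plans
```

The cleaned string can then be passed to bs4.BeautifulSoup as usual. Note the limitation: input that already contains something shaped like a reference (for example a literal `&P;`) is left untouched, because the pattern cannot tell it apart from a real entity.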

But if you don't want to do that for some reason, the only way I could find to fix the problem was by "monkey-patching" the code. This script worked for me (Python 2.7.3 on Mac OS X):

import bs4

def my_handle_entityref(self, name):
    # Known entities are decoded via bs4's own lookup table.
    character = bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # The original code mishandles unknown entities by appending a ';'
        # (the following commented-out line):
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)

bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_entityref = my_handle_entityref
soup = bs4.BeautifulSoup('AT&T announces new plans')
print soup.text
soup = bs4.BeautifulSoup('AT&TOP announces new plans')
print soup.text

It produces the output:

AT&T announces new plans
AT&TOP announces new plans

You can see the method with the problem here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L81

And the line with the problem here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L86

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow