Getting clean text from text/html documents using BeautifulSoup

https://stackoverflow.com/questions/9761720

24-05-2021
Question

I have a document that has two content types: text/xml and text/html. I would like to use BeautifulSoup to parse the document and end up with a clean text version. The document starts as a tuple, so I have been using repr to turn it into something BeautifulSoup recognizes, and then using find_all to find just the text/html bit of the document by searching for the divs, like so:

from bs4 import BeautifulSoup

soup = BeautifulSoup(repr(msg_data))
text = soup.html.find_all("div")

Then, I'm turning text back into a string, saving it to a variable and then turning it back into a soup object and calling get_text on it, like so:

str_text = str(text)
soup_text = BeautifulSoup(str_text)
soup_text.get_text()

However, that then changes the encoding to unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17     
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8')

I am back to the unparsed type.

I would like to get to the point where I have clean text saved as a string, so that I can find specific things within the text (for example, "puppies" in the text above).

Basically, I'm running around in circles here. Can anyone help? As always, thank you so much for any help you can give.

Solution

The encoding isn't ruined; it's exactly what it should be. u'\xa0' is the escaped form of U+00A0, the Unicode no-break (non-breaking) space.
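To check what an unfamiliar escape such as \xa0 stands for, the standard library's unicodedata module can be asked for the character's name. This is a small sketch; the variable names are only illustrative.

```python
import unicodedata

nbsp = u'\xa0'                        # the character that appears in the output

name = unicodedata.name(nbsp)         # official Unicode character name
code_point = u'U+%04X' % ord(nbsp)    # code point in the usual U+XXXX notation
is_space = nbsp.isspace()             # whether Python treats it as whitespace

print(name)
print(code_point)
print(is_space)
```

The character counts as whitespace, which is why it looks like an ordinary space when printed but shows up as \xa0 in the repr of the string.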

If you want to encode this (Unicode) string as ASCII, you can tell the codec to ignore any character it cannot encode:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do,  9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while  browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic,  \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives  them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'
>>> x.encode('ascii', 'ignore')
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do,  9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while  browsing their site, me: srsly, Erica: unless of course your writing is magic,  me: My writing saves drowning puppies, Just plucks him right out and gives  them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
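Note that 'ignore' deletes each no-break space outright, so neighbouring words run together ("PMErica" in the output above). If a plain space is preferable, one option is to replace the character before searching or encoding. The following is a minimal sketch under that assumption; normalize_spaces is a hypothetical helper name, and the sample is a short excerpt of the string above.

```python
# Short excerpt of the string above; u'' literals work on Python 2 and 3.3+.
raw = u'\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out'


def normalize_spaces(text):
    """Hypothetical helper: swap each no-break space (U+00A0) for a plain space."""
    return text.replace(u'\xa0', u' ')


clean = normalize_spaces(raw)

# The result is still a text (Unicode) string, so ordinary searches work on it.
has_puppies = u'puppies' in clean
position = clean.find(u'puppies')

# After the replacement, every remaining character in this sample is ASCII,
# so a strict ASCII encode succeeds without needing the 'ignore' handler.
ascii_bytes = clean.encode('ascii')
```

With the spaces preserved, searching for "puppies" (or splitting on whitespace) behaves as it would on any ordinary string.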

If you have time, you should watch Ned Batchelder's recent video Pragmatic Unicode. It will make everything clear and simple!

Licensed under: CC-BY-SA with attribution
