Question

I noticed that HTML I fetch from the web comes out altered after I parse it with Beautiful Soup. This is the code I am using:

from bs4 import BeautifulSoup
import requests
url = "http://www.basketnews.lt/lygos/59-nacionaline-krepsinio-asociacija/2013/naujienos.html"
r = requests.get(url)
soup = BeautifulSoup(r.text)
print(soup)

Here is part of the original HTML:

<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">Valančiūnui ir „Raptors“ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>

Here is the same part of the HTML as I get it from Beautiful Soup:

<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">ValanÄiÅ«nui ir âRaptorsâ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>

As you can see, the text is garbled in the HTML that I am parsing. What is causing this?

Solution

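You are using r.text, which makes requests decode the response body to Unicode itself. The server sends no charset in its Content-Type header, so requests falls back to a default encoding, and that default is wrong for this page: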

>>> r = requests.get(url)
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'

ISO-8859-1 (Latin-1) is the HTTP/1.1 default encoding for text/* responses that do not declare a charset.
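The fallback rule can be sketched roughly as below. This is a simplified illustration written for Python 3 using only the standard library, not the actual code in requests; the helper name charset_for is made up for this example.

```python
from email.message import Message


def charset_for(content_type):
    """Return the charset implied by a Content-Type header value.

    Hypothetical helper: a simplified sketch of the HTTP/1.1 rule,
    not the implementation used by requests.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always wins.
    declared = msg.get_content_charset()
    if declared:
        return declared

    # text/* without a charset falls back to ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # Anything else: no encoding can be inferred from the header.
    return None


print(charset_for("text/html"))
print(charset_for("text/html; charset=utf-8"))
```

With the header from this page ('text/html', no charset) the sketch lands on ISO-8859-1, which matches the r.encoding value shown above.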

r.apparent_encoding, which runs a character-detection algorithm over the response body, reports UTF-8 instead.
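To see why the mismatch produces the garbled text from the question, here is a small offline sketch (Python 3, no network access). It encodes the headline as UTF-8, decodes those bytes as ISO-8859-1 the way r.text effectively does here, and then reverses the mistake. The variable names are only for illustration.

```python
# The headline as it should appear.
original = "Valančiūnui ir „Raptors“ sezonas baigtas"

# What the server actually sends: UTF-8 encoded bytes.
raw = original.encode("utf-8")

# Decoding those bytes with the wrong codec never fails, because
# ISO-8859-1 maps every byte to some character. It just yields mojibake.
garbled = raw.decode("iso-8859-1")

# The damage is reversible as long as nothing else touched the string:
# re-encode with the wrong codec, then decode with the right one.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired == original)
```

Each multi-byte UTF-8 character turns into two or three Latin-1 characters, which is why "č" shows up as "Ä" followed by an invisible control character.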

Don't use r.text here. Pass the raw bytes in r.content to BeautifulSoup instead and let it detect the encoding:

soup = BeautifulSoup(r.content)

Now it works correctly:

>>> soup = BeautifulSoup(r.content)
>>> soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html')
<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">Valančiūnui ir „Raptors“ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>
>>> print(soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html').text)
Valančiūnui ir „Raptors“ sezonas baigtas (foto, statistika)

BeautifulSoup also runs auto-detection on byte input, but in this case it does not have to guess: it finds a <meta> tag in the page that declares the right encoding:

>>> soup.find('meta', {'http-equiv': 'content-type'})
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
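The idea of trusting the declaration inside the document can be sketched with the standard library alone. This is a deliberately simplified illustration for Python 3, not how BeautifulSoup is implemented; decode_html, META_CHARSET and the sample page are hypothetical names and data made up for the example.

```python
import re

# Simplified pattern: a charset=... declaration inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def decode_html(raw, fallback="utf-8"):
    """Decode HTML bytes using the charset declared in a <meta> tag.

    Returns (text, encoding_used). Falls back when no declaration is
    found or when the declared name is not a codec Python knows.
    """
    # Declarations normally sit near the top of the document.
    match = META_CHARSET.search(raw[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        return raw.decode(fallback, errors="replace"), fallback


# Sample bytes standing in for r.content.
page = (
    "<html><head>"
    '<meta http-equiv="content-type" content="text/html; charset=utf-8"/>'
    "</head><body>"
    '<a href="/news">Valančiūnui ir „Raptors“</a>'
    "</body></html>"
).encode("utf-8")

text, encoding = decode_html(page)
print(encoding)
print(text)
```

The point is the same as in the answer: work from the bytes and the page's own declaration rather than from a header-based default.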
Licensed under: CC-BY-SA with attribution