You are using r.text, which means that requests decodes the response body with a default encoding, and in this case it picks the wrong one:
>>> r = requests.get(url)
>>> r.headers['content-type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
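ISO-8859-1 (Latin 1) is the HTTP 1.1 default encoding for text/* responses that don't declare a charset in the Content-Type header, and that is what r.encoding reports here.
r.apparent_encoding is the result of running a detection algorithm on the body, and that finds UTF-8.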
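To see why the wrong default matters, here is a minimal sketch that uses no network access. It takes a sample string (an assumption chosen to resemble the page's text), encodes it as UTF-8 the way the server would, and decodes the bytes both ways. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch: the same bytes decoded with two different codecs.
text = u"Valan\u010di\u016bnui ir \u201eRaptors\u201c sezonas baigtas"

raw = text.encode("utf-8")          # the bytes the server sends (r.content)
wrong = raw.decode("iso-8859-1")    # decoding with the HTTP default, as r.text did
right = raw.decode("utf-8")         # decoding with the detected encoding

# Latin 1 maps every byte to one character, so decoding never fails;
# it silently produces mojibake instead.
print(wrong == text)   # False
print(right == text)   # True
print(len(text), len(wrong))  # each multi-byte character becomes several
```

Because the Latin 1 decode never raises an error, the only symptom is garbled text, which is why the mismatch is easy to miss.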
You shouldn't be using r.text here; pass r.content (the raw bytes) instead and leave it to BeautifulSoup to do the detection:
soup = BeautifulSoup(r.content)
Now it works correctly:
>>> soup = BeautifulSoup(r.content)
>>> soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html')
<a href="/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html">Valančiūnui ir „Raptors“ sezonas baigtas <span class="title_description">(foto, statistika)</span></a>
>>> print(soup.find('a', href='/news-73149-valanciunui-ir-raptors-sezonas-baigtas-foto-statistika.html').text)
Valančiūnui ir „Raptors“ sezonas baigtas (foto, statistika)
BeautifulSoup also uses auto-detection, but in this case it finds the <meta> header in the page, which declares the right encoding:
>>> soup.find('meta', {'http-equiv': 'content-type'})
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
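The idea of reading the encoding from the markup itself can be sketched with the standard library alone. This is a deliberately simplified stand-in, not BeautifulSoup's actual detection logic; the helper name sniff_meta_charset, the regular expression, and the sample page are all hypothetical.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical, simplified pattern: look for charset=... inside a <meta> tag.
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)',
                          re.IGNORECASE)

def sniff_meta_charset(raw, fallback="iso-8859-1"):
    """Return the charset named in a <meta> tag, or the fallback."""
    match = META_CHARSET.search(raw[:1024])  # declarations sit near the top
    if match:
        return match.group(1).decode("ascii")
    return fallback

# Sample bytes standing in for r.content.
page = (b'<html><head><meta content="text/html; charset=utf-8" '
        b'http-equiv="content-type"/></head><body>'
        + u"Valan\u010di\u016bnui".encode("utf-8")
        + b'</body></html>')

encoding = sniff_meta_charset(page)
decoded = page.decode(encoding)
print(encoding)
```

A real parser handles many more cases (byte order marks, the HTML5 <meta charset> form, conflicting declarations), so in practice passing r.content to BeautifulSoup remains the simpler route.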