Python 3: Can't properly encode and print a downloaded string with \xXX literals

https://stackoverflow.com/questions/23416691

13-07-2023

Question

So here's the problem. I want to, for example, download and print a list of all possible languages from

https://www.fanfiction.net/game/Pok%C3%A9mon/

(Visible under 'filters' button).

In the HTML, it's represented as the following series of options:

<option value='17' >Svenska<option value='31' >čeština<option value='10' >Русский
<option value='39' >देवनागरी<option value='38' >ภาษาไทย<option value='5' >中文<option value='6' >日本語

I download it using the urllib.request package:

import urllib.request

def getByUrl(self, url):
    response = urllib.request.urlopen(url)
    html = response.read()  # read() returns the raw response body as bytes
    return html

and then, I try to display it like this:

@staticmethod
def fromCollection_getPossibleLanguages(self,pageContent):
        parsedHtml = BeautifulSoup(pageContent)
        possibleMatches = parsedHtml.findAll('select',{'name':'languageid','class':'filter_select'})
        possibleMatches = possibleMatches[0].findAll('option')

        for match in possibleMatches:
            print(str(match.text.encode('unicode')) + " - " + str(match.get('value')))

However, all my attempts to play with the .encode() function (e.g. passing 'utf-8' or 'unicode' as the argument) have failed to display anything more than, for example:

b'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9' - 10

I'm displaying it in Mac OS X's Terminal and in Eclipse's console view, with the same result in both.

Solution

You don't need to encode at all. BeautifulSoup has already decoded the response bytes to Unicode values, and print() can take care of the rest here.
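The following minimal sketch illustrates why the question's output looks the way it does. It assumes a hard-coded sample string rather than the downloaded page, and the variable names are only illustrative. Calling .encode() turns text into bytes, and str() on a bytes object returns its b'...' representation instead of the text itself.

```python
text = "Русский"                   # a str: a sequence of Unicode code points

encoded = text.encode("utf-8")     # bytes: the UTF-8 representation of the text
wrapped = str(encoded)             # str() of bytes gives its repr, "b'...'"
decoded = encoded.decode("utf-8")  # decoding turns the bytes back into text

print(wrapped)                     # the b'\xd0\xa0...' form seen in the question
print(decoded, "-", 10)            # readable text, no encoding step needed
```

Printing the str directly, as in the last line, leaves the conversion to the terminal's encoding up to print().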

However, the page is malformed, as there are no closing </option> tags. This can confuse the standard HTML parser. Install either the lxml or the html5lib package and pass its name to BeautifulSoup, and the page can be parsed correctly:

# with lxml installed:
parsedHtml = BeautifulSoup(pageContent, 'lxml')

# or, with html5lib installed:
parsedHtml = BeautifulSoup(pageContent, 'html5lib')

Next, you can select the <option> tags with one CSS selector:

possibleMatches = parsedHtml.select('select[name=languageid] option')

for match in possibleMatches:
    print(match.text, "-", match.get('value'))

Demo:

>>> possibleMatches = soup.select('select[name=languageid] option')
>>> for match in possibleMatches:
...     print(match.text, "-", match.get('value'))
... 
Language - 0
Bahasa Indonesia - 32
Català - 34
Deutsch - 4
Eesti - 41
English - 1
Español - 2
Esperanto - 22
Filipino - 21
Français - 3
Italiano - 11
Język polski - 13
LINGUA LATINA - 35
Magyar - 14
Nederlands - 7
Norsk - 18
Português - 8
Română - 27
Suomi - 20
Svenska - 17
čeština - 31
Русский - 10
देवनागरी - 39
ภาษาไทย - 38
中文 - 5
日本語 - 6
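
If installing lxml or html5lib is not an option, a rough alternative is to read the options with the standard library's event-based html.parser.HTMLParser. This is not the approach used above, only a sketch under assumptions: the OptionCollector class and the shortened sample markup are hypothetical, and the real page may contain constructs this does not handle. Each new <option> start tag is treated as the end of the previous one, so the missing closing tags do not matter.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect (text, value) pairs for <option> tags inside one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False
        self.current_value = None   # value of the <option> being read, if any
        self.current_text = []
        self.options = []

    def _flush(self):
        # Store the pending option, if any; unclosed tags end here implicitly.
        if self.current_value is not None:
            text = "".join(self.current_text).strip()
            self.options.append((text, self.current_value))
        self.current_value = None
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.in_select = attrs.get("name") == self.select_name
        elif tag == "option" and self.in_select:
            self._flush()           # a new <option> ends the previous one
            self.current_value = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()
        if tag == "select":
            self.in_select = False

    def handle_data(self, data):
        if self.current_value is not None:
            self.current_text.append(data)


# Shortened, hypothetical sample modelled on the markup quoted in the question.
sample = ("<select name='languageid' class='filter_select'>"
          "<option value='17' >Svenska<option value='31' >čeština"
          "<option value='10' >Русский</select>")

collector = OptionCollector("languageid")
collector.feed(sample)
collector.close()
for text, value in collector.options:
    print(text, "-", value)
```

Note that feed() expects a str, so bytes returned by response.read() would have to be decoded first, for example with .decode('utf-8') if the page is known to be UTF-8.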

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow