Question

I want to download and print the list of all available languages from

https://www.fanfiction.net/game/Pok%C3%A9mon/

(the list is visible under the 'filters' button).

In the HTML, it is represented as the following series of options:

<option value='17' >Svenska<option value='31' >čeština<option value='10' >Русский
<option value='39' >देवनागरी<option value='38' >ภาษาไทย<option value='5' >中文<option value='6' >日本語

I download the page using the urllib.request package:

def getByUrl(self,url):
    response = urllib.request.urlopen(url)
    html = response.read()
    return html

and then I try to display it like this:

@staticmethod
def fromCollection_getPossibleLanguages(self,pageContent):
        parsedHtml = BeautifulSoup(pageContent)
        possibleMatches = parsedHtml.findAll('select',{'name':'languageid','class':'filter_select'})
        possibleMatches = possibleMatches[0].findAll('option')

        for match in possibleMatches:
            print(str(match.text.encode('unicode')) + " - " + str(match.get('value')))

However, all my attempts to play with the .encode() function (e.g. passing 'utf-8' or 'unicode' as the argument) have failed to display anything more than, for example:

b'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9' - 10

I'm displaying it in the Mac OS X Terminal and in Eclipse's console view, with the same result in both.


Solution

You don't need to encode at all. BeautifulSoup has already decoded the response bytes to Unicode values, and print() can take care of the rest here.
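To make the difference concrete, here is a small sketch using only the standard library. The variable names are illustrative and not from the question; the label is the Russian entry from the option list. It shows that .encode() turns the text into a bytes object, and that calling str() on bytes in Python 3 produces the b'...' form seen in the question.

```python
# Illustrative only: "text" stands in for match.text, which is already a str.
text = "Русский"

# .encode() converts the str into raw UTF-8 bytes.
encoded = text.encode("utf-8")

# str() on a bytes object gives its repr, not the readable text.
as_printed = str(encoded)

# Decoding the bytes gives back the original text.
round_trip = encoded.decode("utf-8")

print(as_printed)  # b'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9'
print(text)        # the readable label, assuming the console can display it
```

Printing the str directly is therefore all that is needed, provided the terminal uses an encoding such as UTF-8 that can represent the characters.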

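However, the page is malformed, as there are no closing </option> tags. This can confuse Python's built-in HTML parser (html.parser), which BeautifulSoup can use when no other parser is installed. Install the lxml or html5lib package and pass its name to BeautifulSoup, and the page can be parsed correctly: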

parsedHtml = BeautifulSoup(pageContent, 'lxml')

or

parsedHtml = BeautifulSoup(pageContent, 'html5lib')

Next, you can select the <option> tags with one CSS selector:

possibleMatches = parsedHtml.select('select[name=languageid] option')

for match in possibleMatches:
    print(match.text, "-", match.get('value'))

Demo:

>>> possibleMatches = parsedHtml.select('select[name=languageid] option')
>>> for match in possibleMatches:
...     print(match.text, "-", match.get('value'))
... 
Language - 0
Bahasa Indonesia - 32
Català - 34
Deutsch - 4
Eesti - 41
English - 1
Español - 2
Esperanto - 22
Filipino - 21
Français - 3
Italiano - 11
Język polski - 13
LINGUA LATINA - 35
Magyar - 14
Nederlands - 7
Norsk - 18
Português - 8
Română - 27
Suomi - 20
Svenska - 17
čeština - 31
Русский - 10
देवनागरी - 39
ภาษาไทย - 38
中文 - 5
日本語 - 6
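If neither lxml nor html5lib can be installed, the same list can be extracted with the standard library alone. The following is a rough sketch of that alternative, not the approach used above: the class name LanguageOptionParser and the helper parse_language_options are hypothetical, and it assumes the page bytes have already been decoded to a str (for example with pageContent.decode('utf-8'), which is itself an assumption about the page encoding). It treats each new <option> tag as implicitly closing the previous one, which is how the unclosed tags are meant to be read.

```python
from html.parser import HTMLParser


class LanguageOptionParser(HTMLParser):
    """Collect (value, label) pairs from <option> tags inside one <select>."""

    def __init__(self, select_name):
        super().__init__(convert_charrefs=True)
        self.select_name = select_name
        self.in_select = False
        self.current_value = None
        self.current_text = []
        self.options = []

    def _flush(self):
        # Store the option collected so far, if there is one.
        if self.current_value is not None:
            label = "".join(self.current_text).strip()
            self.options.append((self.current_value, label))
        self.current_value = None
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._flush()
            self.in_select = attrs.get("name") == self.select_name
        elif tag == "option" and self.in_select:
            # A new <option> implicitly closes the previous one.
            self._flush()
            # Options without a value attribute are skipped in this sketch.
            self.current_value = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "option" and self.in_select:
            self._flush()
        elif tag == "select" and self.in_select:
            self._flush()
            self.in_select = False

    def handle_data(self, data):
        if self.in_select and self.current_value is not None:
            self.current_text.append(data)


def parse_language_options(html_text, select_name="languageid"):
    """Return a list of (value, label) tuples for the named <select>."""
    parser = LanguageOptionParser(select_name)
    parser.feed(html_text)
    parser.close()
    return parser.options
```

Each returned pair could then be printed with print(label, "-", value), matching the format of the demo output.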
Licensed under: CC-BY-SA with attribution