Read multilanguage strings from html via Python 2.7

https://stackoverflow.com/questions/18810507

28-06-2022

Question

I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read text that contains several languages. Here is my script, hoping it makes things clearer.

import urllib2
import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'

page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

print data[0]['content'].encode("utf-8")

The result I get is:

BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text

The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to find the exact encoding of the language of each script?

PS: I would like to mention that the site was selected at random, as it is representative of the problem I am encountering.

Thank you in advance!

Solution

The problem is with the terminal where you are printing the result. The script works fine, and if you write the data to a file you will get it correctly.
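To see why the terminal is the likely culprit, here is a minimal sketch that reproduces this kind of garbling without any network access. It encodes a short string as UTF-8 and then decodes the same bytes with a different code page, which is roughly what a console does when its encoding does not match the bytes it receives. The helper `render_as` is hypothetical, and `cp737` (a Greek DOS code page) is only a guess based on the characters in the garbled output; the question does not say which terminal was used.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def render_as(text, wire_encoding, terminal_encoding):
    """Encode text as the script does, then decode the bytes the way a
    terminal using terminal_encoding would interpret them."""
    raw = text.encode(wire_encoding)
    # errors="replace" keeps the sketch from raising on unmapped bytes
    return raw.decode(terminal_encoding, "replace")


original = "BBC中文网"

# ASSUMPTION: cp737 is a guess at the asker's console code page
garbled = render_as(original, "utf-8", "cp737")

# The same bytes decoded with the encoding they were written in
intact = render_as(original, "utf-8", "utf-8")
```

Under that assumption, `garbled` comes out as `BBCϊ╕φόΨΘύ╜Σ`, while `intact` is the original string. The data is the same in both cases; only the interpretation of the bytes differs.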

Example:

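import urllib2
from bs4 import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'

# Decode the raw response bytes into a unicode string
page = urllib2.urlopen(url).read().decode("utf-8")
# Name the parser explicitly so bs4 does not have to pick one
dom = BeautifulSoup(page, "html.parser")
data = dom.findAll('meta', {'name' : 'keywords'})

# Write UTF-8 bytes straight to a file, bypassing the terminal's encoding
with open("test.txt", "wb") as myfile:
    myfile.write(data[0]['content'].encode("utf-8"))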

test.txt:

BBC中文网，主页，bbcchinese.com, email news, newsletter, subscription, full text

Which OS and terminal are you using?
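A quick way to check this from Python itself is to ask which encoding it believes the terminal uses. The sketch below is an illustration, not a definitive fix: `terminal_encoding` and `encode_for_terminal` are hypothetical helper names, and the fallback order (stdout encoding, then the locale's preferred encoding, then ASCII) is an assumption about what is reasonable.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import locale
import sys


def terminal_encoding():
    """Best guess at the encoding Python uses when printing to stdout."""
    # sys.stdout.encoding can be None (for example when output is piped
    # on Python 2), so fall back to the locale and finally to ASCII.
    return (getattr(sys.stdout, "encoding", None)
            or locale.getpreferredencoding(False)
            or "ascii")


def encode_for_terminal(text, encoding=None):
    """Encode text for the terminal, replacing unsupported characters
    with '?' instead of raising UnicodeEncodeError."""
    return text.encode(encoding or terminal_encoding(), "replace")


print("stdout encoding:", terminal_encoding())
```

If the reported encoding is not UTF-8 (or another encoding that covers Chinese), the terminal cannot show the text as written. Switching the terminal to UTF-8, or setting the `PYTHONIOENCODING` environment variable to `utf-8` before running the script, may help, depending on the OS and the console font.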

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow