I just want to download this URL, but I'm getting an error... unicode (Python)

https://stackoverflow.com/questions/1808612

05-07-2019

Question

theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent',random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource =  htmlSource.encode('utf8')
return htmlSource

Look at the unicode part. I have tried both of those options, and neither works.

htmlSource =  htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)

And this is what I get when I try the longer way of encoding:

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1

Solution

Your HTML data is a byte string that arrives from the internet already encoded in some encoding. Before you encode it to utf-8, you must decode it first.

Python is implicitly trying to decode it with the default ascii codec (that is why you get a UnicodeDecodeError rather than a UnicodeEncodeError).
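To make the hidden step visible, here is a minimal sketch. It is written for Python 3, which is an assumption (the question uses Python 2). Python 3 `bytes` objects have no `.encode` method, so the implicit ASCII decode that Python 2 performs is spelled out by hand. The helper name `py2_style_encode` is hypothetical.

```python
def py2_style_encode(raw, target="utf-8"):
    """Emulate what Python 2 does for `byte_string.encode(target)`.

    Python 2 first decodes the byte string with the default 'ascii' codec
    and only then encodes the resulting text. (Hypothetical helper name.)
    """
    text = raw.decode("ascii")  # the implicit step; fails on any byte >= 0x80
    return text.encode(target)


# "abc " followed by the bytes quoted in the MySQL warning in the question.
sample = b"abc \xe7\xb9\x81\xe9\xab\x94"

error_name = None
failing_byte = None
try:
    py2_style_encode(sample)
except UnicodeDecodeError as exc:
    # The failure happens while decoding, before any encoding takes place.
    error_name = type(exc).__name__
    failing_byte = sample[exc.start]

print(error_name, hex(failing_byte))
```

Pure ASCII input passes through unchanged, which is why the problem only shows up on pages containing non-ASCII characters.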

You can solve the problem by explicitly decoding the byte string with the appropriate encoding before you re-encode it to utf-8.

Example:

utf8encoded = htmlSource.decode('some_encoding').encode('utf-8')

Replace 'some_encoding' with the encoding the page was actually written in.
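As a concrete illustration of the decode-then-encode step, the following Python 3 sketch transcodes a byte string to UTF-8. The helper name `transcode_to_utf8` and the choice of Big5 as the source encoding are assumptions made for the example; the question never states which encoding the page uses.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode `raw` from `source_encoding`, then re-encode it as UTF-8.

    Hypothetical helper; `source_encoding` must be the encoding the bytes
    were really written in, otherwise decoding fails or yields garbage.
    """
    text = raw.decode(source_encoding)  # bytes -> text
    return text.encode("utf-8")         # text -> UTF-8 bytes


# Build sample input in Big5 (an assumed source encoding for illustration).
big5_bytes = "繁體".encode("big5")
utf8_bytes = transcode_to_utf8(big5_bytes, "big5")
print(utf8_bytes)
```

If the bytes are already UTF-8, passing "utf-8" as the source encoding simply validates them and returns an identical byte string.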

You have to know which encoding a byte string uses before you can decode it.
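One common place to look for the encoding is the charset parameter of the HTTP Content-Type header, which the commented-out lines in the question also try to read. The Python 3 sketch below parses that header with the standard library instead of splitting on 'charset=', so a header without a charset falls back to a default rather than returning the whole header value. The helper name and the "utf-8" fallback are assumptions, and a server can still declare the wrong charset.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper. Falls back to `default` (an assumed value) when
    the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns failobj if absent.
    return msg.get_content_charset(failobj=default)


declared = charset_from_content_type("text/html; charset=Big5")
missing = charset_from_content_type("text/html")
print(declared, missing)
```

The returned name can then be passed to `.decode()` on the downloaded bytes.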

Other tips

Shouldn't you decode instead? htmlSource = htmlSource.decode('utf8')

decode means "decode htmlSource from the utf8 encoding".

encode means "encode htmlSource to the utf8 encoding".
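Because you are extracting existing data (crawling it from a website), you need to decode it. When you insert it into MySQL, you may need to encode it as utf8, depending on the collation of the MySQL database, table, and fields.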
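The two directions can be checked on the bytes quoted in the MySQL warning from the question, which happen to be valid UTF-8. This Python 3 sketch only shows the type on each side of the conversion; it does not prove that the whole page is UTF-8.

```python
# Bytes copied from the MySQL warning in the question.
raw = b"\xe7\xb9\x81\xe9\xab\x94"

text = raw.decode("utf-8")     # decode: bytes from the network -> text
stored = text.encode("utf-8")  # encode: text -> bytes for storage

print(type(raw).__name__, len(raw))    # six bytes
print(type(text).__name__, len(text))  # two characters
print(stored == raw)
```

Decoding happens once when the data comes in, and encoding happens once when it goes out; everything in between works on text.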

You probably want to decode the UTF-8 rather than encode it:

htmlSource = htmlSource.decode('utf8')

License: CC-BY-SA with attribution

Not affiliated with StackOverflow
