I'm just trying to download this URL... but it gives me an error! ... unicode.. (Python)

https://stackoverflow.com/questions/1808612

05-07-2019

Question

theurl = 'http://bit.ly/6IcCtf/'
urlReq = urllib2.Request(theurl)
urlReq.add_header('User-Agent',random.choice(agents))
urlResponse = urllib2.urlopen(urlReq)
htmlSource = urlResponse.read()
if unicode == 1:
    #print urlResponse.headers['content-type']
    #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
    #htmlSource = unicode(htmlSource, encoding)
    htmlSource =  htmlSource.encode('utf8')
return htmlSource

Please look at the unicode part. I tried both options, but neither worked.

htmlSource =  htmlSource.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 370747: ordinal not in range(128)

And this is what I get when I try the longer encoding approach (the commented-out lines):

_mysql_exceptions.Warning: Incorrect string value: '\xE7\xB9\x81\xE9\xAB\x94...' for column 'html' at row 1

Solution

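Your HTML data is a byte string that arrives from the internet already encoded in some encoding. Before you can encode it to utf-8, you must decode it first.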

Python implicitly tries to decode it (which is why you get a UnicodeDecodeError and not a UnicodeEncodeError).
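The implicit step can be made visible with a small sketch. It assumes Python 3, where bytes have no encode() method, so the hidden ASCII decode that Python 2 performs is written out by hand; py2_style_encode is a hypothetical helper name, and the sample bytes are the first six from the MySQL warning in the question.

```python
# Sample bytes: the first six bytes shown in the MySQL warning above.
# They are valid UTF-8 for two CJK characters.
raw = b"\xe7\xb9\x81\xe9\xab\x94"


def py2_style_encode(data, target="utf-8"):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode.

    Hypothetical helper for illustration only.
    """
    return data.decode("ascii").encode(target)


try:
    py2_style_encode(raw)
    error_name = None
except UnicodeDecodeError as exc:
    # The failure happens in the decode step, hence the error type.
    error_name = type(exc).__name__
    print(error_name, "-", exc)

# Decoding explicitly with the encoding the bytes are really in succeeds.
text = raw.decode("utf-8")
print(text)
```

Byte 0xe7 is outside the ASCII range, so the hidden decode fails before any encoding is attempted.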

You can solve the problem by explicitly decoding your bytestring (using the appropriate encoding) before trying to re-encode it to utf-8.

Example:

utf8encoded = htmlSource.decode('some_encoding').encode('utf-8')

Replace 'some_encoding' with the encoding the page was actually encoded in.

You have to know which encoding a string is using before you can decode it.
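One common place to look for the encoding is the charset parameter of the Content-Type response header, which is what the commented-out lines in the question attempt. The following is a minimal sketch, assuming Python 3; charset_from_content_type and decode_body are hypothetical helper names, and falling back to utf-8 is an assumption rather than a rule. Servers sometimes omit or misreport the charset, so treat the result as a best guess.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or a default.

    Hypothetical helper; the utf-8 default is an assumption.
    """
    # Skip the media type itself and scan the ';'-separated parameters.
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip("\"'").lower()
    return default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1"))
```

Unlike split('charset=')[-1], this returns the default when the header carries no charset at all, instead of returning the whole header value.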

Other tips

Shouldn't you decode instead? htmlSource = htmlSource.decode('utf8')

decode means "decode htmlSource from the utf8 encoding"

encode means "encode htmlSource to the utf8 encoding"

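Since you are extracting existing data (crawling a website), you need to decode it. When you insert it into MySQL, you may need to encode it as utf8, depending on the collation of your MySQL database, table, and fields.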
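The two directions can be seen side by side in a short sketch, assuming Python 3 and assuming the fetched page is UTF-8 (the sample bytes are the first six from the MySQL warning in the question). The database insert itself is not shown.

```python
# Stand-in for bytes read from the HTTP response.
fetched = b"\xe7\xb9\x81\xe9\xab\x94"

# decode: bytes in a known encoding -> text
text = fetched.decode("utf8")

# encode: text -> bytes in the encoding the storage layer expects
stored = text.encode("utf8")

print(type(text).__name__, len(text))      # str 2
print(type(stored).__name__, len(stored))  # bytes 6
```

The text has two characters while the encoded form has six bytes, which is why the two types cannot be used interchangeably.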

You probably want to decode from UTF-8, not encode to it:

htmlSource =  htmlSource.decode('utf8')

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
