How do I convert a file's format from Unicode to ASCII using Python?

https://stackoverflow.com/questions/175240

05-07-2019

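Question

I use a third-party tool that outputs a file in Unicode format. However, I would prefer it to be in ASCII. The tool has no setting to change the file format.

What is the best way to convert the entire file format using Python?

Solution

You can convert the file easily enough just using the unicode function, but you'll run into problems with Unicode characters that have no straight ASCII equivalent.

This blog recommends the unicodedata module, which seems to take care of roughly converting characters that have no direct corresponding ASCII value, e.g.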

>>> title = u"Klüft skräms inför på fédéral électoral große"

would typically be converted to

Klft skrms infr p fdral lectoral groe

which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
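For readers on Python 3, where every str is already Unicode and the u prefix is optional, the same idea might look like the sketch below. The helper name to_ascii is made up for illustration; it is not part of unicodedata.

```python
import unicodedata

def to_ascii(text):
    """Strip accents by decomposing characters, then dropping non-ASCII code points."""
    # NFKD splits 'ü' into 'u' plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' silently drops whatever is still outside ASCII: the combining
    # marks, and letters such as 'ß' that have no decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")

title = "Klüft skräms inför på fédéral électoral große"
print(to_ascii(title))  # Kluft skrams infor pa federal electoral groe
```

Note that 'ß' still disappears, exactly as in the Python 2 session above, because NFKD has no decomposition for it.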

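Other answers

I think this is a deeper issue than you realize. Simply changing the file from Unicode into ASCII is easy; however, getting all of the Unicode characters to translate into reasonable ASCII counterparts (many letters are not available in both encodings) is another matter.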
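To make the "reasonable counterparts" problem concrete, here is a rough Python 3 sketch that combines a small hand-written replacement table with NFKD normalization. The table contents and the names FALLBACKS and transliterate are assumptions chosen for illustration; a real project would need a table suited to its own languages.

```python
import unicodedata

# Hypothetical, hand-picked replacements for letters that NFKD cannot
# decompose into an ASCII base letter plus combining marks.
FALLBACKS = str.maketrans({
    "ß": "ss",
    "ø": "o", "Ø": "O",
    "æ": "ae", "Æ": "AE",
    "ł": "l", "Ł": "L",
})

def transliterate(text):
    """Apply the manual table first, then strip remaining accents."""
    text = text.translate(FALLBACKS)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(transliterate("große"))       # grosse
print(transliterate("smørrebrød"))  # smorrebrod
```

Anything that is neither in the table nor decomposable is still dropped silently, so this only narrows the gap; it does not close it.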

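This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html

Here's a useful quote from the site:

Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding: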

> >>> unicode('hello')
> u'hello'
> >>> unicode('hello', 'ascii')
> u'hello'
> >>> unicode('hello', 'iso-8859-1')
> u'hello'
> >>>

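All three of these return the same thing, since the characters in 'hello' are common to all three encodings.

Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.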

> >>> a = unicode('André','latin-1')
> >>> a
> u'Andr\202'

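If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.

Unicode supports all the common operations such as iteration and splitting. We won't run over them here.

By the way, there is a Linux command, iconv, that does this kind of job.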

iconv -f utf8 -t ascii <input.txt >output.txt

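Here's some simple (and stupid) code to do encoding translation. I'm assuming (but you shouldn't) that the input file is in UTF-16 (Windows calls this simply 'Unicode').

input_codec = 'UTF-16'
output_codec = 'ASCII'

# Binary mode keeps the raw bytes intact so they can be decoded explicitly.
unicode_file = open('filename', 'rb')
unicode_data = unicode_file.read().decode(input_codec)
unicode_file.close()
ascii_file = open('new filename', 'wb')
ascii_file.write(unicode_data.encode(output_codec))
ascii_file.close()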

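Note that this will not work if there are any characters in the Unicode file that are not also ASCII characters. You can do the following to turn the unrecognized characters into '?'s:

ascii_file.write(unicode_data.encode(output_codec, 'replace'))

Check out the docs for more simple choices. If you need to do anything more sophisticated, you may wish to check out The UNICODE Hammer at the Python Cookbook.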
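As a quick illustration of those "more simple choices", this Python 3 sketch runs the same string through several of the standard error handlers accepted by str.encode. The sample text and the variable names are arbitrary.

```python
text = "Andr\u00e9"  # 'André', written with an escape so the source stays ASCII

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: text.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)
# ignore b'Andr'
# replace b'Andr?'
# xmlcharrefreplace b'Andr&#233;'
# backslashreplace b'Andr\\xe9'

# The default handler, 'strict', raises instead of guessing.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict failed:", exc.reason)
```

Only 'xmlcharrefreplace' and 'backslashreplace' keep enough information to recover the original character later; 'ignore' and 'replace' are lossy.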

Like this:

uc = open(filename).read().decode('utf8')
ascii = uc.encode('ascii')

But note that this will fail with a UnicodeEncodeError exception if there are any characters that can't be converted to ASCII.

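Edit: As Pete Karl just pointed out, there is no one-to-one mapping from Unicode to ASCII. So some characters simply can't be converted in an information-preserving way. Moreover, standard ASCII is more or less a subset of UTF-8, so you don't really even need to do any decoding.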
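The subset claim is easy to check. The sketch below, assuming Python 3, shows that ASCII-only text produces identical bytes under both codecs, and offers a small test for whether some raw bytes are already pure ASCII. The function name is_pure_ascii is hypothetical.

```python
def is_pure_ascii(raw):
    """Return True if every byte in raw decodes as ASCII."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True

text = "hello"
# For ASCII-only text the two encodings yield byte-for-byte identical output.
print(text.encode("ascii") == text.encode("utf-8"))  # True

print(is_pure_ascii(b"hello"))                    # True
print(is_pure_ascii("Andr\u00e9".encode("utf-8")))  # False
```

If a UTF-8 file passes this check, it is already a valid ASCII file and needs no conversion. This does not hold for UTF-16, where even plain letters occupy two bytes.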

For my problem, where I just wanted to skip the non-ASCII characters and output only ASCII, the solution below worked really well:

    import unicodedata
    input = open(filename).read().decode('UTF-16')
    output = unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore')

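It's important to note that there is no 'Unicode' file format. Unicode can be encoded to bytes in several different ways, most commonly UTF-8 or UTF-16. You'll need to know which one your third-party tool is outputting. Once you know that, converting between different encodings is pretty easy: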

in_file = open("myfile.txt", "rb")
out_file = open("mynewfile.txt", "wb")

in_byte_string = in_file.read()
unicode_string = in_byte_string.decode('UTF-16')
out_byte_string = unicode_string.encode('ASCII')

out_file.write(out_byte_string)
out_file.close()
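If you are not sure which encoding the tool writes, a byte order mark at the start of the file is one hint. The following Python 3 sketch only recognizes the UTF-8 and UTF-16 marks and is a guess, not a guarantee: files without a BOM return None, and a UTF-32 little-endian file would be misreported as UTF-16 because its mark begins with the same two bytes. The function name guess_encoding_from_bom is made up for this example.

```python
import codecs

def guess_encoding_from_bom(raw):
    """Return a codec name based on a leading byte order mark, or None."""
    if raw.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' decodes UTF-8 and strips the leading BOM.
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The plain 'utf-16' codec reads the BOM to pick the byte order.
        return "utf-16"
    return None

sample = "hi".encode("utf-16")          # the encoder prepends a BOM
print(guess_encoding_from_bom(sample))  # utf-16
print(guess_encoding_from_bom(b"hi"))   # None
```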

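As mentioned in other replies, you'll probably want to supply an error handler to the encode method. Using 'replace' as the error handler is simple, but it will mangle your text if it contains characters that cannot be represented in ASCII.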
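On Python 3 the decode and encode steps can be handed to open() itself, which also avoids reading the whole file into memory. This is a sketch under the assumption that the source really is UTF-16; the function name convert_file and its default arguments are illustrative.

```python
def convert_file(src_path, dst_path, src_encoding="utf-16", errors="replace"):
    """Re-encode a text file to ASCII, line by line."""
    # newline="" leaves line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="ascii", errors=errors, newline="") as dst:
        for line in src:
            # The errors handler decides what happens to non-ASCII characters.
            dst.write(line)
```

Calling convert_file("myfile.txt", "mynewfile.txt") would write '?' for every character outside ASCII; passing errors="strict" makes the conversion fail loudly instead.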

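As other posters have noted, ASCII is a subset of Unicode.

However, if you:

have a legacy app
don't control the code for that app
are sure your input falls into the ASCII subset

then the example below shows how to do it: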

mystring = u'bar'
type(mystring)
    <type 'unicode'>

myasciistring = (mystring.encode('ASCII'))
type(myasciistring)
    <type 'str'>

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow