How to check if a string in Python is in ASCII?

https://stackoverflow.com/questions/196345

10-07-2019
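Question

I want to check whether a string is in ASCII or not.

I am aware of ord(), but when I try ord('é'), I get TypeError: ord() expected a character, but string of length 2 found. I understood that it is caused by the way I built Python (as explained in ord()'s documentation).

Is there another way to check?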

Solution

def is_ascii(s):
    return all(ord(c) < 128 for c in s)
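
As a quick illustration of the function above, here is a minimal sketch that assumes Python 3, where every str is a sequence of Unicode code points. The sample strings are made up for the example.

```python
def is_ascii(s):
    # True when every code point falls in the 7-bit ASCII range 0..127.
    return all(ord(c) < 128 for c in s)

print(is_ascii("Python"))      # True
print(is_ascii("caf\u00e9"))   # False: U+00E9 is outside ASCII
print(is_ascii(""))            # True: all() over an empty sequence is True
```

Note that the empty string counts as ASCII here, which may or may not be what you want.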

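Other tips

I think you are not asking the right question.

A string in Python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, took it from keyboard input, etc.) may have encoded a unicode string in ASCII to produce your string, but that is where you need to go for an answer.

Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ASCII?" You can answer that by trying: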

try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not an ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"

The Python 3 way:

isascii = lambda s: len(s) == len(s.encode())

To check, pass the test strings:

str1 = "♥O◘♦♥O◘♦"
str2 = "Python"

print(isascii(str1))  # prints False
print(isascii(str2))  # prints True

New in Python 3.7 (bpo32677)

No more tiresome or inefficient ASCII checks on strings: the new built-in str / bytes / bytearray method .isascii() checks whether the string is ASCII.

print("is this ascii?".isascii())
# True
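
A short sketch of how the method behaves on the three types mentioned above, assuming Python 3.7 or later. The sample values are made up for illustration.

```python
# str.isascii(), bytes.isascii() and bytearray.isascii() exist from Python 3.7.
samples = [
    "is this ascii?",     # plain ASCII text
    "na\u00efve",         # contains U+00EF
    b"raw bytes",         # every byte is below 0x80
    b"\xc3\xa9",          # UTF-8 encoding of U+00E9
    bytearray(b"abc"),
    "",                   # the empty string counts as ASCII
]

results = [value.isascii() for value in samples]
for value, result in zip(samples, results):
    print(repr(value), result)
```

On interpreters older than 3.7 the method does not exist, so one of the other approaches in this thread would be needed.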

Ran into something like this recently. For future reference:

import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'

Which you could then use with:

string_ascii = string.decode(encoding['encoding']).encode('ascii')

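Your question is incorrect; the error you see is not a result of how you built Python, but of a confusion between byte strings and unicode strings.

Byte strings (e.g. "foo" or 'bar' in Python syntax) are sequences of octets: numbers from 0 to 255. Unicode strings (e.g. u"foo" or u'bar') are sequences of Unicode code points: numbers from 0 to 1114111 (0x10FFFF). But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.

Instead of ord(u'é'), try this: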

>>> [ord(x) for x in u'é']

That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 769].

To reverse this, use unichr() rather than chr():

>>> unichr(233)
u'\xe9'

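This character may actually be represented by either a single Unicode "code point" or several, which themselves represent either graphemes or characters. It is either "e with an acute accent" (code point 233), or "e" (code point 101) followed by "an acute accent on the previous character" (code point 769). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.

Most of the time you should not have to care about this, but it can become an issue if you are iterating over a unicode string, because iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.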
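
The composed and decomposed forms can be compared with a small sketch like the following. It assumes Python 3, where string literals are already Unicode, so the u prefix is not needed.

```python
import unicodedata

composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print([ord(c) for c in composed])     # [233]
print([ord(c) for c in decomposed])   # [101, 769]

# NFC composes where possible, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, len(nfc))      # True 1
print(nfd == decomposed, len(nfd))    # True 2
```

Neither form is ASCII as a whole, although the decomposed form starts with the ASCII letter "e".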

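The Unicode Glossary can be a helpful guide to some of these issues: it points out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.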

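How about doing this?

def isAscii(s):
    # string.ascii_letters covers only A-Z and a-z, so digits, spaces and
    # punctuation would be rejected; test against all 128 ASCII characters.
    ascii_chars = set(map(chr, range(128)))
    for c in s:
        if c not in ascii_chars:
            return False
    return True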

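Vincent Marchetti has the right idea, but str.decode is not available in Python 3 (str objects no longer have a decode method). In Python 3 you can make the same test with str.encode: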

try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii

Note that the exception you want to catch has also changed, from UnicodeDecodeError to UnicodeEncodeError.
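
Wrapped into a reusable function, the same test could look like the sketch below. The helper name encodes_as_ascii is made up for this example and assumes Python 3.

```python
def encodes_as_ascii(text):
    """Return True if text can be encoded as ASCII (hypothetical helper)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character has no ASCII representation.
        return False
    return True

print(encodes_as_ascii("plain text"))   # True
print(encodes_as_ascii("\u00e0\u00e9\u00e7"))  # False
```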

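I found this question while trying to determine how to use/encode/decode a string whose encoding I was not sure of (and how to escape/convert special characters in that string).

My first step should have been to check the type of the string. I did not realize I could get good data about its formatting from its type. This answer was very helpful and got to the real root of my issues.

If you are getting a rude and persistent

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)

particularly when you are ENCODING, make sure you are not calling unicode() on a string that already is unicode; for some terrible reason, you get ASCII codec errors. (See also the Python Kitchen recipe and the Python docs tutorials for a better understanding of how terrible this can be.)

Eventually I determined that what I wanted to do was this: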

escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))

Also helpful in debugging was setting the default coding of my file to utf-8 (put this at the beginning of your Python file):

# -*- coding: utf-8 -*-

That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').

>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'

To improve on Alexander's solution for Python 2.6 (and Python 3.x), you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html

from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)

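You could use a regular expression library that accepts the POSIX standard [[:ASCII:]] definition.

A string (str type) in Python is a series of bytes. There is no way of telling, just from looking at the string, whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.

However, if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check whether it contains characters outside the range you are concerned about.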
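
The decode-then-check idea might be sketched as follows. This assumes Python 3, where the byte series is a bytes object rather than str; the helper name decoded_is_ascii is made up for the example.

```python
def decoded_is_ascii(data, encoding):
    """Decode bytes with a known encoding, then check every code point."""
    text = data.decode(encoding)
    return all(ord(ch) < 128 for ch in text)

print(decoded_is_ascii(b"hello", "utf-8"))                       # True
print(decoded_is_ascii("h\u00e9llo".encode("utf-8"), "utf-8"))   # False
# The raw UTF-16 bytes contain a byte order mark and zero bytes,
# yet the decoded text consists of ASCII characters only.
print(decoded_is_ascii("hi".encode("utf-16"), "utf-16"))         # True
```

The last case shows why the encoding has to be known: the bytes themselves do not look like ASCII, but the text they represent does.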

Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of findall or match.

>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True

I imagine a regular expression is well-optimized for this.

To keep your code from crashing, you may want to use a try-except to catch TypeErrors:

>>> ord("¶")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

For example

def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False

import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))

To include an empty string as ASCII, change the + to *.
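
A variant of the same idea, sketched with re.fullmatch (available from Python 3.4) so that no explicit anchor is needed. The names ASCII_ONLY and regex_is_ascii are made up for this example.

```python
import re

# "*" instead of "+" so that the empty string also counts as ASCII.
ASCII_ONLY = re.compile(r"[\x00-\x7F]*")

def regex_is_ascii(s):
    # fullmatch succeeds only if the pattern covers the entire string.
    return ASCII_ONLY.fullmatch(s) is not None

print(regex_is_ascii("plain text\n"))   # True
print(regex_is_ascii(""))               # True
print(regex_is_ascii("caf\u00e9"))      # False
```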

I use the following to determine if the string is ascii or unicode:

>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode

Then just use a conditional block to define the function:

def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow