Validating a Unicode Name

https://stackoverflow.com/questions/626697

06-07-2019
|

Question

In ASCII, validating a name isn't too difficult: just make sure all the characters are alphabetical.

But what about in Unicode (utf-8) ? How can I make sure there are no commas or underscores (outside of ASCII scope) in a given string?

(ideally in Python)

Solution

Just convert bytestring (your utf-8) to unicode objects and check if all characters are alphabetic:

s.isalpha()

This method is locale-dependent for bytestrings.

OTHER TIPS

Maybe the unicodedata module is useful for this task. Especially the category() function. For existing unicode categories look at unicode.org. You can then filter on punctuation characters etc.

Depending on how you define "name", you could go with checking it against this regex:

^\w+$

However, this will allow numbers and underscores. To rule them out, you can do a second test against:

[\d_]

and make your check fail on match. These two could be combined as follows:

^(?:(?![\d_])\w)+$

But for regex performance reasons, I would rather do two separate checks.

From the docs:

\w

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

This might be a step towards a solution:

import unicodedata
EXCEPTIONS= frozenset(u"'.")
CATEGORIES= frozenset( ('Lu', 'Ll', 'Lt', 'Pd', 'Zs') )
# O'Rourke, Franklin D. Roosevelt

def test_unicode_name(unicode_name):
    return all(
      uchar in EXCEPTIONS
        or unicodedata.category(uchar) in CATEGORIES
      for uchar in unicode_name)

>>> test_unicode_name(u"Michael O'Rourke")
True
>>> test_unicode_name(u"Χρήστος Γεωργίου")
True
>>> test_unicode_name(u"Jean-Luc Géraud")
True

Add exceptions, and further checks that I possibly missed.

The letters property of the string module should give you what you want. This property is locale-specific, so as long as you know the language of the text being passed to you, you can use setlocale() and validate against those characters.

http://docs.python.org/library/string.html#module-string

As you point out, though, in a truly "unicode" world, there's no way at all to know what characters are "alphabetical" unless you know the language. If you don't know the language, you could either default to ASCII, or run through the locales for common languages.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow