How to strip unicode “punctuation” from Python string

https://stackoverflow.com/questions/5414818

29-10-2019

Question

Here's the problem: I have a Unicode string as input to a Python SQLite query. The query (a 'like' comparison) failed. It turns out the string 'FRANCE' doesn't have six characters, it has seven. And the seventh is . . . Unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?

Solution

You can use the Unicode general categories, which Python exposes through the unicodedata module:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with 'P', as you can see. So you need to filter them out character by character (using a list comprehension).
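The character-by-character filter described above could be sketched roughly as follows. This is a minimal illustration written for Python 3; the function name strip_punctuation is made up for this example and is not part of any library.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'.

    strip_punctuation is a hypothetical helper name used for illustration.
    """
    return u''.join([ch for ch in text
                     if not unicodedata.category(ch).startswith('P')])

# ',' and '!' are 'Po'; the guillemets are 'Pi' and 'Pf'.
print(strip_punctuation(u'Hello, world! \u00abFRANCE\u00bb'))  # Hello world FRANCE
```

Note that this only removes punctuation. A character such as U+FEFF is not in a 'P' category, so it passes through this filter unchanged.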

In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

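U+FEFF is a format character ('Cf'), not punctuation, so filtering on 'P' alone would not remove it. Instead, you can whitelist characters based on their categories and drop everything else.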
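A category-based whitelist might look something like the sketch below (Python 3). The set of allowed categories and the name keep_allowed are assumptions chosen for illustration; which categories you permit depends entirely on your data.

```python
import unicodedata

# Hypothetical whitelist: all letter categories, decimal digits,
# and the space separator. Adjust to suit your own input.
ALLOWED_CATEGORIES = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nd', 'Zs'}

def keep_allowed(text, allowed=ALLOWED_CATEGORIES):
    """Keep only characters whose general category is in the whitelist."""
    return u''.join([ch for ch in text
                     if unicodedata.category(ch) in allowed])

cleaned = keep_allowed(u'FRANCE\ufeff')
print(len(cleaned))  # 6
```

Because U+FEFF has category 'Cf', which is not in the set, it is dropped along with any other format, control, or punctuation characters.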

OTHER TIPS

In general, input validation should be done by using a whitelist of allowable characters if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).

If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.

For example, let's say you know "country" will only contain upper-case English letters and spaces. You could then strip out everything else, including your nasty Unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'

If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate all tens of thousands of possible unexpected unicode characters that could be thrown at you--and more and more are added to the specs as languages evolve over the years.

U+FEFF is also the byte-order mark (BOM). Just clean up your strings first to eliminate it. Note that replace removes every occurrence, while strip only removes it from the ends of the string:

>>> f = u'France\ufeff'
>>> f
u'France\ufeff'
>>> print f
France
>>> f.replace(u'\ufeff', '')
u'France'
>>> f.strip(u'\ufeff')
u'France'

Licensed under: CC-BY-SA with attribution
