Question

Is there a way to stop certain Unicode characters from being treated as non-alphanumeric? I'm tokenizing Arabic words that sometimes contain diacritics. These are classified as non-alphanumeric characters, but I don't want to remove them. I do need a space between any other non-alphanumeric character and the alphanumeric text. Can this be done with a regex? The code points I want to exclude, which represent the diacritics, are: u'\u064b', u'\u064c', u'\u064d', u'\u064e', u'\u064f', u'\u0650', u'\u0651', u'\u0652'

Is that possible?

Many thanks in advance.

Solution

Just build a custom alpha-numeric pattern for your purpose:

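import re

# Arabic diacritics that should count as part of a word
accents = [u'\u064b', u'\u064c', u'\u064d', u'\u064e', u'\u064f', u'\u0650', u'\u0651', u'\u0652']
# The capturing group makes pattern.split() keep the matched tokens
alnum = r'([\w%s]+)' % re.escape(''.join(accents))
pattern = re.compile(alnum, re.UNICODE)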

To find all tokens:

>>> test_str = "...foo" + ''.join(accents) + "...bar"
>>> test_str
'...fooًٌٍَُِّْ...bar'
>>> pattern.findall(test_str)
['fooًٌٍَُِّْ', 'bar']
>>> len(_)
2
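
For comparison, here is a minimal sketch of why the extended character class is needed. In Python 3, `\w` does not match combining marks such as these diacritics, so a plain `\w+` breaks a vocalized word at every diacritic. The sample word is an assumption chosen for illustration and does not come from the question.

```python
import re

# Diacritics listed in the question (U+064B through U+0652).
accents = ''.join(chr(cp) for cp in range(0x064B, 0x0653))
pattern = re.compile(r'([\w%s]+)' % re.escape(accents), re.UNICODE)

# Illustrative sample: three Arabic letters, each followed by a fatha (U+064E).
word = '\u0643\u064e\u062a\u064e\u0628\u064e'

plain_tokens = re.findall(r'\w+', word)   # one token per bare letter
custom_tokens = pattern.findall(word)     # the whole word as a single token

print(plain_tokens)
print(custom_tokens)
```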

Now to put a space between tokens and the rest:

>>> ' '.join(filter(None, pattern.split(test_str)))
'... fooًٌٍَُِّْ ... bar'
>>> len(_.split())
4
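
The two steps above can be wrapped in a small helper. This is only a sketch of the same approach: the names `ACCENTS`, `TOKEN`, and `space_tokens` are hypothetical and are not part of the original answer.

```python
import re

# Diacritics listed in the question (U+064B through U+0652).
ACCENTS = ''.join(chr(cp) for cp in range(0x064B, 0x0653))

# Word characters plus the diacritics; the capturing group makes
# split() return the tokens along with the text between them.
TOKEN = re.compile(r'([\w%s]+)' % re.escape(ACCENTS), re.UNICODE)


def space_tokens(text):
    """Put a single space between word tokens and runs of other characters."""
    parts = TOKEN.split(text)
    # split() yields empty strings at the edges; drop them before joining.
    return ' '.join(p for p in parts if p)


sample = '...foo' + ACCENTS + '...bar'
print(space_tokens(sample))
```

Note that whitespace already present in the input is kept as its own piece, so a string such as 'a b' comes back with extra spaces. If that matters, collapsing the result with ' '.join(result.split()) is one option.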
Licensed under: CC-BY-SA with attribution