Identifying whether a character is a digit or another Unicode character within a word in Python

https://stackoverflow.com/questions/22741339

24-06-2023

Question

I want to find out whether a word contains both digits and other characters and, if so, separate the digit part from the character part. I want to check Tamil words, e.g. ரூ.100 or ரூ100: I want to separate ரூ. and 100, and ரூ and 100. How do I do it in Python? I tried this:

    for word in f.read().strip().split(): 
      for word1, word2, word3 in zip(word,word[1:],word[2:]): 
        if word1 == "ர" and word2 == "ூ " and word3.isdigit(): 
           print word1 
           print word2 
        if word1.decode('utf-8') == unichr(0xbb0) and word2.decode('utf-8') == unichr(0xbc2): 
           print word1
           print word2

Solution

You can use the regular expression (.*?)(\d+)(.*), which captures three groups: everything before the digits, the digits themselves, and everything after them:

>>> import re
>>> pattern = ur'(.*?)(\d+)(.*)'
>>> s = u"ரூ.100"
>>> match = re.match(pattern, s, re.UNICODE)
>>> print match.group(1)
ரூ.
>>> print match.group(2)
100

Or, you can unpack matched groups into variables, like this:

>>> s = u"100ஆம்"
>>> match = re.match(pattern, s, re.UNICODE)
>>> before, digits, after = match.groups()
>>> print before

>>> print digits
100
>>> print after
ஆம்

Hope that helps.
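
The session above is Python 2 (the ur'' prefix and the print statement). A minimal sketch of the same idea for Python 3 follows, assuming the words are already decoded str objects; the helper name split_digits is made up for illustration and is not part of the original answer:

```python
import re

# Python 3: str is already Unicode, so neither the u/ur prefix
# nor the re.UNICODE flag is needed.
PATTERN = re.compile(r'(.*?)(\d+)(.*)')


def split_digits(word):
    """Return (before, digits, after), or None when the word has no digits.

    split_digits is a hypothetical helper name used only for this sketch.
    """
    match = PATTERN.match(word)
    return match.groups() if match else None


if __name__ == "__main__":
    for word in ["ரூ.100", "ரூ100", "100ஆம்", "ரூ"]:
        print(repr(word), split_digits(word))
```

Returning None for words without digits avoids the AttributeError that match.groups() would raise when re.match finds nothing.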

OTHER TIPS

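Use Unicode properties:

\pL (also written \p{L}) matches a letter in any language
\pN (also written \p{N}) matches a numeric character in any language; \p{Nd} is limited to decimal digits

Python's built-in re module does not support \p escapes; the third-party regex module does. Also note that Tamil vowel signs such as ூ (U+0BC2) are combining marks (\pM), not letters, so the first group has to allow marks as well.

In your case it could be:

([\p{L}\p{M}]+\.?)(\p{N}+)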
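
Since the built-in re module has no \p escapes, the letters-then-digits idea above can also be approximated with the standard library's unicodedata module. This is only a sketch under some assumptions: the function name split_prefix_digits is hypothetical, the match is anchored at the start of the word, and combining marks are accepted alongside letters because the Tamil vowel sign ூ (U+0BC2) has category Mc rather than a letter category:

```python
import unicodedata


def split_prefix_digits(word):
    """Split word into (letters/marks plus optional '.', digits), or None.

    split_prefix_digits is a hypothetical name used only for this sketch.
    """
    i = 0
    # Consume letters (L*) and combining marks (M*), e.g. U+0BC2.
    while i < len(word) and unicodedata.category(word[i])[0] in ("L", "M"):
        i += 1
    if i == 0:
        return None  # the word does not start with a letter
    # Optional full stop after the letters.
    if i < len(word) and word[i] == ".":
        i += 1
    # Consume numeric characters (N*).
    j = i
    while j < len(word) and unicodedata.category(word[j])[0] == "N":
        j += 1
    if j == i:
        return None  # no digits follow the prefix
    return word[:i], word[i:j]


if __name__ == "__main__":
    for word in ["ரூ.100", "ரூ100", "100"]:
        print(repr(word), split_prefix_digits(word))
```

Anything after the digits is ignored here, which mirrors a two-group pattern; a third return value could be added if the remainder matters.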

Licensed under: CC-BY-SA with attribution
