String.maketrans for English and Persian numbers

https://stackoverflow.com/questions/11879025

25-06-2021
|

Question

I have a function like this:

persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers  = '١٢٣٤٥٦٧٨٩٠'

english_trans   = string.maketrans(english_numbers, persian_numbers)
arabic_trans    = string.maketrans(arabic_numbers, persian_numbers)

text.translate(english_trans)
text.translate(arabic_trans)

I want it to translate all Arabic and English numbers to Persian. But Python says:

english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length

I tried to encode strings with Unicode utf-8 but I always got some errors! Sometimes the problem is Arabic string instead! Do you know a better solution for this job?

EDIT:

It seems the problem is Unicode characters length in ASCII. An Arabic number like '۱' is two character -- that I find out with ord(). And the length problem starts from here :-(

Solution

See unidecode library which converts all strings into UTF8. It is very useful in case of number input in different languages.

In Python 2:

>>> from unidecode import unidecode
>>> a = unidecode(u"۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'

In Python 3:

>>> from unidecode import unidecode
>>> a = unidecode("۰۱۲۳۴۵۶۷۸۹")
>>> a
'0123456789'
>>> unidecode(a)
'0123456789'

OTHER TIPS

Unicode objects can interpret these digits (arabic and persian) as actual digits - no need to translate them by using character substitution.

EDIT - I came out with a way to make your replacement using Python2 regular expressions:

# coding: utf-8

import re

# Attention: while the characters for the strings bellow are 
# dislplayed indentically, inside they are represented
# by distinct unicode codepoints

persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers  = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'


persian_regexp = u"(%s)" %  u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)

def _sub(match_object, digits):
    return english_numbers[digits.find(match_object.group(0))]

def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)

def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)


def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)

def replace_persian(text):
    return re.sub(arabic_regexp, _sub_persian, text)

Attempt that the "text" parameter must be unicode itself.

(also this code could be shortened by using lambdas and combining some expressions in a single line, but there is no point in doing so, but for loosing readability)

It should work to you up to here, but please read on the original answer I had posted

-- original answer

So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:

>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers  = u'١٢٣٤٥٦٧٨٩٠'
>>> 
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>>

By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).

It is very important to understand the basics about unicode - for everyone, even people writing English only programs who think they will never deal with any char out of the 26 latin letters. When writing code that will deal with different chars it is vital - the program can't possibly work without you knowing what you are doing except by chance.

A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now. You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.

>>> arabic_numbers  = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'

Thus, the characters loose their sense as "single entities" and as digits when encoding - the encoded object (str type in Python 2.x) is justa strrng of bytes - which nonetheless is needed when sending these characters to any output from the program - be it console, GUI Window, database, html code, etc...

unidecode converts all characters from Persian to English, If you want to change only numbers follow bellow:

In python3 you can use this code to convert any Persian|Arabic number to English number while keeping other characters unchanged:

intab='۱۲۳۴۵۶۷۸۹۰١٢٣٤٥٦٧٨٩٠'
outtab='12345678901234567890'
translation_table = str.maketrans(intab, outtab)
output_text = input_text.translate(translation_table)

You can use persiantools package:

Examples:

>>> from persiantools import digits

>>> digits.en_to_fa("0987654321")
'۰۹۸۷۶۵۴۳۲۱'

>>> digits.ar_to_fa("٠٩٨٧٦٥٤٣٢١")   # or digits.ar_to_fa(u"٠٩٨٧٦٥٤٣٢١")
'۰۹۸۷۶۵۴۳۲۱'

Use Unicode Strings:

persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
english_numbers = u'1234567890'
arabic_numbers  = u'١٢٣٤٥٦٧٨٩٠'

And make sure the encoding of your Python file is correct.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow