Why do I get strange results when tokenizing Arabic text with Python?

Question

I've been using NLTK in a research project to tokenize and analyze Arabic text. The problem is that when I run this code:

import nltk

bsm = 'بسم الله الرحمن الريحم'
wordsBsm = nltk.tokenize.wordpunct_tokenize(bsm)
print " ".join(wordsBsm)

I get this output:

� � س� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

I don't know how to solve this problem!

Solution

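If you're using Python 2.x, then as bobince said, make the literal a Unicode string by prefixing it with u. Without the prefix it is a byte string, and the tokenizer can split in the middle of a multi-byte UTF-8 character, which is most likely what produces the garbage above. This should work: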

bsm = u'بسم الله الرحمن الريحم'

If you're using Python 3.x, string literals are already Unicode, so it should work without the u prefix. Take a look at Python 2's Unicode HOWTO for more details.
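
For Python 3, a minimal sketch of the same idea follows. It is an illustration, not the asker's code: the helper name wordpunct is made up, and it uses a regular expression of the same shape as the one behind NLTK's wordpunct_tokenize so that it runs without NLTK installed. With NLTK available you would call nltk.tokenize.wordpunct_tokenize(bsm) instead. The last lines show that the UTF-8 byte form is longer than the text, which is why tokenizing raw bytes can cut characters apart.

```python
import re

# In Python 3 a plain string literal is already Unicode.
bsm = 'بسم الله الرحمن الريحم'


def wordpunct(text):
    """Hypothetical stand-in for wordpunct_tokenize: split into runs of
    word characters and runs of punctuation, dropping whitespace."""
    return re.findall(r'\w+|[^\w\s]+', text)


tokens = wordpunct(bsm)
print(' '.join(tokens))

# Each Arabic letter takes more than one byte in UTF-8, so the encoded
# form is longer than the text itself.
raw = bsm.encode('utf-8')
print(len(bsm), len(raw))
```

Because the tokenizer sees whole characters here, joining the tokens with spaces gives back readable Arabic words.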

OTHER TIPS

In addition, if you are reading the Arabic text from a file in Python 2, open it for reading and decode the bytes, for example:

unicode( open('arabic.txt', 'r').read(), 'utf-8')

or, depending on your file's encoding:

unicode( open('arabic.txt', 'r').read(), 'Windows-1256')
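
As a sketch of an alternative, io.open accepts an encoding argument and returns Unicode text directly, on Python 2.6+ as well as Python 3. The temporary file below is created only so the example is self-contained; the path and the choice of UTF-8 are assumptions, so substitute your own file and its real encoding (for example 'windows-1256').

```python
import io
import os
import tempfile

text = u'بسم الله الرحمن الريحم'

# Hypothetical file, written here only so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), 'arabic.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading with an explicit encoding yields Unicode text, not raw bytes.
with io.open(path, 'r', encoding='utf-8') as f:
    loaded = f.read()

print(loaded == text)
```

Passing the wrong encoding either raises UnicodeDecodeError or silently yields the wrong characters, so it is worth checking how the file was saved.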

Licensed under: CC-BY-SA with attribution
