Question

I have some background in Information Retrieval from my master's degree days. Now I want to put that to use in building a document search application for a book that's written in Arabic.

My primary tools are Django and either PostgreSQL or MySQL depending on the suggestions posted here.

I've been developing with Django for 5 years in the US, but have never needed internationalization or any Unicode support. So my problem is: how do I handle Arabic words?

Here's my process:

1) I have a few .html files containing elements that hold lines of Arabic words. I will build a parser/tokenizer/stemmer and store the results in the database.

2) When the user enters a word for search, I'll stem it, and compare it against my database.

What I need help with is the following:

1) Should the stems/words/lines be stored in the database as Arabic words or as Python unicode strings?

2) If I were to store them as Arabic words, which is better, PostgreSQL or MySQL, and how do I support Arabic in either?

3) If I were to store them as unicode strings, will the Django admin display them as Arabic words? If so, that may suffice. Also, can the admin support Arabic? I.e., if I wanted to alter something in the database, could it be done via the admin?

4) How do I get the Django ORM to support storing Arabic words that the parser will spit out?


Solution

I have worked with Django for the past 2 years and constantly used Hebrew text in my applications (both in the HTML and on the server side). I found Django to be fantastic at internationalization and at working with Unicode (more so than Python itself, to be frank).

Just follow these few tips and you'll probably be fine:

  1. In every .py file in your app that contains foreign characters, add a UTF-8 encoding declaration at the top of the file: # encoding=utf-8

  2. On Python 2, prefix every string literal that holds Arabic characters with u so that it is a unicode string rather than a byte string. Keep this in mind whenever you combine strings:

    u'some arabic word'  # works
    u'%s' % word  # works
    'some arabic string' + u'some arabic string'  # fails with UnicodeDecodeError if the byte string holds non-ASCII bytes
    u'some string' + u'some arabic string'  # works
    
  3. When you first create your database, make sure it uses UTF-8 (in MySQL, Database Charset = utf8 and Database Collation = utf8_general_ci should be fine).

  4. Make sure every page that displays Arabic has this meta tag in its HTML (best placed inside the head tag of a base.html file that all templates inherit from): <meta charset='utf-8'>
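
The u prefix in tip 2 matters on Python 2. As a minimal sketch of the same idea on Python 3 (an assumption, since the tips do not name a Python version), every str is already Unicode, and the thing to watch is the boundary between text and bytes. The helper name safe_concat is hypothetical and only there for illustration:

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3: str is Unicode text, bytes is raw encoded data.

word = "كتاب"  # Arabic for "book": 4 Unicode code points

# Encoding turns text into bytes (what is written to a file or socket).
raw = word.encode("utf-8")

# Decoding turns bytes back into text; do it once, at the input boundary.
assert raw.decode("utf-8") == word


def safe_concat(prefix, text):
    """Join two values as text, decoding the prefix first if it is bytes."""
    if isinstance(prefix, bytes):
        prefix = prefix.decode("utf-8")
    return prefix + text


# Each of these Arabic letters takes 2 bytes in UTF-8.
print(len(word), len(raw))  # prints: 4 8
```

Mixing the two types directly (raw + word) raises TypeError on Python 3, which is the stricter successor to the Python 2 failure described in tip 2.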

That should usually be all you need. Foreign characters are often a headache, but not with Django.
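
For tip 3, here is a sketch of how the connection charset might be declared in a Django settings file, assuming a MySQL backend. The database name, user, and password are placeholders, and utf8mb4 is swapped in for the utf8 mentioned in the tip: it is MySQL's 4-byte UTF-8 variant, though plain utf8 also covers Arabic.

```python
# settings.py (fragment): a sketch, assuming MySQL is the chosen backend.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "arabic_search",    # hypothetical database name
        "USER": "django_user",      # hypothetical user
        "PASSWORD": "change-me",    # placeholder
        "HOST": "localhost",
        # Passed through to the MySQL driver as the connection charset.
        "OPTIONS": {"charset": "utf8mb4"},
    }
}
```

With PostgreSQL the encoding is fixed when the database is created (for example CREATE DATABASE ... ENCODING 'UTF8'), so no equivalent option is normally needed in the settings.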

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow