Question

I'm trying to compare two strings. The first one, s1, comes from MongoEngine and the second one, s2, comes from a Django HTTP request.

They look like this:

>>> s1 = product_model.Product.objects.get(pk=1).name
>>> s1
u'Product \xe4 asdf'
>>> s2 = request.POST['name']
>>> s2
'Product \xc3\xa4 asdf'

They contain the same letter, the Swedish 'ä', but MongoEngine's (s1) is a Python unicode string, while Django's (s2) is a Python bytestring holding the UTF-8 encoded character.

I can easily solve this by, e.g., converting the Python unicode string to a bytestring:

>>> s1.encode('utf-8') == s2
True

But I would think the best practice is to have all the Python strings in my system represented the same way, correct?

How can I tell Django to use Python unicode strings instead? Or how can I tell MongoEngine to use UTF-8 encoded Python bytestrings?

Solution

The Django docs say:

General string handling

Whenever you use strings with Django – e.g., in database lookups, template rendering or anywhere else – you have two choices for encoding those strings. You can use Unicode strings, or you can use normal strings (sometimes called “bytestrings”) that are encoded using UTF-8.

In Python 3, the logic is reversed, that is normal strings are Unicode, and when you want to specifically create a bytestring, you have to prefix the string with a ‘b’. As we are doing in Django code from version 1.5, we recommend that you import unicode_literals from the __future__ library in your code. Then, when you specifically want to create a bytestring literal, prefix the string with ‘b’.

Python 2 legacy:

my_string = "This is a bytestring"
my_unicode = u"This is an Unicode string"

Python 2 with unicode literals or Python 3:

from __future__ import unicode_literals

my_string = b"This is a bytestring"
my_unicode = "This is an Unicode string"

If you are on Python 2, you can try that. As I said in my comment:

I would not suggest working with encoded strings. As these slides say (farmdev.com/talks/unicode): "Decode early, Unicode everywhere, encode late". So I would suggest telling Django to use unicode strings, but I am not a Django expert, sorry. My approach: s1 == s2.decode("utf8"), so that you have two Unicode strings to compare.
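To make the "decode early" idea concrete, here is a minimal sketch. The helper name to_text is hypothetical (it is not part of Django or MongoEngine), and it assumes the incoming bytes are UTF-8, as in the question. It should run on both Python 2 and Python 3:

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding it first if it is a bytestring."""
    # On Python 2, `bytes` is an alias for `str`; on Python 3 it is the bytes type.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


s1 = u"Product \xe4 asdf"      # Unicode string, like the one MongoEngine returns
s2 = b"Product \xc3\xa4 asdf"  # UTF-8 bytestring, like the one in the question

# Decode at the boundary, then compare Unicode with Unicode.
same = to_text(s1) == to_text(s2)

# Encode only when the data leaves the program again.
outgoing = to_text(s2).encode("utf-8")
```

Calling something like this once, at the point where the data enters your code, means the rest of the program only ever sees Unicode strings.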

Hope it works

EDIT: I suppose you are using Django's HttpRequest, so from the docs:

HttpRequest.encoding

A string representing the current encoding used to decode form submission data (or None, which means the DEFAULT_CHARSET setting is used). You can write to this attribute to change the encoding used when accessing the form data. Any subsequent attribute accesses (such as reading from GET or POST) will use the new encoding value. Useful if you know the form data is not in the DEFAULT_CHARSET encoding.
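As a rough sketch of how that attribute could be used: the helper below is hypothetical (read_name is not a Django function), and it assumes it is called from a view with a Django HttpRequest and a form field called "name", as in the question. It sets the encoding before the form data is read:

```python
def read_name(request, encoding="utf-8"):
    """Read the 'name' form field after telling the request which encoding to use."""
    # Assumption: `request` is a django.http.HttpRequest passed to a view.
    # Setting the attribute affects subsequent reads of request.GET / request.POST.
    request.encoding = encoding
    return request.POST["name"]
```

The value you get back can then be compared directly with the string from MongoEngine, provided both sides are Unicode strings.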

Licensed under: CC-BY-SA with attribution