Question

I need to compare a unicode string coming from a utf-8 file with a constant defined in the Python script.

I'm using Python 2.7.6 on Linux.

If I run the script below within Spyder (a Python editor), it works, but if I invoke it from a terminal, the second test fails. Do I need to import/define something in the terminal before invoking the script?

Script ("pythonscript.py"):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv

some_french_deps = []
idata_raw = csv.DictReader(open("utf8_encoded_data.csv", 'rb'), delimiter=";")
for rec in idata_raw:
    depname = unicode(rec['DEP'],'utf-8')
    some_french_deps.append(depname)

test1 = "Tarn"
test2 = "Rhône-Alpes"
if test1==some_french_deps[0]:
  print "Tarn test passed"
else:
  print "Tarn test failed"
if test2==some_french_deps[2]:
  print "Rhône-Alpes test passed"
else:
  print "Rhône-Alpes test failed"

utf8_encoded_data.csv:

DEP
Tarn
Lozère
Rhône-Alpes
Aude

Run output from Spyder editor:

Tarn test passed
Rhône-Alpes test passed

Run output from terminal:

$ ./pythonscript.py 
Tarn test passed
./pythonscript.py:20: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if test2==some_french_deps[2]:
Rhône-Alpes test failed

Solution

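You are comparing a byte string (type str) with a unicode value. When the two types are compared, Python 2 implicitly decodes the byte string to unicode using the default encoding. Spyder has changed that default from ASCII to UTF-8. Your byte strings are UTF-8 encoded, so under Spyder the decoding, and with it the comparison, succeeds. In the terminal the default is still ASCII, the ô cannot be decoded, and Python emits the UnicodeWarning and treats the two values as unequal. The "Tarn" test passes in both environments because it contains only ASCII characters.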
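The implicit conversion can be made visible by decoding the bytes explicitly. This is only a sketch of the effect, not the interpreter's actual code path; the helper name try_decode is made up for illustration, and the example is written to run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-

text = u"Rhône-Alpes"         # unicode value, like the decoded CSV field
raw = text.encode("utf-8")    # UTF-8 byte string, like the plain "..." literal


def try_decode(data, encoding):
    """Hypothetical helper: decode bytes, or return None if that fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Terminal: the default encoding is ASCII, which cannot decode the bytes of ô.
ascii_result = try_decode(raw, "ascii")
# Spyder: the default encoding was switched to UTF-8, so decoding succeeds.
utf8_result = try_decode(raw, "utf-8")

print(ascii_result is None)
print(utf8_result == text)
```

Both lines print True: the ASCII decode fails, while the UTF-8 decode yields a value equal to the unicode string.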

The solution is to not use byte strings; use unicode literals for your two test values instead:

test1 = u"Tarn"
test2 = u"Rhône-Alpes"

Changing the system default encoding is, in my opinion, a terrible idea. Your code should use Unicode correctly instead of relying on implicit conversions; changing the rules of implicit conversion only increases the confusion and does not make the task any easier.
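One way to follow that advice is to decode once, right where the bytes enter the program, and to work with unicode values everywhere after that. A minimal sketch, assuming each row is a dict of UTF-8 byte strings, as csv.DictReader yields in Python 2 for a file opened in 'rb' mode. The helper decode_row is hypothetical, and the sample row stands in for the real file:

```python
# -*- coding: utf-8 -*-


def decode_row(row, encoding="utf-8"):
    """Hypothetical helper: return a copy of row with keys and values decoded."""
    return dict((key.decode(encoding), value.decode(encoding))
                for key, value in row.items())


# Stand-in for one record read from utf8_encoded_data.csv (byte strings).
raw_row = {b"DEP": u"Rhône-Alpes".encode("utf-8")}

row = decode_row(raw_row)
test2 = u"Rhône-Alpes"

# Both sides are unicode, so no implicit conversion takes place.
matches = (row[u"DEP"] == test2)
print(matches)
```

Because every value is decoded explicitly, the result does not depend on the interpreter's default encoding.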

OTHER TIPS

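Just using depname = rec['DEP'] should also work. With the coding declaration in place, the literals in the script are UTF-8 byte strings, and so are the values read from the file, so you are comparing byte strings with byte strings.

If you print some_french_deps[2] on a UTF-8 terminal it will print Rhône-Alpes, and your comparison will work.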

As you are comparing a string object with a unicode object, Python emits this warning.

To fix this, you can write

test1 = "Tarn"
test2 = "Rhône-Alpes"

as

test1 = u"Tarn"
test2 = u"Rhône-Alpes"

where the 'u' prefix indicates that the literal is a unicode object.
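The difference between the two kinds of object shows up in their lengths. A small sketch, with the byte string built by an explicit encode so that it behaves the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-

# u"Rhône-Alpes", with the ô spelled as an escape to make the code point explicit.
as_unicode = u"Rh\u00f4ne-Alpes"
# The bytes a plain "Rhône-Alpes" literal holds in a UTF-8 Python 2 source file.
as_bytes = as_unicode.encode("utf-8")

char_count = len(as_unicode)   # counts characters
byte_count = len(as_bytes)     # counts bytes; ô takes two bytes in UTF-8

print("%d characters, %d bytes" % (char_count, byte_count))
```

The unicode object has 11 characters, while its UTF-8 encoding has 12 bytes.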

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow