As a general rule of thumb, you typically want to decode an encoded string as early as possible, manipulate it as a unicode object, and encode it as late as possible (e.g., just before writing it to a file).
So, for example:
import codecs

with codecs.open("testcodec.txt", "r", "utf-8") as f:
    s = f.read()

# modify s here (it is a unicode object at this point)

with codecs.open("testcodec.txt", "w", "utf-8") as f:
    f.write(s)
As to your question of which way is best: I don't think there is a difference between using the codecs module and calling encode/decode manually. It is a matter of preference; either works.
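For comparison, here is a sketch of the manual approach: plain open in binary mode plus explicit decode/encode, doing the same round trip as the codecs version above. The temporary path and sample content are just for illustration.

```python
import os
import tempfile

# A throwaway file path, just for this sketch.
path = os.path.join(tempfile.mkdtemp(), "testcodec.txt")

# Create a small UTF-8 encoded file to read back.
with open(path, "wb") as f:
    f.write(u"caf\u00e9".encode("utf-8"))

# Decode as early as possible...
with open(path, "rb") as f:
    s = f.read().decode("utf-8")

# ...manipulate the unicode object...
s = s.upper()

# ...and encode as late as possible.
with open(path, "wb") as f:
    f.write(s.encode("utf-8"))
```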
Simply using open, as in your first example, does not work, because Python will then try to encode the string using the default codec (which is ASCII, unless you changed it).
Regarding the question whether you should use unicode strings everywhere:
In principle, yes. If you create a string s = 'asdf', it has type str (you can check this with type(s)); if you write s2 = u'asdf', it has type unicode. Since it is better to always manipulate unicode objects, the latter is recommended.
If you don't want to prepend the u to every string literal, you can use the following import:
from __future__ import unicode_literals
Then s = 'asdf' gives s the type unicode. In Python 3 this is the default, so the import is only needed in Python 2.
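A minimal sketch of the import's effect. On Python 3 the import is a harmless no-op, so the check below accepts either type name:

```python
from __future__ import unicode_literals

s = 'asdf'
# On Python 2 with the import above this prints 'unicode';
# on Python 3, plain str *is* the unicode type, so it prints 'str'.
print(type(s).__name__)
```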
For potential gotchas, you can take a look at Any gotchas using unicode_literals in Python 2.6?. Basically, you don't want to mix UTF-8 encoded byte strings with unicode strings.
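To see that gotcha concretely, here is a hedged sketch: on Python 2 the implicit mix triggers an implicit ASCII decode of the byte string and raises UnicodeDecodeError as soon as it hits a non-ASCII byte, while on Python 3 it raises TypeError outright. Either way, the fix is an explicit decode at the boundary.

```python
raw = u"caf\u00e9".encode("utf-8")   # a UTF-8 encoded byte string
text = u"menu: "                     # a unicode string

try:
    mixed = text + raw               # mixing bytes and unicode: don't do this
except (TypeError, UnicodeDecodeError):
    # Decode explicitly at the boundary instead.
    mixed = text + raw.decode("utf-8")
```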