Question

I have written my program to read words from a text file and enter them into a sqlite database and also treat them as strings. But I need to enter some words containing German umlauts: ä, ö, ü, ß.

Here is a prepared piece of code:

I tried both `# -*- coding: iso-8859-15 -*-` and `# -*- coding: utf-8 -*-`. Neither made a difference(!)

    # -*- coding: iso-8859-15 -*-
    import sqlite3
    
    dbname = 'sampledb.db'
    filename = 'text.txt'


    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')    

    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))       

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    #f.close()    # only needed when the file-reading lines above are uncommented
    con.close()

The above code runs well. But I need to read 'text' from a file containing the word 'süß'. When I uncomment the three lines (f = open(filename) ...) and comment out text = u'süß', it raises the error

    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.

I tried the codecs module to read the file as UTF-8 and as ISO-8859-15. But I could not decode the result to the string 'süß' that I need to complete my sentence at the end of the code.

Once I tried encoding to UTF-8 before inserting into the database. That worked, but then I could not use the value as a string.

Is there a way I can read süß from a file and use it both for inserting into sqlite and as a string?


More detail:

Here I add more details for clarification. I had used codecs.open before. The text file containing the word süß is saved as UTF-8. Using f = codecs.open(filename, 'r', 'utf-8') and text = f.read(), I read the file as the unicode string u'\ufeffs\xfc\xdf'. Inserting this unicode into sqlite3 works smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, but instead it raises this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thank you.


Solution 2

I managed to sort out the problem. Thanks for the help.

Here it is:

    # -*- coding: iso-8859-1 -*-

    import sys
    import codecs
    import sqlite3

    f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
    text_in_unicode = f.read()                          # comma-separated words: süß, sweet
    f.close()

    stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

    con = sqlite3.connect('dict1.db')
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')

    [ger, eng] = text_in_unicode.split(',')

    cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''', (ger, eng))

    con.commit()

    sentence = "The German word is: %s" % (ger,)

    print sentence.encode(stdout_encoding)

    con.close()

I got some help from this page (it's in German)

and the output is:

The German word is: ?süß 

One small problem remains: the '?'. I thought that the u'' prefix was replaced by '?' after encoding. sentence gives:

    >>> sentence
    u'The German word is: \ufeffs\xfc\xdf '

and encoded sentence gives:

    >>> sentence.encode(stdout_encoding)
    'The German word is: ?s\xfc\xdf '

so it was not what I thought.

A simple solution that comes to my mind to get rid of the question mark is the replace function:

    sentence = "The German word is: %s" % (ger,)
    to_print = sentence.encode(stdout_encoding)
    to_print = to_print.replace('?', '')

    >>> print(to_print)
    The German word is: süß
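Rather than deleting '?' from the already-encoded byte string, the BOM can be stripped from the unicode string itself before formatting; a small sketch of that idea (the u'\ufeff' literal below is the BOM that codecs.open left in place, and this works the same way in Python 2 and 3):

```python
# The decoded text as codecs.open returned it, BOM included
text = u'\ufeffs\xfc\xdf'

# Drop a leading BOM before formatting, instead of patching the
# encoded byte string afterwards
ger = text.lstrip(u'\ufeff')

sentence = u"The German word is: %s" % (ger,)
```

This way sentence never contains u'\ufeff', so nothing is turned into '?' by the later encode step.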

Thank you SO :)

OTHER TIPS

When you open and read a file, you get 8-bit byte strings, not Unicode. In Python 2, to get Unicode strings instead, open the file with codecs.open:

    f = codecs.open(filename, 'r', 'utf-8')

Hopefully, though, you've moved on to Python 3, where the encoding parameter was added to the built-in open call. There, unless you open with the 'b' flag for binary, you'll always get Unicode strings rather than 8-bit byte strings, and a default encoding is used if you don't specify one.

    f = open(filename, 'r', encoding='utf-8')

Of course depending on how the file was written you may need to use 'iso-8859-15' instead.
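If you aren't sure which of the two encodings was used, one approach (a Python 3 sketch, not part of the original answer; the pair of encodings to try is an assumption taken from the discussion above) is to attempt UTF-8 first and fall back to ISO-8859-15. Note that ISO-8859-15 assigns a character to every byte value, so the fallback always decodes something:

```python
def read_text(filename):
    """Return the file's text, trying UTF-8 before ISO-8859-15.

    A sketch for Python 3; the two candidate encodings are an
    assumption based on the ones discussed in the question.
    """
    for enc in ('utf-8', 'iso-8859-15'):
        try:
            with open(filename, 'r', encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
```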

Edit: one big difference between your test code and the commented-out code is that reading from the file with readlines produces a list, while the test assigns a single string. Perhaps your problem isn't related to Unicode at all. Try making this substitution in your test code and see if it produces the same error:

    text = [u'süß']

Unfortunately I don't have enough experience with SQL in Python to help you further.

Also, when you print a list instead of a single string, non-ASCII characters are shown as their escape sequences, because printing a list prints the repr of each element. To see what the strings really look like, print them one at a time. If you're curious, it's the difference between __str__ and __repr__.
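The two forms can be compared directly in Python 3, where ascii() reproduces the escape-sequence style that Python 2's repr() used when printing list elements (a hypothetical snippet, not from the original answer):

```python
word = 's\xfc\xdf'  # 'süß' written with explicit escapes

# print() uses str(), which shows the characters themselves
assert str(word) == 'süß'

# ascii() escapes non-ASCII characters, the way Python 2's repr()
# did when a list containing the string was printed
assert ascii(word) == "'s\\xfc\\xdf'"
```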

Edit 2: The character u'\ufeff' is known as a Byte Order Mark or BOM and is inserted by some editors to indicate that the file is truly UTF-8. You should get rid of it before you use the string. There should only be one at the very beginning of the file. See e.g. Reading Unicode file data with BOM chars in Python
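The 'utf-8-sig' codec handles the BOM for you on both reading and writing; a small round-trip sketch (the file name here is made up for the demonstration):

```python
import codecs

# Write a UTF-8 file with a BOM, as some editors do
with codecs.open('bom_sample.txt', 'w', 'utf-8-sig') as f:
    f.write(u's\xfc\xdf')

# Plain 'utf-8' keeps the BOM as u'\ufeff' at the start of the text...
with codecs.open('bom_sample.txt', 'r', 'utf-8') as f:
    with_bom = f.read()

# ...while 'utf-8-sig' strips it automatically
with codecs.open('bom_sample.txt', 'r', 'utf-8-sig') as f:
    without_bom = f.read()
```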

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow