Question

I have written my program to read words from a text file and enter them into a sqlite database and also treat them as strings. But I need to enter some words containing German umlauts: ä, ö, ü, ß.

Here is a prepared piece of code:

I tried both `# -*- coding: iso-8859-15 -*-` and `# -*- coding: utf-8 -*-`. Neither made a difference(!)

    # -*- coding: iso-8859-15 -*-
    import sqlite3
    
    dbname = 'sampledb.db'
    filename = 'text.txt'


    con = sqlite3.connect(dbname)
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,name)''')    

    #f=open(filename)
    #text = f.readlines()
    #f.close()

    text = u'süß'

    print (text)
    cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,))       

    con.commit()

    sentence = "The name is: %s" %(text,)

    print (sentence)
    #f.close()    # only needed when the file-reading lines above are uncommented
    con.close()

The above code runs well. But I need to read 'text' from a file containing the word 'süß'. When I uncomment the three lines (f = open(filename) ...) and comment out text = u'süß', it raises the error

    sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.

I tried the codecs module to read the file as UTF-8 and as ISO-8859-15. But I could not decode the result to the string 'süß' that I need to complete my sentence at the end of the code.

Once I tried encoding to UTF-8 before inserting into the database. That worked, but then I could not use the value as a string.

Is there a way I can read süß from a file and use it both for inserting into sqlite and as a string?


More detail:

Here I add more details for clarification. I had used codecs.open before. The text file containing the word süß is saved as UTF-8. Using f = codecs.open(filename, 'r', 'utf-8') and text = f.read(), I read the file as the unicode string u'\ufeffs\xfc\xdf'. Inserting this unicode into sqlite3 works smoothly: cur.execute("insert into table1 (id,name) VALUES (NULL,?)",(text,)).

The problem is here: sentence = "The name is: %s" %(text,) gives u'The name is: \ufeffs\xfc\xdf', and I also need print(text) to output süß, but instead it raises this error: UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>.

Thank you.


Solution 2

I managed to sort out the problem. Thanks for the help.

Here it is:

    # -*- coding: iso-8859-1 -*-

    import sys
    import codecs
    import sqlite3

    f = codecs.open("suess_sweet.txt", "r", "utf-8")    # suess_sweet.txt file contains two
    text_in_unicode = f.read()                          # comma-separated words: süß, sweet
    f.close()

    stdout_encoding = sys.stdout.encoding or sys.getfilesystemencoding()

    con = sqlite3.connect('dict1.db')
    cur = con.cursor()
    cur.execute('''create table IF NOT EXISTS table1 (id INTEGER PRIMARY KEY,German,English)''')

    [ger, eng] = text_in_unicode.split(',')

    cur.execute('''insert into table1 (id,German,English) VALUES (NULL,?,?)''', (ger, eng))

    con.commit()

    sentence = "The German word is: %s" % (ger,)

    print sentence.encode(stdout_encoding)

    con.close()

I got some help from this page (it's in German)

and the output is:

The German word is: ?süß 

One small problem remains: the '?'. I thought that the u'' prefix was replaced by '?' after encoding. sentence gives:

    >>> sentence
    u'The German word is: \ufeffs\xfc\xdf '

and encoded sentence gives:

    >>> sentence.encode(stdout_encoding)
    'The German word is: ?s\xfc\xdf '

so it was not what I thought.

A simple solution that comes to my mind to get rid of the question mark is the replace function:

    sentence = "The German word is: %s" % (ger,)
    to_print = sentence.encode(stdout_encoding)
    to_print = to_print.replace('?', '')

    >>> print(to_print)
    The German word is: süß
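Rather than deleting '?' from the already-encoded byte string, the BOM can be stripped from the unicode string itself before formatting; a small sketch of that idea (the u'\ufeff' literal below is the BOM that codecs.open left in place, and this works the same way in Python 2 and 3):

```python
# The decoded text as codecs.open returned it, BOM included
text = u'\ufeffs\xfc\xdf'

# Drop a leading BOM before formatting, instead of patching the
# encoded byte string afterwards
ger = text.lstrip(u'\ufeff')

sentence = u"The German word is: %s" % (ger,)
```

This way sentence never contains u'\ufeff', so nothing is turned into '?' by the later encode step.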

Thank you SO :)

OTHER TIPS

When you open and read a file, you get 8-bit byte strings, not Unicode. In Python 2, to get Unicode strings instead, open the file with codecs.open:

    f = codecs.open(filename, 'r', 'utf-8')

Hopefully, though, you've moved on to Python 3, where the encoding parameter was added to the built-in open call. There, unless you open with the 'b' flag for binary, you'll always get Unicode strings rather than 8-bit byte strings, and a default encoding is used if you don't specify one.

    f = open(filename, 'r', encoding='utf-8')

Of course depending on how the file was written you may need to use 'iso-8859-15' instead.
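If you aren't sure which of the two encodings was used, one approach (a Python 3 sketch, not part of the original answer; the pair of encodings to try is an assumption taken from the discussion above) is to attempt UTF-8 first and fall back to ISO-8859-15. Note that ISO-8859-15 assigns a character to every byte value, so the fallback always decodes something:

```python
def read_text(filename):
    """Return the file's text, trying UTF-8 before ISO-8859-15.

    A sketch for Python 3; the two candidate encodings are an
    assumption based on the ones discussed in the question.
    """
    for enc in ('utf-8', 'iso-8859-15'):
        try:
            with open(filename, 'r', encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
```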

Edit: one big difference between your test code and the commented-out code is that reading from the file with readlines produces a list, while the test assigns a single string. Perhaps your problem isn't related to Unicode at all. Try making this substitution in your test code and see if it produces the same error:

    text = [u'süß']

Unfortunately I don't have enough experience with SQL in Python to help you further.

Also, when you print a list instead of a single string, non-ASCII characters are shown as their escape sequences, because printing a list prints the repr of each element. To see what the strings really look like, print them one at a time. If you're curious, it's the difference between __str__ and __repr__.
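The two forms can be compared directly in Python 3, where ascii() reproduces the escape-sequence style that Python 2's repr() used when printing list elements (a hypothetical snippet, not from the original answer):

```python
word = 's\xfc\xdf'  # 'süß' written with explicit escapes

# print() uses str(), which shows the characters themselves
assert str(word) == 'süß'

# ascii() escapes non-ASCII characters, the way Python 2's repr()
# did when a list containing the string was printed
assert ascii(word) == "'s\\xfc\\xdf'"
```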

Edit 2: The character u'\ufeff' is known as a Byte Order Mark or BOM and is inserted by some editors to indicate that the file is truly UTF-8. You should get rid of it before you use the string. There should only be one at the very beginning of the file. See e.g. Reading Unicode file data with BOM chars in Python
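The 'utf-8-sig' codec handles the BOM for you on both reading and writing; a small round-trip sketch (the file name here is made up for the demonstration):

```python
import codecs

# Write a UTF-8 file with a BOM, as some editors do
with codecs.open('bom_sample.txt', 'w', 'utf-8-sig') as f:
    f.write(u's\xfc\xdf')

# Plain 'utf-8' keeps the BOM as u'\ufeff' at the start of the text...
with codecs.open('bom_sample.txt', 'r', 'utf-8') as f:
    with_bom = f.read()

# ...while 'utf-8-sig' strips it automatically
with codecs.open('bom_sample.txt', 'r', 'utf-8-sig') as f:
    without_bom = f.read()
```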

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow