Why am I getting extra escape characters when I insert unicode characters into sqlite3 databases using Python 2.7?

StackOverflow https://stackoverflow.com/questions/23646282

Question

I query an API and get a json blob with the following value:

{
    ...
    "Attribute" : "Some W\u00e9irdness", 
    ...
}

(The correct value, of course, is 'Some Wéirdness')

I add that value along with some other stuff to a list of fields I want to add to my sqlite3 database. The list looks like this:

[None, 203, None, None, True, u'W\xe9irdness', None, u'Some', None, None, u'Some W\xe9irdness', None, u'Some W\xe9irdness', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

I notice that we've already undergone a switch from \u00e9 to \xe9, and I'm not sure why that is yet, but I was hoping it didn't matter, since it's just a different Unicode encoding.

Before trying to insert into the sqlite table, I 'stringatize' the list (see function below) and make it a tuple:

('', '203', '', '', 'True', 'W\xe9irdness', '', 'Some', '', '', 'Some W\xe9irdness', '', 'Some W\xe9irdness', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')

I then do the insertion:

my_tuple = tuple(val for val in my_utils.stringatize(my_list))

sql = "INSERT OR REPLACE INTO roster VALUES %s" % repr(my_tuple)

cur.execute(sql)

When I retrieve it later with a SELECT statement, the value has an additional escape (backslash) character added:

u'Some W\\xe9irdness'

First, I ALREADY KNOW that I'm not supposed to use string interpolation with sqlite. However, I couldn't figure out how to do it with ?'s when the number of fields per record may change over time and I want the code to be flexible and not have to come back and add question marks in there if I add fields. (If you know a better way to do this, I'm all ears, but it's probably for another post.)

To troubleshoot, I print the formatted insertion sql statement, and I only see ONE backslash:

INSERT OR REPLACE INTO roster VALUES ('', '203', '', '', 'True', 'W\xe9irdness', '', 'Some', '', '', 'Some W\xe9irdness', '', 'Some W\xe9irdness', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')

This is the same way it looked in the list I had above, so I'm perplexed. Perhaps this is getting interpreted as a string with a backslash that must be escaped and the xe9 is just getting treated as ascii text. Here's the stringatize function I'm using to prepare the list for insertion:

def stringatize(cell_list, encoding = 'raw_unicode_escape', delete_quotes = False):
    """
    Converts every 'cell' in a 'row' (generally something extracted from
     a spreadsheet) to a unicode, then returns the list of cells (with all
     strings now, of course).
    """

    stringatized_list = []

    for cell in cell_list:
        if isinstance(cell, (datetime.datetime)):
            new = cell.strftime("%Y-%m-%dT%H:%M:%S")
        elif isinstance(cell, (datetime.date)):
            new = cell.strftime("%Y-%m-%d")
        elif isinstance(cell, (datetime.time)):
            new = cell.strftime("%H:%M:%S")
        elif isinstance(cell, (int, long)):
            new = str(cell)    
        elif isinstance(cell, (float)):    
            new = "%.2f" % cell
        elif cell == None:
            new = ""    
        else:                
            new = cell    

        if delete_quotes:    
            new = new.replace("\"","")   

        my_unicode = new.encode(encoding)    
        stringatized_list.append(my_unicode)

    return stringatized_list

I appreciate any ideas you have for me on this front. The object is to eventually dump this value into an Excel sheet, which does work with Unicode and should therefore display the value correctly.

EDIT: In response to @CL's inquiry, I tried removing the 'encode' line from my stringatize function.

It now ends as follows:

        #my_unicode = new.encode(encoding)
        my_unicode = new

        stringatized_list.append(my_unicode)

    return stringatized_list

The new sql comes out looking like this (and below is the traceback I get when I try to execute that):

INSERT OR REPLACE INTO roster VALUES ('', u'203', u'', u'', 'True', u'W\xe9irdness', '', u'Some', '', '', u'Some W\xe9irdness', '', u'Some W\xe9irdness', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')

Traceback (most recent call last):
  File "test.py", line 80, in <module>
    my_call
  File redacted.py, line 102, in my_function
    cur.execute(sql)
sqlite3.OperationalError: near "'203'": syntax error

I did mean to cast that number to a string. I suspect it has to do with the repr(my_tuple) I'm doing and the u'' not actually symbolizing a unicode anymore.

Solution

"Some W\u00e9irdness"
"Some Wéirdness"

These are equally valid JSON string literal forms of exactly the same value, Some Wéirdness.

u'W\xe9irdness'

I notice that we've already undergone a switch from \u00e9 to \xe9, and I'm not sure why that is yet, but I was hoping it didn't matter, since it's just a different Unicode encoding.

There is no switch and no encoding; the string is still Some Wéirdness.

You just printed the string from Python, and in Python string literals there is a \xNN form that JSON doesn't have, shorthand for \u00NN.
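As a quick check of that claim, here is a minimal sketch that decodes the JSON from the question and compares it with both Python spellings. The variable names are only for illustration.

```python
import json

# The JSON escape \u00e9 decodes to the single character U+00E9.
decoded = json.loads('{"Attribute": "Some W\\u00e9irdness"}')["Attribute"]

# Two spellings of the same Python Unicode string literal.
short_escape = u'Some W\xe9irdness'
long_escape = u'Some W\u00e9irdness'

# All three are the same value.
same_value = (decoded == short_escape == long_escape)

# The accented letter is one character in the string, not four.
char_count = len(decoded)
```

The \xe9 only shows up when Python displays the repr of the string; it is not stored in the value.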

my_tuple = tuple(val for val in my_utils.stringatize(my_list))
sql = "INSERT OR REPLACE INTO roster VALUES %s" % repr(my_tuple)
cur.execute(sql)

Don't do this. A Python tuple literal as produced by repr is not at all the same format as an SQL value list. In particular, SQL string literals do not have any concept of backslash escapes, so the \xE9 that denotes an é in a Python Unicode string literal means something else in SQL: a literal backslash, the letters x and E, and the digit 9.
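The difference can be reproduced with an in-memory database. This is only a sketch: the one-column roster table is a hypothetical stand-in for the real one, and the first INSERT spells out by hand the SQL text that the repr-based code ends up sending.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE roster (name TEXT)')  # hypothetical one-column table

# The SQL text contains backslash, x, e, 9 as four ordinary characters
# (the backslash is doubled here only because this is a Python literal).
cur.execute("INSERT INTO roster VALUES ('Some W\\xe9irdness')")

# The same value passed as a parameter instead of being pasted into the SQL.
cur.execute('INSERT INTO roster VALUES (?)', (u'Some W\xe9irdness',))

rows = [row[0] for row in cur.execute('SELECT name FROM roster ORDER BY rowid')]
conn.close()

# rows[0] holds the literal backslash sequence; rows[1] holds the real e-acute.
```

The first row comes back as u'Some W\\xe9irdness', which is the value seen in the question; the second comes back intact.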

Whilst there are appropriate ways to encode a string to fit in an SQL string literal, you should avoid that because getting it right is not straightforward and getting it wrong is a security issue. Instead, forget ‘stringatizing’ and just pass the raw values to the database as parameters:

cur.execute(
    'INSERT OR REPLACE INTO roster VALUES (?, ?, ?, ?, ....)',
    my_list
)
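If the number of fields may change over time, the placeholder list can be built from the length of the value list, so no question marks need to be added by hand. A minimal sketch, assuming a hypothetical three-column roster table and a shortened my_list:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
# Hypothetical three-column stand-in for the real roster table.
cur.execute('CREATE TABLE roster (id INTEGER PRIMARY KEY, first TEXT, full_name TEXT)')

my_list = [203, u'Some', u'Some W\xe9irdness']

# One "?" per value, so the statement follows the length of the list.
placeholders = ', '.join(['?'] * len(my_list))
sql = 'INSERT OR REPLACE INTO roster VALUES (%s)' % placeholders

# The values themselves still travel as parameters, never inside the SQL text.
cur.execute(sql, my_list)

stored = cur.execute('SELECT full_name FROM roster WHERE id = 203').fetchone()[0]
conn.close()
```

Only the question marks are interpolated into the statement, so this keeps the safety of parameter binding. The sqlite3 module also accepts None as a parameter and stores it as NULL, so empty cells do not need to be turned into strings first.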
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow