Question

I have a Haskell application that, as one of many steps, needs to store and retrieve raw binary blob data in a database. I'm not entirely opposed to storing that data in plain disk files instead, but that leads to an additional round of permissions issues, so for now I want to go with the database.

I've created a table with a column of type bytea.

I have a lazy ByteString in memory.

When I make a call like this

run conn "INSERT INTO documents VALUES (?)" [toSql $ rawData mydoc]

postgres gets a bit angry at the data. The exact error message is

invalid byte sequence for encoding "UTF8": 0xcf72

I also know beyond doubt that I have NUL values in the data stream. So, with all of that in mind, what is the correct way to encode the data safely for insertion?


Updated

Here is the description for my table

db=> \d+ documents
                          Table "public.documents"
     Column      |            Type             | Modifiers | Storage  | Description 
-----------------+-----------------------------+-----------+----------+-------------
 id              | character varying(16)       | not null  | extended | 
 importtime      | timestamp without time zone | not null  | plain    | 
 filename        | character varying(255)      | not null  | extended | 
 data            | bytea                       | not null  | extended | 
 recordcount     | integer                     | not null  | plain    | 
 parsesuccessful | boolean                     | not null  | plain    | 
Indexes:
    "documents_pkey" PRIMARY KEY, btree (id)

This is the full text of a module that demonstrates the current problem I'm having after adding jamsdidh's code. My error message has changed from the encoding problem above to "invalid input syntax for type bytea".

module DBMTest where

import qualified Data.Time.Clock as Clock
import Database.HDBC.PostgreSQL
import Database.HDBC
import Data.ByteString.Internal
import Data.ByteString hiding (map)
import Data.Char
import Data.Word8
import Numeric

exampleData = pack ([0..255] :: [Word8]) :: ByteString  -- every byte value, including NUL; 65536 would overflow Word8

safeEncode :: ByteString -> ByteString
safeEncode x = pack (convert' =<< unpack x)
    where
    convert' :: Word8 -> [Word8]
    convert' 92 = [92, 92]
    convert' x | x >= 32 && x < 128 = [x]
    convert' x = 92:map c2w (showIntAtBase 8 intToDigit x "")

runTest = do
    conn <- connectPostgreSQL "dbname=db"
    t <- Clock.getCurrentTime
    withTransaction conn
        (\conn -> run conn
            "INSERT INTO documents (id, importTime, filename, data, recordCount, parseSuccessful) VALUES (?, ?, ?, ?, ?, ?)"
            [toSql (15 :: Int),
             toSql t,
             toSql ("Demonstration data" :: String),
             toSql $ safeEncode exampleData,
             toSql (15 :: Int),
             toSql (True :: Bool)])

Solution

I believe this is a bug in HDBC-postgresql. I can explain why I think so, and can give you a workaround that I put together and tested.


I would expect HDBC-postgresql to convert a bytestring to the appropriate format for insertion, but you can quickly verify that it instead expects the bytestring to already hold the octal-backslash-escaped form of the data. For instance,

run conn "INSERT INTO documents VALUES (?)" [toSql $ B.pack [92, 0x31, 0x30, 0x31]]

inserts the single character 'A' into the database! This only makes sense once you realize that [92, 0x31, 0x30, 0x31] is the ASCII representation of "\101", and "\101" is the octal escape for 'A'. Because the octal-backslash-escaped format passes values in the range 32-127 straight through (see the link Richard Huxton supplied in the comments for the details), the insert query does in fact work properly for standard English text, and the bug could easily have gone unnoticed.
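You can check the octal arithmetic in GHCi (a quick sanity check, nothing HDBC-specific):

ghci> 0o101            -- Haskell's literal syntax for octal 101
65
ghci> toEnum 65 :: Char
'A'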

run conn "INSERT INTO documents VALUES (?)" [toSql $ B.pack [65]]

also inserts 'A'. Values above 127 aren't guaranteed to round-trip, because they are interpreted according to the character encoding in use. If you look at the HDBC-postgresql code, or at the logs of a query, you can see that it sets the variable 'client_encoding' to utf8. The data coming in from the bytestring is therefore expected to be valid UTF-8, and PostgreSQL complains when it sees a byte sequence that cannot occur in UTF-8.
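That is exactly what the 0xcf72 in your error message is: 0xCF opens a two-byte UTF-8 sequence, so the next byte must be a continuation byte in the range 0x80-0xBF, and 0x72 is not. You can reproduce the failure outside the database with the text package (purely an illustration, not part of the fix):

import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- decodeUtf8' returns Left (a UnicodeException) on invalid input,
-- for the same reason the server rejects these two bytes
main :: IO ()
main = print (TE.decodeUtf8' (B.pack [0xCF, 0x72]))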

The proper fix would be to wait for the HDBC-postgresql maintainers to fix the bug, but in the meantime you can use this code as a workaround:

import qualified Data.ByteString as B
import Data.ByteString.Internal (c2w)
import Data.Word (Word8)
import Text.Printf (printf)

-- Escape a raw ByteString into PostgreSQL's bytea "escape" input format:
-- backslash is doubled, printable ASCII passes through unchanged, and
-- every other byte becomes a backslash followed by three octal digits.
convert :: B.ByteString -> B.ByteString
convert x = B.pack (convert' =<< B.unpack x)
  where
    convert' :: Word8 -> [Word8]
    convert' 92 = [92, 92]                      -- backslash must be doubled
    convert' x | x >= 32 && x < 128 = [x]       -- printable ASCII as-is
    convert' x = 92 : map c2w (printf "%03o" x) -- e.g. 0 becomes "\000"

Note the "%03o" format: the bytea escape syntax requires exactly three octal digits, which is why the unpadded digits produced by showIntAtBase in your updated module triggered the "invalid input syntax for type bytea" error. With convert in place, you can just use

run conn "INSERT INTO documents VALUES (?)" [toSql $ convert $ rawData mydoc]
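Reading the data back involves the same representation in reverse: if the server returns bytea in the escape output format (servers before PostgreSQL 9.0 always did; 9.0 and later default to bytea_output = 'hex'), the column comes back as escaped text and you need the inverse transformation. Here is a sketch of a matching decoder under that assumption; unconvert and octalValue are names I made up, not part of HDBC:

unconvert :: B.ByteString -> B.ByteString
unconvert = B.pack . go . B.unpack
  where
    go :: [Word8] -> [Word8]
    go (92:92:rest)    = 92 : go rest                -- "\\" -> one backslash
    go (92:a:b:c:rest) = octalValue a b c : go rest  -- "\nnn" -> one byte
    go (x:rest)        = x : go rest                 -- everything else as-is
    go []              = []
    -- combine three ASCII octal digits into a byte, e.g. "101" -> 65
    octalValue :: Word8 -> Word8 -> Word8 -> Word8
    octalValue a b c = (digit a * 8 + digit b) * 8 + digit c
      where digit w = w - 48                         -- '0' is ASCII 48

The two functions are inverses, so unconvert (convert exampleData) gives back exampleData unchanged.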
Licensed under: CC-BY-SA with attribution