Can't convert file from UTF-16 to UTF-8 and remove BOM

https://stackoverflow.com/questions/21270412

30-09-2022

Question

I have a UTF-16 file that I want to convert to UTF-8 with the BOM removed. The code below converts it correctly, but I can't figure out the most efficient way to remove the BOM.

def convert_to_utf8(event):
     blocksize = 1048576
     output_file = add_timestamp(event.pathname)
     with open(event.pathname, 'r') as char_set:
         enc = chardet.detect(char_set.read(blocksize))['encoding']
         print enc

     with codecs.open(event.pathname, 'rb', encoding = enc) as encoded_file:
         with codecs.open(output_file, "w+b", encoding = 'utf-8') as utf8_file:
             while True:
                 content = encoded_file.read(blocksize)
                 if not content:
                     break
                 #if content.startswith(codecs.BOM_UTF8):
                 #    content.replace(codecs.BOM_UTF8, '')
                 utf8_file.write(content)

That's the initial file:

$ file test_16.csv -bi
text/plain; charset=utf-16le

And that's the file after:

$ file -bi test_16-1390343202.csv
text/plain; charset=utf-8

And that's how I check for the BOM:

>>> with open('test_16-1390343202.csv', 'rb') as f:
...     repr(f.readline())

"'\\xef\\xbb\\xbfFOO,BAR,BAZ\\r\\n'"

Solution

You had the right idea with the commented-out code; it just needs a little tweaking. Once you've read the BOM through the codec, it's no longer a 3-byte UTF-8 sequence or even a UTF-16 code unit; it's the single Unicode character U+FEFF.

if content[0] == U'\uFEFF':
    content = content[1:]

Also note that the replace call wouldn't have worked, since it doesn't do an in-place replacement; it can't, because strings in Python are immutable. You would have to assign the result back to content. Since we know the BOM is only a single character, I simplified it with a slice.
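For context, here is a minimal sketch of how that check might sit inside the conversion loop. The function name convert_to_utf8_no_bom and its parameters are hypothetical, and it uses io.open rather than the codecs.open call from the question, so treat it as one possible arrangement rather than the original code. It only inspects the first block, on the assumption that a U+FEFF appearing later in the file is real data and should be kept.

```python
import io

BOM = u'\ufeff'  # the single character a decoded BOM turns into


def convert_to_utf8_no_bom(src_path, dst_path, src_encoding, blocksize=1048576):
    """Re-encode src_path as UTF-8 at dst_path, dropping one leading BOM.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    first_block = True
    # newline='' disables newline translation, so \r\n survives unchanged.
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as src:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            while True:
                content = src.read(blocksize)
                if not content:
                    break
                if first_block:
                    first_block = False
                    # Only the very start of the file can hold a BOM.
                    if content.startswith(BOM):
                        content = content[1:]
                dst.write(content)
```

The check is still needed when the detected encoding is an endian-specific name such as utf-16-le, because that codec passes the BOM through as U+FEFF. Python's plain utf-16 codec consumes the BOM while decoding, in which case the check simply finds nothing to strip.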

OTHER TIPS

Read a single character before looping. If it's not the BOM then write it out, otherwise ignore it.
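A rough sketch of that approach, assuming src and dst are text-mode file objects that have already been opened with the right encodings (the helper name copy_without_bom is made up for illustration):

```python
def copy_without_bom(src, dst, blocksize=1048576):
    """Copy decoded text from src to dst, skipping one leading U+FEFF.

    Hypothetical helper: src and dst are assumed to be text-mode streams.
    """
    first = src.read(1)
    if first != u'\ufeff':
        # Not a BOM, so it is real data (or u'' for an empty file).
        dst.write(first)
    while True:
        content = src.read(blocksize)
        if not content:
            break
        dst.write(content)
```

This keeps the BOM test out of the loop entirely, at the cost of one extra single-character read.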

Licensed under: CC-BY-SA with attribution
