Question

I was wondering whether the Python standard library has a function that returns a file's character encoding by looking for the presence of a BOM (byte-order mark).

I've already implemented something, but I'm afraid I might be reinventing the wheel.

Update (based on John Machin's correction):

import codecs

def _get_encoding_from_bom(fd):
    """Return the encoding indicated by a BOM at the start of fd, or None.

    fd must be opened in binary mode so read() returns bytes.
    """
    first_bytes = fd.read(4)  # the longest BOM (UTF-32) is 4 bytes
    fd.seek(0)
    # Ordered longest-first: BOM_UTF32_LE begins with BOM_UTF16_LE, so the
    # UTF-32 marks must be tested before the UTF-16 ones.
    bom_to_encoding = (
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
    )
    for bom, encoding in bom_to_encoding:
        if first_bytes.startswith(bom):
            return encoding
    return None
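A note on the names returned: the endianness-agnostic codecs (`'utf-16'`, `'utf-32'`) and `'utf-8-sig'` consume the BOM during decoding, so the caller doesn't have to skip it manually. A minimal illustration (not part of the original question):

```python
import codecs

# A UTF-16LE payload prefixed with its BOM, as the helper would see it.
data = codecs.BOM_UTF16_LE + 'hi'.encode('utf-16-le')

# The endianness-agnostic 'utf-16' codec detects and consumes the BOM...
assert data.decode('utf-16') == 'hi'

# ...while the explicit 'utf-16-le' codec leaves it in as U+FEFF.
assert data.decode('utf-16-le') == '\ufeffhi'
```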

Solution

Your code has a subtle bug that you may never be bitten by, but it's best that you avoid it.

Your original code iterated over a dictionary's keys. Dictionary iteration order is not guaranteed by Python (before 3.7, not even insertion order was preserved), and in this case the order matters.

codecs.BOM_UTF32_LE is b'\xff\xfe\x00\x00'
codecs.BOM_UTF16_LE is b'\xff\xfe'

If your file is encoded in UTF-32LE but UTF-16LE just happens to be tested first, you will incorrectly state that the file is encoded in UTF-16LE.
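The overlap is easy to check directly (a quick illustration, not part of the original answer):

```python
import codecs

# The UTF-32LE BOM is the UTF-16LE BOM followed by two NUL bytes,
# so a prefix test for UTF-16LE also matches a UTF-32LE file.
assert codecs.BOM_UTF32_LE == b'\xff\xfe\x00\x00'
assert codecs.BOM_UTF16_LE == b'\xff\xfe'
assert codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)
```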

To avoid this, iterate over a sequence that is ordered by BOM length, longest first, as the updated code in the question now does.
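If you'd rather not rely on hand-ordering the tuple, you can sort the candidates by BOM length programmatically. A sketch, not from the original answer:

```python
import codecs

# Candidate (BOM, codec-name) pairs in arbitrary order.
candidates = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
]

# Longest BOMs first, so UTF-32 is always tested before UTF-16.
candidates.sort(key=lambda pair: len(pair[0]), reverse=True)
assert [len(bom) for bom, _ in candidates][:2] == [4, 4]
```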

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow