shutil.copyfileobj method in python copies BOM character also while merging files

Question

shutil.copyfileobj() copies all data, regardless of what it contains. A BOM is just data in the file; shutil is not, and will not be, aware of such file-format-specific details.

You can easily skip the BOM yourself and still leave the bulk of the copying to shutil.copyfileobj(). For example, for UTF-16 little-endian input files opened in binary mode:

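import codecs
import shutil

# source_fds_list: the source files, already opened in binary mode ('rb')
for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        with fd:
            start = fd.read(2)
            if start != codecs.BOM_UTF16_LE:
                # Not a BOM, so these 2 bytes are real data; keep them
                destn_fd.write(start)
            shutil.copyfileobj(fd, destn_fd)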

Because the initial 2 bytes have already been read from the source file, shutil.copyfileobj() carries on from that position and reads everything else in the file, skipping the BOM. All shutil.copyfileobj() does under the hood is call data = source.read(buffer) and destination.write(data) in a loop, anyway.
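To make that concrete, here is a rough sketch of such a loop. It is not the actual standard-library source; copy_stream and its buffer_size default are names and values made up for illustration, and the in-memory streams stand in for real files.

```python
import io


def copy_stream(source, destination, buffer_size=64 * 1024):
    """Simplified stand-in for shutil.copyfileobj() (hypothetical helper).

    Copies from the source's *current* position to the end of the stream.
    """
    while True:
        data = source.read(buffer_size)
        if not data:  # an empty read means end of file
            break
        destination.write(data)


# In-memory example: a UTF-16 LE BOM followed by "hi" encoded as UTF-16 LE.
src = io.BytesIO(b'\xff\xfeh\x00i\x00')
dst = io.BytesIO()

src.read(2)            # consume the 2-byte BOM first
copy_stream(src, dst)  # the copy then starts right after it
```

Because the copy only ever calls read() on the source, whatever you consumed beforehand never reaches the destination unless you write it yourself.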

If you don't know the codecs used for the input files, you are stuck with heuristics. You can test for the various codecs BOM constants, but that opens up the possibility of false positives: a file encoded with a codec other than UTF-* whose initial bytes happen to look like a BOM. Such a test could look like this:

for fd in source_fds_list:
    with open(destination_url, 'ab') as destn_fd:
        with fd:
            start = fd.read(4)

            # Test UTF-32 first: its LE BOM starts with the UTF-16 LE BOM bytes
            if start in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE):
                # UTF-32 BOM, skip all 4 bytes
                start = b''
            elif start[:3] == codecs.BOM_UTF8:
                # UTF-8 BOM, skip 3 bytes
                start = start[3:]
            elif start[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
                # UTF-16 BOM, skip 2 bytes
                start = start[2:]
            # Write the bytes already read (minus any skipped BOM bytes)
            destn_fd.write(start)

            shutil.copyfileobj(fd, destn_fd)
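If you need this in more than one place, the same heuristic can be pulled into a small helper. This is only a sketch: BOMS, strip_bom and append_without_bom are names invented here, and the in-memory streams stand in for your real files.

```python
import codecs
import io
import shutil

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def strip_bom(start):
    """Return start without a leading BOM, or unchanged if none matches."""
    for bom in BOMS:
        if start.startswith(bom):
            return start[len(bom):]
    return start


def append_without_bom(source, destination):
    """Copy a binary file object to destination, dropping a leading BOM."""
    destination.write(strip_bom(source.read(4)))
    shutil.copyfileobj(source, destination)


# Merge one UTF-8 file with a BOM and one without.
merged = io.BytesIO()
for payload in (codecs.BOM_UTF8 + b'one\n', b'two\n'):
    append_without_bom(io.BytesIO(payload), merged)

# False positive: Latin-1 text starting with U+00FF U+00FE encodes to the
# same two bytes as a UTF-16 LE BOM, so those characters are lost.
latin1 = '\u00ff\u00fe!'.encode('latin-1')
stripped = strip_bom(latin1)
```

The last two lines show the false-positive risk mentioned above: the helper cannot tell real Latin-1 text from a BOM, so only use it when you are reasonably sure the inputs are Unicode-encoded.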


I have a few solutions for the above scenario.