Python 3: dealing with stripping lines in binary mode

https://stackoverflow.com/questions/12216496

29-06-2021

Question

With the help of SO members I have got as far as the sample code below. The aim is simply to merge the text files from a given folder and its subfolders and store the output as master.txt. However, I occasionally get a traceback; it looks like an error is thrown while reading the file.

Based on the suggestions I received and some research, it seems a good idea either to convert the text files to a uniform Unicode encoding or to process them line by line, so that garbage characters and empty lines are stripped as each line is read.

import shutil
import os.path

root = 'C:\\Dropbox\\test\\'
files = [(path,f) for path,_,file_list in os.walk(root) for f in file_list]

with open('C:\\Dropbox\\Python\\master.txt','wb') as output:
    for path, f_name in files:
        with open(os.path.join(path, f_name), 'rb') as input:
            shutil.copyfileobj(input, output)
        output.write(b'\n') # insert extra newline 

with open('master.txt', 'r') as f:
  lines = f.readlines()
with open('master.txt', 'w') as f:
  f.write("".join(L for L in lines if L.strip()))

Traceback I get:

Traceback (most recent call last):
  File "C:\Dropbox\Python\master1.py", line 14, in <module>
    lines = f.readlines()
  File "C:\PYTHON32\LIB\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 8159: character maps to <undefined>

Solution

You've opened master.txt in text mode. When you then call readlines() on it, Python decodes the contents with your system's default encoding (cp1252 here, according to the traceback). The file is evidently in a different encoding, which is why you get a UnicodeDecodeError.
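The failure can be reproduced without any file, assuming cp1252 is the default encoding as the traceback indicates. The sample bytes below are made up for illustration; only the 0x81 byte comes from the error message.

```python
# Hypothetical sample data: byte 0x81 has no character assigned in cp1252.
data = b"caf\x81\n"

try:
    data.decode("cp1252")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    # exc.start is the offset of the first byte that could not be decoded.
    failing_byte = data[exc.start]
```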

Either open the file in binary mode, or specify the correct encoding.
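A minimal sketch of the binary-mode option, assuming the merged files use an ASCII-compatible encoding (so that newlines and spaces are single bytes). The helper name `strip_blank_lines` and the demo file contents are hypothetical, not part of the original code.

```python
import os
import tempfile


def strip_blank_lines(path):
    """Drop empty and whitespace-only lines from `path`, working on raw bytes."""
    with open(path, "rb") as f:
        lines = f.readlines()  # splits on b"\n"; nothing is decoded
    with open(path, "wb") as f:
        # bytes.strip() removes ASCII whitespace, so blank lines become b""
        f.writelines(line for line in lines if line.strip())


# Demo on a throwaway file containing the byte that cp1252 rejects.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "master.txt")
    with open(demo_path, "wb") as f:
        f.write(b"first\r\n\r\n   \n\x81second\n")
    strip_blank_lines(demo_path)
    with open(demo_path, "rb") as f:
        result = f.read()
```

Because nothing is decoded, the 0x81 byte passes through untouched and line endings are preserved as written. If the real encoding of the inputs is known, the other option is to keep text mode and pass it explicitly, e.g. open(path, 'r', encoding=...).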

Licensed under: CC-BY-SA with attribution
