Question

I am trying to concatenate a number of Hebrew .txt files from a single folder into a single file. The encoding is cp1255 (Hebrew). I specified the encoding when opening each input file, so the open succeeds, but the encoding then fails when trying to write the string to the output file. If I don't specify the encoding in the open call, the open itself fails (on line 7).

dirLoc="source/folder"
import os
files=os.listdir(dirLoc)
for f in files:
    if f.endswith('.txt'):
        print(f)
        data=open(dirLoc+'/'+f, 'r', encoding="cp1255")
        out=open("outPut.txt", 'a')
        for line in data:
            out.write(line)
        data.close()
        out.close()

The error I get is the standard: UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to undefined

Edit: Having played around with it some more, the problem definitely seems to be with writing a Hebrew string to the .txt file. This is true even if I resave the file in a different format (such as ANSI or utf-8) and change the encoding accordingly. The same code works fine with .txt files in English.

Solution

Okay, having played around with this for another day, I found a solution, as follows:

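import os
import codecs

dirLoc='source/folder'
files=os.listdir(dirLoc)
for f in files:
    # Skip non-text files and the output file itself, which lives in the same folder
    if not f.endswith('.txt') or f == 'outPut.txt':
        continue
    try:
        # Read the whole file, decoding it as UTF-8
        with codecs.open(dirLoc+'/'+f, 'r', encoding='utf8') as data:
            data1=data.read()
    except (OSError, UnicodeError):
        print('file '+f+' failed to read')
        continue
    try:
        # Append to the output file, encoding explicitly as UTF-8
        with codecs.open(dirLoc+'/outPut.txt', 'a', encoding='utf8') as out:
            out.write(data1)
    except (OSError, UnicodeError):
        print('file '+f+' failed to write')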

codecs.open lets me specify the encoding for writing as well as for reading (note that you have to import codecs in order to use it). The exception handlers are there because the encoding is still a problem and the occasional file throws an exception. Wrapping the read and the write in try blocks lets me skip any file that fails, instead of aborting the whole run.
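
For what it's worth, in Python 3 the built-in open also accepts an encoding argument in write and append modes, so codecs is not strictly required. When no encoding is given, open falls back to a platform-dependent default, which is the likely reason the output file in the question rejected Hebrew text. Below is a minimal sketch, assuming the sources really are cp1255 and that UTF-8 output is acceptable. The function name concat_text_files and its parameters are hypothetical, not part of the original code.

```python
import os

def concat_text_files(src_dir, out_path, src_encoding="cp1255", out_encoding="utf-8"):
    """Append every .txt file in src_dir to out_path.

    Returns the names of files that could not be decoded with src_encoding.
    """
    failed = []
    out_abs = os.path.abspath(out_path)
    # An explicit encoding on the output side avoids the platform default.
    with open(out_path, "a", encoding=out_encoding) as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            # Skip non-text files and the output file itself.
            if not name.endswith(".txt") or os.path.abspath(path) == out_abs:
                continue
            try:
                with open(path, "r", encoding=src_encoding) as src:
                    text = src.read()
            except UnicodeDecodeError:
                # Record the file and move on instead of aborting the run.
                failed.append(name)
                continue
            out.write(text)
    return failed
```

Calling concat_text_files('source/folder', 'source/folder/outPut.txt') would mirror the loop above. The returned list shows which files need a different src_encoding.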

Licensed under: CC-BY-SA with attribution