Question

I have a piece of Python code that reads a txt file correctly, but my colleague gave me another set of files that also appear to be txt files. When I run the same code on them, each line is read incorrectly. For the new files, if the line is 240,022414114120,-500,Bauer_HS5,0 it is read as str:2[]4[]0 []0[]2[]2[]4..... The little rectangles between the characters and the leading question mark characters are all invalid characters. After splitting, it is further converted to something like this: [['\xff\xfe2\x004\x000\x00', '\x000\x002\x002\x004\x001\x004\x001\x001\x004\x001\x002\x000\x00', '\x00-\x005\x000\x000\x00',...... However, if I manually create a normal text file and copy/paste the content from the input file into it, the parser reads each line correctly. So I suspect the input files are of a different type than a normal text file, even though their suffix is indeed 'txt'.

The files come from a device that regularly sends files to our server. This parser works fine for another device that does the same thing. And the files from both devices are all of type 'txt'.

Each line is read with: for line in self._infile.xreadlines():

I am very confused about why it behaves this way. My Python code follows.

def __init__(self, infile=sys.stdin, outfile=sys.stdout):
    if isinstance(infile, basestring):
        infile = open(infile)
    if isinstance(outfile, basestring):
        outfile = open(outfile, "w")

    self._infile = infile
    self._outfile = outfile

def sort(self):
    lines = []
    last_second = None

    for line in self._infile.xreadlines():
        line = line.replace('\r\n', '')
        fields = line.split(',')
        if len(fields) < 2:
            continue
        second = fields[1]
        if last_second and second != last_second:
            lines = sorted(lines, self._sort_lines)
            self._outfile.write("".join([','.join(x) for x in lines]))
            #self._outfile.write("\r\n")
            lines = []

        last_second = second
        lines.append(fields)

    if lines:
        lines = sorted(lines, self._sort_lines)
        self._outfile.write("".join([','.join(x) for x in lines]))
        #self._outfile.write("\r\n")

    self._infile.close()
    self._outfile.close()

Solution

The file you described as coming from your colleague starts with "\xff\xfe". These two bytes make up a "byte order mark" (BOM), which indicates that the file is encoded as UTF-16 in little-endian byte order (that is, 16-bit Unicode with the lower byte first). Your Python 2 script reads the file as raw 8-bit bytes without decoding it, so you're seeing lots of extra null characters (the high bytes of the 16-bit characters).
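To make the effect concrete, here is a minimal sketch that builds the first field from the question ("240") as little-endian UTF-16 with a BOM and then looks at it three ways. The variable names are only illustrative, and the latin-1 decode is just a stand-in for "treat every byte as one character", which is effectively what the original script does.

```python
import codecs

line = u"240"

# Little-endian UTF-16 with a byte order mark, like the files from the device.
raw = codecs.BOM_UTF16_LE + line.encode("utf-16-le")

# Treating each byte as one character reproduces the garbage in the question.
as_8bit = raw.decode("latin-1")

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops it.
decoded = raw.decode("utf-16")

# "utf-16-le" assumes the byte order and keeps the BOM as the character U+FEFF.
kept_bom = raw.decode("utf-16-le")
```

Here raw is the byte string '\xff\xfe2\x004\x000\x00', which matches the first field shown in the question.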

I can't speak to how the file got a different encoding. Windows text editors (like notepad.exe) are somewhat notorious for silently reencoding files in unhelpful ways if you're not careful with them, so it may be that your colleague previewed the file in an editor and then saved it before forwarding it on to you.

Anyway, the simplest fix is probably to reencode the file. There are various utilities to do this on various OSs, or you could write your own easily enough. Here's a quick and dirty function to reencode a file in Python (which will hopefully raise an exception if the encoding parameters are wrong, but perhaps not always):

def reencode_file(filename, from_encoding="UTF-16", to_encoding="ascii"):
    # "UTF-16" (unlike "UTF-16-LE") consumes the leading byte order mark, so
    # it does not end up in the text and break the ASCII encode below.
    with open(filename, "rb") as f:
        in_bytes = f.read() # read bytes

    text = in_bytes.decode(from_encoding) # decode to unicode

    out_bytes = text.encode(to_encoding) # reencode to new encoding

    with open(filename, "wb") as f:
        f.write(out_bytes) # write back to the file

If the file you get is going to always be encoded in UTF-16, you could change your regular script to decode it automatically. In Python 2.7, I'd suggest using the io module's open function for this (it is the same code that the regular open uses in Python 3). Note however that the file object returned won't support the xreadlines method which has been deprecated for a long time (just iterate over the file directly instead).
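As a rough sketch of that approach, the following reads a file with io.open and iterates over it directly. The helper names detect_encoding and read_fields are made up for this example, and the BOM check is an extra assumption so that the plain 8-bit files from the other device keep working; if every file is UTF-16 you could pass encoding="utf-16" directly.

```python
import codecs
import io


def detect_encoding(path, default="ascii"):
    """Return "utf-16" if the file starts with a UTF-16 BOM, else `default`."""
    with open(path, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    return default


def read_fields(path):
    """Yield the comma-separated fields of each non-empty line as unicode."""
    with io.open(path, encoding=detect_encoding(path)) as infile:
        for line in infile:  # iterate directly instead of xreadlines()
            line = line.rstrip(u"\r\n")
            if line:
                yield line.split(u",")
```

Note that io.open returns unicode strings in Python 2, so the output file should also be opened with io.open and an explicit encoding if the sorted lines are written back out.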

Licensed under: CC-BY-SA with attribution