Specifying chars in python

https://stackoverflow.com/questions/1925043

20-09-2019
|

Question

I need a functions that iterates over all the lines in the file.
Here's what I have so far:

def LineFeed(file):
    ret = ""
    for byte in file:
        ret = ret + str(byte)
        if str(byte) == '\r':
            yield ret
            ret = ""

All the lines in the file end with \r (not \n), and I'm reading it in "rb" mode, (I have to read this file in binary). The yield doesn't work and returns nothing. Maybe there's a problem with the comparison? I'm just not sure how you represent a byte/char in python.

I'm getting the idea that if you for loop on a "rb" file it still tries to iterate over lines not bytes..., How can I iterate over bytes? My problem is that I don't have standard line endings. Also my file is filled with 0x00 bytes and I would like to get rid of them all, so I think I would need a second yield function, how could I implement that, I just don't know how to represent the 0x00 byte in python or the NULL char.

Solution

If you're in control of how you open the file, I'd recommend opening it with universal newlines, since \r isn't recognized as a linefeed character if you just use 'rb' mode, but it is if you use 'Urb'.

This will only work if you aren't including \n as well as \r in your binary file somewhere, since the distinction between \r and \n is lost when using universal newlines.

Assuming you want your yielded lines to still be \r terminated:

NUL = '\x00'
def lines_without_nulls(path):
    with open(path, 'Urb') as f:
        for line in f:
            yield line.replace(NUL, '').replace('\n', '\r')

OTHER TIPS

I think that you are confused with what "for x in file" does. Assuming you got your handle like "file = open(file_name)", byte in this case will be an entire line, not a single character. So you are only calling yield when the entire line consists of a single carriage return. Try changing "byte" to "line" and iterating over that with a second loop.

Perhaps if you were to explain what this file represents, why it has lots of '\x00', why you think you need to read it in binary mode, we could help you with your underlying problem.

Otherwise, try the following code; it avoids any dependence on (or interference from) your operating system's line-ending convention.

lines = open("the_file", "rb").read().split("\r")
for line in lines:
    process(line)

Edit: the ASCII NUL (not "NULL") byte is "\x00".

So, your problem is iterating over the lines of a file open in binary mode that use '\r' as a line separator. Since the file is in binary mode, you cannot use the universal newline feature, and it turns out that '\r' is not interpreted as a line separator in binary mode.

Reading a file char by char is a terribly inefficient thing to do in Python, but here's how you could iterate over your lines:

def cr_lines(the_file):
    line = []
    while True:
        byte = the_file.read(1)
        if not byte:
            break
        line.append(byte)
        if byte == '\r':
            yield ''.join(line)
            line = []
    if line:
        yield ''.join(line)

To be more efficient, you would need to read bigger chunks of text and handle buffering in your iterator. Keeping in mind that you could get strange bugs if seeking while iterating. Preventing those bugs would require a subclass of file so you can purge the buffer on seek.

Note the use of the ''.join(line) idiom. Accumulating a string with += has terrible performance and is common mistake made by beginning programmers.

Edit:

string1 += string2 string concatenation is slow. Try joining a list of strings.
ddaa is right--You shouldn't need the struct package if the binary file only contains ASCII. Also, my generator returns the string after the final '\r', before EOF. With these two minor fixes, my code is suspiciously similar (practically identical) to this more recent answer.

Code snip:

def LineFeed(f):
    ret = []
    while True:
        oneByte = f.read(1)
        if not oneByte: break
        # Return everything up to, but not including the carriage return
        if oneByte == '\r':
            yield ''.join(ret)
            ret = []
        else:
            ret.append(oneByte)
    if oneByte:
        yield ''.join(ret)
if __name__ == '__main__':
    lf = LineFeed( open('filename','rb') )

    for something in lf:
        doSomething(something)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow