Question

Hello Stack Overflow

I am trying to get this code to write the data to stored_output without line 1 (title line)

What I have tried:

with open(filenamex, 'rb') as currentx:
    current_data = currentx.read()
    ## because of my file size I don't want to go through each line via the route shown below to remove the first line (title row)
    for counter, line in enumerate(current_data):
        if counter != 0:
            stored_output.write(line)
    #stored_output.writelines(current_data)

Because of the file size I don't want to do a for loop (for efficiency).

Any constructive comments or code snippets would be appreciated.
Thanks AEA

Was it helpful?

Solution

You can use next() on the file iterator to skip the first line and then write the rest of the content using file.writelines:

with open(filenamex, 'rb') as currentx, open('foobar', 'wb') as data:
    next(currentx)            # drop the first line
    data.writelines(currentx) # write the rest of the content to `data`

Note: Don't use file.read() if you want to read a file line by line; simply iterate over the file object to get one line at a time.
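
As a minimal sketch of that advice (the output filename here is a placeholder), the explicit per-line version of the same idea looks like this; the file object hands you one line per iteration, so only a small buffer is in memory at any time:

with open(filenamex, 'rb') as currentx, open('stored_output.csv', 'wb') as stored_output:
    next(currentx)                 # skip the title row
    for line in currentx:          # the file object yields one line at a time
        stored_output.write(line)  # write each remaining line unchanged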

Other tips

Your first problem is that currentx.read() returns one giant string, so looping over it loops over each of the characters in that string, not each of the lines in the file.

You can read a file into memory as a giant list of strings like this:

current_data = list(currentx)

However, this is almost guaranteed to be slower than iterating over the file a line at a time (because you waste time allocating memory for the whole file, rather than letting Python pick a reasonable-size buffer) or processing the whole file at once (because you're wasting time splitting on lines). In other words, you get the worst of both worlds this way.

So, either keep it as an iterator over lines:

next(currentx) # skip a line
for line in currentx:
    # do something with each line

… or keep it as a string and split off the first line:

current_data = currentx.read()
first, _, rest = current_data.partition(b'\n')  # b'\n' because the file was opened in binary mode
# do something with rest
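
For instance, a sketch of that whole-file route (the output filename is a placeholder, and the b'\n' separator assumes the file was opened in binary mode as above):

with open(filenamex, 'rb') as currentx, open('stored_output.csv', 'wb') as stored_output:
    current_data = currentx.read()                  # the whole file in memory at once
    first, _, rest = current_data.partition(b'\n')  # split off the title row
    stored_output.write(rest)                       # write everything after the first newline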

What if it turns out that reading and writing the whole file at once is too slow? (That is likely: it forces the early blocks out of any cache before they can be written, prevents interleaving, and wastes time allocating memory.) And what if a line at a time is also too slow? (That is unlikely but not impossible: searching for newlines, copying small strings, and looping in Python isn't free; it's just that CPU time is so much cheaper than I/O time that it rarely matters.)

The best you can do is pick an ideal block size and do unbuffered reads and writes yourself, and only waste time searching for newlines until you find the first one.

If you can assume that the first line will never be longer than the block size, this is pretty easy:

BLOCK_SIZE = 8192  # a usually good default; if it matters, test
with open(inpath, 'rb', 0) as infile, open(outpath, 'wb', 0) as outfile:
    buf = infile.read(BLOCK_SIZE)
    first, _, rest = buf.partition(b'\n')  # split the title line off the first block
    outfile.write(rest)
    while True:
        buf = infile.read(BLOCK_SIZE)
        if not buf:
            break
        outfile.write(buf)

If I were going to do that more than once, I'd write a block file iterator function (or, better, look for a pre-tested recipe; they're all over ActiveState and the mailing lists).
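
A minimal sketch of such a helper (the name read_blocks is mine, not a standard function) and how it would fit the header-skipping job:

def read_blocks(infile, block_size=8192):
    """Yield successive fixed-size blocks from an already-open file object."""
    while True:
        buf = infile.read(block_size)
        if not buf:
            return
        yield buf

with open(inpath, 'rb', 0) as infile, open(outpath, 'wb', 0) as outfile:
    blocks = read_blocks(infile)
    for buf in blocks:                      # scan blocks until the first newline turns up
        first, sep, rest = buf.partition(b'\n')
        if sep:
            outfile.write(rest)             # keep whatever followed the title line
            break
    for buf in blocks:                      # then copy the remaining blocks verbatim
        outfile.write(buf)

Unlike the version above, this one also copes with a title line longer than BLOCK_SIZE, because blocks that contain no newline are simply dropped.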

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow