Question

I have the feeling that my question is related to Why does takewhile() skip the first line?

I haven't found a satisfactory answer there, though.

My examples below use the following modules

import csv
from itertools import takewhile

Here is my problem. I have a csv file which I want to parse using itertools.

For instance, I want to separate the header from the content. The boundary is spotted by the presence of a keyword in the first column.

Here is an example file.csv:

a, content
b, content
KEYWORD, something else
c, let's continue

The first two lines make up the header of the file. The KEYWORD line separates it from the content, which here is just the last line.

Even if it is not strictly part of the content, I still want to parse the separator row.

with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    header = takewhile(lambda x: x[0] != 'KEYWORD', reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)

I was not expecting this, but the KEYWORD line is skipped, as the following output shows:

['a', ' content']
['b', ' content']
End of header
['c', " let's continue"]

I tried simulating the csv reader to see whether the behavior came from there, but apparently not: the following code produces the same behavior.

l = [['a', 'content'],
     ['b', 'content'],
     ['KEYWORD', 'something else'],
     ['c', "let's continue"]]

i = iter(l)
header = takewhile(lambda x: x[0] != 'KEYWORD', i)
for row in header:
    print(row)
print('End of header')
for row in i:
    print(row)

How can I use the feature of takewhile while preventing the following for loop from skipping the unparsed line?

As I understand it, takewhile in the first for loop calls next on the iterator to test each item; when the test fails, the item has already been consumed and is discarded. The second for loop then calls next again and gets the following value, so the separator row is skipped.


Solution 3

Thanks to @jonrsharpe, I started asking myself whether I could code some trick around this. Here is what I came up with:

class RewindableFile(file):
    """A file subclass (Python 2) whose last lines can be read again after a rewind()."""

    def __init__(self, *args, **kwargs):
        nb_backup = kwargs.pop('nb_backup', 1)  # how many lines can be rewound over
        super(RewindableFile, self).__init__(*args, **kwargs)
        self._nb_backup = nb_backup
        self._backups = []       # most recently read lines
        self._time_anchor = 0    # 0: read normally; negative: replay saved lines

    def next(self):
        if self._time_anchor >= 0:
            # Normal read: fetch the next line from the file and remember it.
            item = super(RewindableFile, self).next()
            self._backup(item)
            return item
        else:
            # We have been rewound: replay a saved line instead.
            item = self._forward()
            return item

    def rewind(self):
        self._time_anchor = self._time_anchor - 1
        time_bound = min(self._nb_backup, len(self._backups))
        if self._time_anchor < -time_bound:
            raise Exception('You have gone too far in history...')

    def __iter__(self):
        return self

    def _backup(self, row):
        # Keep only the nb_backup most recent lines.
        self._backups.append(row)
        extra_items = len(self._backups) - self._nb_backup
        if extra_items > 0:
            del self._backups[0:extra_items]

    def _forward(self):
        # Replay one saved line and move the anchor back towards live reading.
        item = self._backups[self._time_anchor]
        self._time_anchor = self._time_anchor + 1
        return item

And here is how I use it:

with RewindableFile('csv.csv', 'rb') as f:
    def test_kwd_and_rewind(x):
        if x[0] != 'KEYWORD':
            return True
        else:
            # Step back so the KEYWORD line will be read again by the next loop.
            f.rewind()
            return False

    reader = csv.reader(f)
    header = takewhile(test_kwd_and_rewind, reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)

I could also have overloaded the read and readline methods to support the jump back, but I don't need them here.
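
For reference, here is a minimal sketch of the same rewinding idea without subclassing file (so it also runs on Python 3, where the file type is gone): it wraps the csv reader itself instead. RewindableIter and its methods are made-up names for this sketch, not part of the solution above:

import csv
from itertools import takewhile

class RewindableIter(object):
    """Wrap an iterator so the last item can be pushed back once."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._last = None
        self._rewound = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._rewound:
            # Replay the item that was pushed back.
            self._rewound = False
            return self._last
        self._last = next(self._it)
        return self._last

    next = __next__  # Python 2 compatibility

    def rewind(self):
        self._rewound = True

with open('file.csv') as f:  # 'rb' on Python 2
    reader = RewindableIter(csv.reader(f))

    def test_kwd_and_rewind(row):
        if row[0] != 'KEYWORD':
            return True
        reader.rewind()  # push the KEYWORD row back for the loop below
        return False

    for row in takewhile(test_kwd_and_rewind, reader):
        print(row)
    print('End of header')
    for row in reader:
        print(row)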

OTHER TIPS

I think you will have to restructure - takewhile isn't a good fit for what you are doing. The problem is that takewhile has to read the line starting 'KEYWORD' to determine that it has reached a line it shouldn't take, and once the line is read the file's "read head" is at the start of the next line. Similarly, with iter, takewhile has already consumed (but discarded) the line starting 'KEYWORD' when you start for row in i.
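
You can see the consumption with a plain iterator (throwaway numbers, nothing to do with the CSV itself):

from itertools import takewhile

it = iter([1, 2, 3, 4])
print(list(takewhile(lambda x: x < 3, it)))  # [1, 2]
print(list(it))                              # [4] -- the 3 was read, tested and discarded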

One alternative would be something like:

header = []
content = []
target = header
for row in reader:
    if row[0].startswith('KEYWORD'):
        target = content  # switch before appending, so the KEYWORD row lands in content
    target.append(row)
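
With the sample file and csv.reader, header should end up as [['a', ' content'], ['b', ' content']] and content as [['KEYWORD', ' something else'], ['c', " let's continue"]], so the separator row is kept rather than lost.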

You can write your own takewhile that, unlike the itertools version, also yields the first element that fails the test:

def takewhile(predicate, iterable):
    for x in iterable:
        yield x
        if not predicate(x):
            break

test:

>>> list(takewhile(lambda x:x!=3, range(10)))
[0, 1, 2, 3]
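
Applied to the question's file, a possible sketch (reusing the file.csv example and csv.reader from above; note that this calls the custom takewhile just defined, not the itertools one):

import csv

with open('file.csv') as f:  # 'rb' on Python 2
    reader = csv.reader(f)
    # This variant yields the KEYWORD row itself before stopping.
    for row in takewhile(lambda x: x[0] != 'KEYWORD', reader):
        print(row)
    print('End of header')
    for row in reader:  # continues right after the KEYWORD row
        print(row)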

jonrsharpe has it right. This isn't quite a job for takewhile. itertools also has a groupby function which can handle the splitting more easily. The LastHeader class below keeps a record of the last header line passed through the check method, and returns a reference to it each time check is called. This lets you run through the file a single time, without having to backtrack at all.

from itertools import groupby

class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''

    def check(self, line):
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader

with open(fname, 'r') as fobj:
    lastheader = LastHeader(sentinel)
    for headerline, readlines in groupby(fobj, lastheader.check):
        foo(headerline)
        for line in readlines:
            bar(line)

where foo and bar are whatever processing you need to do on the headers and data.
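
For the question's file, a hedged sketch of how this could look, with the sentinel set to 'KEYWORD' (the first group, whose key is the empty string, holds the header lines, and the KEYWORD line opens the second group):

from itertools import groupby

with open('file.csv') as fobj:
    lastheader = LastHeader(sentinel='KEYWORD')
    for headerline, lines in groupby(fobj, lastheader.check):
        print('group key: %r' % headerline)
        for line in lines:
            print('  ' + line.rstrip())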

Licensed under: CC-BY-SA with attribution