Question

I have the feeling that my question is related to Why does takewhile() skip the first line?

I haven't found a satisfactory answer there, though.

My examples below use the following modules

import csv
from itertools import takewhile

Here is my problem. I have a csv file which I want to parse using itertools.

For instance, I want to separate the header from the content. The boundary is spotted by the presence of a keyword in the first column.

Here is an example file.csv:

a, content
b, content
KEYWORD, something else
c, let's continue

The first two lines make up the header of the file. The KEYWORD line separates it from the content, which here is just the last line.

Even if it is not strictly part of the content, I still want to parse the separator row.

with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    header = takewhile(lambda x: x[0] != 'KEYWORD', reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)

I was not expecting this, but the KEYWORD line is skipped, as the following output shows:

['a', ' content']
['b', ' content']
End of header
['c', " let's continue"]

I tried simulating the csv reader to see whether the behavior came from there, but apparently not: the following code produces the same behavior.

l = [['a', 'content'],
     ['b', 'content'],
     ['KEYWORD', 'something else'],
     ['c', "let's continue"]]

i = iter(l)
header = takewhile(lambda x: x[0] != 'KEYWORD', i)
for row in header:
    print(row)
print('End of header')
for row in i:
    print(row)

How can I use the feature of takewhile while preventing the following for loop from skipping the unparsed line?

As I understand it, takewhile in the first for loop calls next on the iterator to test each item; when the test fails, the item has already been consumed and is discarded. The second for loop then calls next again and gets the following value, so the separator row is skipped.


Solution 3

Thanks to @jonrsharpe, I started asking myself whether I could code some trick around this. Here is what I came up with:

class RewindableFile(file):
    """A file subclass (Python 2) whose last lines can be read again after a rewind()."""

    def __init__(self, *args, **kwargs):
        nb_backup = kwargs.pop('nb_backup', 1)  # how many lines can be rewound over
        super(RewindableFile, self).__init__(*args, **kwargs)
        self._nb_backup = nb_backup
        self._backups = []       # most recently read lines
        self._time_anchor = 0    # 0: read normally; negative: replay saved lines

    def next(self):
        if self._time_anchor >= 0:
            # Normal read: fetch the next line from the file and remember it.
            item = super(RewindableFile, self).next()
            self._backup(item)
            return item
        else:
            # We have been rewound: replay a saved line instead.
            item = self._forward()
            return item

    def rewind(self):
        self._time_anchor = self._time_anchor - 1
        time_bound = min(self._nb_backup, len(self._backups))
        if self._time_anchor < -time_bound:
            raise Exception('You have gone too far in history...')

    def __iter__(self):
        return self

    def _backup(self, row):
        # Keep only the nb_backup most recent lines.
        self._backups.append(row)
        extra_items = len(self._backups) - self._nb_backup
        if extra_items > 0:
            del self._backups[0:extra_items]

    def _forward(self):
        # Replay one saved line and move the anchor back towards live reading.
        item = self._backups[self._time_anchor]
        self._time_anchor = self._time_anchor + 1
        return item

And here is how I use it:

with RewindableFile('csv.csv', 'rb') as f:
    def test_kwd_and_rewind(x):
        if x[0] != 'KEYWORD':
            return True
        else:
            # Step back so the KEYWORD line will be read again by the next loop.
            f.rewind()
            return False

    reader = csv.reader(f)
    header = takewhile(test_kwd_and_rewind, reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)

I could also have overloaded the read and readline methods to support the jump back, but I don't need them here.
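
For reference, here is a minimal sketch of the same rewinding idea without subclassing file (so it also runs on Python 3, where the file type is gone): it wraps the csv reader itself instead. RewindableIter and its methods are made-up names for this sketch, not part of the solution above:

import csv
from itertools import takewhile

class RewindableIter(object):
    """Wrap an iterator so the last item can be pushed back once."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._last = None
        self._rewound = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._rewound:
            # Replay the item that was pushed back.
            self._rewound = False
            return self._last
        self._last = next(self._it)
        return self._last

    next = __next__  # Python 2 compatibility

    def rewind(self):
        self._rewound = True

with open('file.csv') as f:  # 'rb' on Python 2
    reader = RewindableIter(csv.reader(f))

    def test_kwd_and_rewind(row):
        if row[0] != 'KEYWORD':
            return True
        reader.rewind()  # push the KEYWORD row back for the loop below
        return False

    for row in takewhile(test_kwd_and_rewind, reader):
        print(row)
    print('End of header')
    for row in reader:
        print(row)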

OTHER TIPS

I think you will have to restructure - takewhile isn't a good fit for what you are doing. The problem is that takewhile has to read the line starting 'KEYWORD' to determine that it has reached a line it shouldn't take, and once the line is read the file's "read head" is at the start of the next line. Similarly, with iter, takewhile has already consumed (but discarded) the line starting 'KEYWORD' when you start for row in i.
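
You can see the consumption with a plain iterator (throwaway numbers, nothing to do with the CSV itself):

from itertools import takewhile

it = iter([1, 2, 3, 4])
print(list(takewhile(lambda x: x < 3, it)))  # [1, 2]
print(list(it))                              # [4] -- the 3 was read, tested and discarded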

One alternative would be something like:

header = []
content = []
target = header
for row in reader:
    if row[0].startswith('KEYWORD'):
        target = content  # switch before appending, so the KEYWORD row lands in content
    target.append(row)
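
With the sample file and csv.reader, header should end up as [['a', ' content'], ['b', ' content']] and content as [['KEYWORD', ' something else'], ['c', " let's continue"]], so the separator row is kept rather than lost.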

You can write your own takewhile that, unlike the itertools version, also yields the first element that fails the test:

def takewhile(predicate, iterable):
    for x in iterable:
        yield x
        if not predicate(x):
            break

test:

>>> list(takewhile(lambda x:x!=3, range(10)))
[0, 1, 2, 3]
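
Applied to the question's file, a possible sketch (reusing the file.csv example and csv.reader from above; note that this calls the custom takewhile just defined, not the itertools one):

import csv

with open('file.csv') as f:  # 'rb' on Python 2
    reader = csv.reader(f)
    # This variant yields the KEYWORD row itself before stopping.
    for row in takewhile(lambda x: x[0] != 'KEYWORD', reader):
        print(row)
    print('End of header')
    for row in reader:  # continues right after the KEYWORD row
        print(row)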

jonrsharpe has it right. This isn't quite a job for takewhile. itertools also has a groupby function which can handle the splitting more easily. The LastHeader class below keeps a record of the last header line passed through the check method, and returns a reference to it each time check is called. This lets you run through the file a single time, without having to backtrack at all.

from itertools import groupby

class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''

    def check(self, line):
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader

with open(fname, 'r') as fobj:
    lastheader = LastHeader(sentinel)
    for headerline, readlines in groupby(fobj, lastheader.check):
        foo(headerline)
        for line in readlines:
            bar(line)

where foo and bar are whatever processing you need to do on the headers and data.
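
For the question's file, a hedged sketch of how this could look, with the sentinel set to 'KEYWORD' (the first group, whose key is the empty string, holds the header lines, and the KEYWORD line opens the second group):

from itertools import groupby

with open('file.csv') as fobj:
    lastheader = LastHeader(sentinel='KEYWORD')
    for headerline, lines in groupby(fobj, lastheader.check):
        print('group key: %r' % headerline)
        for line in lines:
            print('  ' + line.rstrip())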

Licensed under: CC-BY-SA with attribution