Question

I am working with a network library that returns a generator. Each Next() call returns an arbitrary amount of text (as a string); if you simply concatenated the results of every Next() call, the output would look like a standard English text document.

There could be multiple newlines in the string returned from each Next() call, there could be none. The returned string doesn't necessarily end in a newline, i.e. one line of text could be spread across multiple Next() calls.

I am trying to use this data in a second library that needs Next() to return exactly one line of text. It is absolutely critical that I do not read in the entire stream; it can be tens of gigabytes of data.

Is there a built-in library to solve this problem? If not, can someone suggest the best way to either write the generator or an alternative way to solve the problem?


Solution

Write a generator function that pulls the chunks down and splits them into lines for you. Since you won't know if the last line ended in a newline or not, save it and attach it to the next chunk.

def split_by_lines(text_generator):
    last_line = ""
    for chunk in text_generator:
        # Prepend the unfinished line carried over from the previous chunk.
        chunk = last_line + chunk
        chunk_by_line = chunk.split('\n')
        # The final piece may be incomplete, so hold it back for the next chunk.
        last_line = chunk_by_line.pop()
        for line in chunk_by_line:
            yield line
    # The source is exhausted; emit whatever is left over.
    if last_line:
        yield last_line
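
To check the carry-over idea without a real network connection, here is a minimal self-contained sketch. The names `iter_lines` and `fake_chunks` are made up for illustration; `fake_chunks` merely stands in for the network library's generator, whose real API is not shown in the question.

```python
def iter_lines(chunks):
    """Yield complete lines (without the trailing newline) from text chunks."""
    pending = ""  # text received after the last newline seen so far
    for chunk in chunks:
        pending += chunk
        # Everything before the final "\n" is a complete line;
        # the last piece is carried over to the next chunk.
        *complete, pending = pending.split("\n")
        yield from complete
    if pending:
        # The stream ended without a trailing newline.
        yield pending


def fake_chunks():
    # Hypothetical stand-in for the network library's generator.
    yield "first li"
    yield "ne\nsecond line\nthi"
    yield "rd line"


for line in iter_lines(fake_chunks()):
    print(line)
```

Only one chunk plus the current unfinished line is held in memory at any time, so the full stream is never read in at once.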

Other tips

After reading your edit, maybe you could modify the stream object that returns arbitrary amounts of text? Presumably the stream.next() method has some way of generating a string, which it hands back when .next() is called. Could you do something like:

def next(self):
    # Pseudocode: pull more text until the buffer holds at least one newline.
    while '\n' not in self.remaining:
        self.remaining += self.generate_arbitrary_string()
    # Split at the first newline only, so the rest of the buffer keeps its newlines.
    line, self.remaining = self.remaining.split('\n', 1)
    return line

This pseudocode assumes that the stream object generates some arbitrary string with generate_arbitrary_string() and that self.remaining is initialized to an empty string. On your first call of next(), self.remaining is empty, so the loop runs: it concatenates arbitrary strings until a newline character appears. The buffer is then split at the first newline character; the first half is returned and the second half is stored in self.remaining.

On subsequent calls of next(), if self.remaining already contains a newline character, the loop is skipped and the first buffered line is returned straight away. If not, new arbitrary strings are appended to self.remaining as above. Note that next() has to return the line rather than yield it (a yield would turn next() into a generator function), and that this sketch does not handle the end of the stream.
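
If the stream class cannot be modified, the same buffering logic can live in a small wrapper instead. The following is only a sketch under that assumption: `LineReader` is a hypothetical name, and it accepts any iterator of text chunks rather than the actual network library object.

```python
class LineReader:
    """Iterator that returns one line per next() call from text chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._remaining = ""     # text received but not yet returned
        self._exhausted = False  # True once the chunk source has ended

    def __iter__(self):
        return self

    def __next__(self):
        # Keep pulling chunks until the buffer holds at least one full line
        # or the underlying source runs out.
        while "\n" not in self._remaining and not self._exhausted:
            try:
                self._remaining += next(self._chunks)
            except StopIteration:
                self._exhausted = True
        if "\n" in self._remaining:
            # Return the text before the first newline, keep the rest.
            line, _, self._remaining = self._remaining.partition("\n")
            return line
        if self._remaining:
            # Final line with no trailing newline.
            line, self._remaining = self._remaining, ""
            return line
        raise StopIteration


reader = LineReader(["ab", "c\nde\n\nf", "g"])
print(next(reader))
```

Because the wrapper implements `__iter__` and `__next__`, it can be passed to code that calls `next()` on it or loops over it with `for`.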

Licensed under: CC-BY-SA with attribution