Question

I've implemented a non-blocking reader in Python, and I need to make it more efficient.

The background: I have massive amounts of output that I need to read from one subprocess (started with Popen()) and pass to another thread. Reading the output from that subprocess must not block for more than a few ms (preferably for as little time as is necessary to read available bytes).

Currently, I have a utility class which takes a file descriptor (stdout) and a timeout. I select() and readline(1) until one of three things happens:

  1. I read a newline
  2. my timeout (a few ms) expires
  3. select tells me there's nothing to read on that file descriptor.

Then I return the buffered text to the calling method, which does stuff with it.

Now, for the real question: because I'm reading so much output, I need to make this more efficient. I'd like to do that by asking the file descriptor how many bytes are pending and then readline([that many bytes]). It's supposed to just pass stuff through, so I don't actually care where the newlines are, or even if there are any. Can I ask the file descriptor how many bytes it has available for reading, and if so, how?

I've done some searching, but I'm having a really hard time figuring out what to search for, let alone if it's possible.

Even just a point in the right direction would be helpful.

Note: I'm developing on Linux, but that shouldn't matter for a "Pythonic" solution.


Solution

On Linux, os.pipe() is just a wrapper around pipe(2). Both return a pair of file descriptors. Normally one would use lseek(2) (os.lseek() in Python) to reposition the offset of a file descriptor as a way to find out how much data is available. However, not all file descriptors are capable of seeking.

On Linux, trying lseek(2) on a pipe returns an error; see the manual page. That's because a pipe is essentially a buffer between a producer and a consumer of data, and the size of that buffer is system-dependent.

On Linux, a pipe's buffer is 64 KiB by default, so that is the most data you can have waiting to be read.
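As a side note: since lseek(2) fails on pipes, the usual way on Linux to ask a pipe how many bytes are pending is the FIONREAD ioctl. This is a sketch, not part of the original answer; it is Unix-specific and will not work on Windows:

```python
import array
import fcntl
import os
import termios

def bytes_available(fd):
    """Return the number of bytes waiting to be read from fd.

    Uses the FIONREAD ioctl, which works on pipes (and sockets) on
    Linux and most Unix-like systems; it is not portable to Windows.
    """
    buf = array.array('i', [0])          # mutable int buffer for the result
    fcntl.ioctl(fd, termios.FIONREAD, buf)
    return buf[0]

r, w = os.pipe()
os.write(w, b'hello pipe')
print(bytes_available(r))                # -> 10
```

With the count in hand, a single `os.read(fd, bytes_available(fd))` drains everything that is currently buffered without blocking.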

Edit: If you can change the way your subprocess works, you might consider using a memory mapped file, or a nice big piece of shared memory.

Edit2: Using polling objects is probably faster than select.
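A minimal sketch of that poll-based variant (assuming a readable pipe fd; note that poll() takes its timeout in milliseconds, whereas select() uses seconds):

```python
import os
import select

r, w = os.pipe()
os.write(w, b'ready')

# A poll object is registered once and reused, which avoids rebuilding
# select() argument lists on every call.
poller = select.poll()
poller.register(r, select.POLLIN)

events = poller.poll(10)          # timeout in milliseconds
for fd, event in events:
    if event & select.POLLIN:
        print(os.read(fd, 1024))  # -> b'ready'
```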

Other tips

This question seems to offer a possible solution, though it may require retooling.

Non-blocking read on a subprocess.PIPE in python

Otherwise, I assume you know about reading data N bytes at a time:

all_data = b''                   # Popen pipes yield bytes by default
while True:
    data = pipe.read(1024)       # reads up to 1024 bytes; b'' at end of pipe
    if not data:
        break
    all_data += data
    # add your timeout break here

You can find this out by calling os.fstat(file_descriptor) and checking the st_size property, which for a pipe reports the number of unread bytes sitting in it (this behavior is platform-dependent).

import os
reader_file_descriptor, writer_file_descriptor = os.pipe()
os.write(writer_file_descriptor, b'I am some data')
# st_size on a pipe reflects the buffered bytes on some platforms (e.g. macOS)
readable_bytes = os.fstat(writer_file_descriptor).st_size

I've implemented this based on the idea from spacether's answer:

import select
import os

def readLen(p):
    # works on macOS; might work on Linux, probably not on Windows
    # (you could fall back to returning 1 there)
    size = os.fstat(p.fileno()).st_size
    return size

def readIfAny(p, timeout=1, default=None):
    if select.select([p], [], [], timeout)[0]:
        size = readLen(p)
        if size:
            return p.read(size)
    return default

....

import sys
data = readIfAny(sys.stdin)

Note that I've read in some places that you should avoid reading from and writing to a subprocess pipe directly like this, because it can deadlock; still, this is the safest way I've found so far.
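One common way to sidestep those deadlocks (a sketch under my own assumptions, not part of the answers above) is to drain the pipe from a dedicated thread and hand chunks to the consumer over a queue; the child process used here is just an illustrative stand-in:

```python
import queue
import subprocess
import sys
import threading

def drain(pipe, q):
    # Blocking reads are fine here: they happen off the main thread,
    # so the main thread never blocks on the pipe itself.
    for chunk in iter(lambda: pipe.read(4096), b''):
        q.put(chunk)
    pipe.close()

# Hypothetical child that just writes a few bytes to stdout.
proc = subprocess.Popen(
    [sys.executable, '-c', "import sys; sys.stdout.write('hello')"],
    stdout=subprocess.PIPE)
q = queue.Queue()
threading.Thread(target=drain, args=(proc.stdout, q), daemon=True).start()

proc.wait()
first = q.get(timeout=1)   # -> b'hello'
```

The main thread can then poll the queue with `q.get_nowait()` or a short timeout, so it never waits on the pipe directly.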

Note 2: sys.stdin.read will return b'' or '' on EOF, I think. That doesn't seem to raise any exception, and I still don't really know how to tell when the stream is finished.

Note 3: depending on the mode in which the streams are open, you get bytes or a string. This also works with stdin, stdout, and stderr.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow