Question

I am using Avro 1.4.0 to read some data out of S3 via the Python Avro bindings and the boto S3 library. When I open an avro.datafile.DataFileReader on the file-like objects returned by boto, it fails immediately when it tries to seek(). For now I am working around this by reading the S3 objects into temporary files.

I would like to be able to stream through any Python object that supports read(). Can anybody provide advice?

Solution

I am not entirely sure about this, so it may not be the answer. My impression was that

diter = datafile.DataFileReader(..)

returns an iterator, so that you could do the following:

for data in diter:
    ....

Correct me if I am wrong here.

Revisiting my answer:

You are right: datafile.DataFileReader does not work with a reader on which seek() fails.

It uses avro.io.BinaryDecoder, which expects a reader that supports read, seek, and tell:

class BinaryDecoder(object):
    """Read leaf values."""
    def __init__(self, reader):
        """
        reader is a Python object on which we can call read, seek, and tell.
        """
        self._reader = reader

What you can do is create your own reader class that provides these functions (read, seek, and tell) but internally uses the boto S3 library to read the data.
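
A minimal sketch of such a reader class, under some assumptions: the class name SeekableReadWrapper and its helper _fill_to are made up for illustration, and the only thing it requires of the wrapped object is a read(n) method that returns an empty string at end of data. It keeps every byte it has fetched in an in-memory buffer so that seek and tell can be answered from there. Note that a seek relative to the end has to drain the source first, so if the caller seeks to the end (for example to find the file length), the whole object ends up buffered and this is no longer true streaming.

```python
import io


class SeekableReadWrapper(object):
    """Hypothetical adapter: adds seek() and tell() to an object that only has read()."""

    CHUNK = 65536  # minimum number of bytes requested from the source per call

    def __init__(self, source):
        self._source = source        # any object exposing read(n)
        self._buffer = io.BytesIO()  # everything fetched from the source so far
        self._size = 0               # number of bytes currently in the buffer
        self._pos = 0                # logical position reported by tell()
        self._exhausted = False      # True once the source has returned no data

    def _fill_to(self, target):
        """Fetch from the source until `target` bytes are buffered (None = all)."""
        self._buffer.seek(0, 2)  # always append at the end of the buffer
        while not self._exhausted and (target is None or self._size < target):
            want = self.CHUNK if target is None else max(target - self._size, self.CHUNK)
            chunk = self._source.read(want)
            if not chunk:
                self._exhausted = True
                break
            self._buffer.write(chunk)
            self._size += len(chunk)

    def read(self, n=-1):
        if n is None or n < 0:
            self._fill_to(None)
            end = self._size
        else:
            self._fill_to(self._pos + n)
            end = min(self._pos + n, self._size)
        self._buffer.seek(self._pos)
        data = self._buffer.read(max(end - self._pos, 0))
        self._pos += len(data)
        return data

    def tell(self):
        return self._pos

    def seek(self, offset, whence=0):
        if whence == 0:      # absolute position
            new_pos = offset
        elif whence == 1:    # relative to the current position
            new_pos = self._pos + offset
        elif whence == 2:    # relative to the end: must read everything first
            self._fill_to(None)
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if new_pos < 0:
            raise ValueError("negative seek position: %r" % (new_pos,))
        self._pos = new_pos
        return self._pos
```

Assuming the boto key object supports read(size), the wrapped object would then be passed in place of the key, along the lines of datafile.DataFileReader(SeekableReadWrapper(key), avro.io.DatumReader()). For large objects, the BytesIO buffer could be swapped for a tempfile.SpooledTemporaryFile so that data spills to disk instead of staying in memory.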

Licensed under: CC-BY-SA with attribution