Question

I'm currently working on a third-year project involving data from Twitter. The department has provided me with .lzo files containing a month's worth of Twitter data. The smallest is 4.9 GB and decompresses to 29 GB, so I'm trying to open the file and read it as I go. Is this possible, or do I need to decompress the file first and work with the data that way?

EDIT: I have attempted to read the file line by line and decompress each line as it is read.

UPDATE: Found a solution: reading the STDOUT of lzop -dc works like a charm.

Solution

How about starting the lzop binary in a subprocess with the -c switch (write to standard output, combined with -d to decompress) and then reading its STDOUT line by line?
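A minimal sketch of that approach, assuming the lzop binary is on the PATH and the data is line-oriented UTF-8 text. The function name iter_decompressed_lines and its command parameter are made up for illustration and are not part of any library:

```python
import subprocess


def iter_decompressed_lines(path, command=("lzop", "-dc")):
    """Yield text lines from a decompressor's stdout without loading the whole file.

    `command` is a hypothetical parameter: the program and flags to run,
    with `path` appended as the final argument. The default runs `lzop -dc`,
    which decompresses to standard output.
    """
    proc = subprocess.Popen([*command, path], stdout=subprocess.PIPE)
    try:
        # Iterating over the pipe reads one line at a time, so memory use
        # stays small even when the decompressed data is tens of GB.
        for raw_line in proc.stdout:
            yield raw_line.decode("utf-8", errors="replace").rstrip("\r\n")
    finally:
        # Runs on normal completion and when the caller stops iterating early.
        proc.stdout.close()
        proc.wait()


# Example use (the file name is a placeholder):
# for line in iter_decompressed_lines("tweets.lzo"):
#     handle(line)
```

Because the pipe is consumed lazily, the decompressor only runs as far ahead as the operating system's pipe buffer allows, so nothing needs to be written to disk.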

OTHER TIPS

I know of only one LZO library for Python, https://github.com/jd-boyd/python-lzo, and it requires full decompression (moreover, it decompresses the contents in memory).

So I think you'll need to decompress the files before working with them.

I know this is a very old question and the accepted answer is really good. I encountered a similar problem, and Google brought me here.

I'm just writing down my experience with LZO compression and the lzop program, in the hope that it helps someone who runs into the same question. I also wrote a simple Python module for dealing with lzo files; you can find it at https://github.com/ir193/python-lzo/

Regarding the question: reading an lzo-compressed file in place (without decompressing the whole file) can be done by reading one block at a time. An lzo file is divided into several blocks, and each block has a maximum size of about several MB. In my module, you can just use read(4096) or similar.

Actually, *.lzo files are created by lzop and have little to do with the python-lzo module mentioned in another answer (https://github.com/jd-boyd/python-lzo). That module is for compressing and decompressing strings; it does not handle the lzop file header or checksums. Don't use it if you want to open an existing lzo file.

Licensed under: CC-BY-SA with attribution