Question

I have about 200,000 text files stored in a single tar.bz2 archive. When I scan the archive to extract the data I need, it is extremely slow: it has to read through the entire archive to find the single file I am looking for. Is there any way to speed this up?

Also, I thought about organizing the files in the tar.bz2 so that the reader knows where to look. Is there any way to organize the files that are put into a bz2?

More Info/Edit: I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of files and compresses as thoroughly?

Solution

Do you have to use bzip2? Reading its documentation, it's quite clear it's not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old Zip format supports random access, though it might compress worse, of course.
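
As a rough illustration of the random-access point, here is a minimal sketch using Python's standard zipfile module. The file names and contents are made up for the example, and the archive is built in memory only to keep the snippet self-contained; a real archive would live on disk.

```python
import io
import zipfile


def build_zip(files):
    """Pack a {name: text} mapping into an in-memory Zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_member(zip_bytes, name):
    """Read one member by name.

    zipfile locates the member through the archive's central directory,
    so only that member's data is decompressed.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name).decode("utf-8")


# Hypothetical sample data standing in for the real text files.
sample = {f"doc_{i:06d}.txt": f"contents of file {i}\n" for i in range(1000)}
archive = build_zip(sample)
result = read_member(archive, "doc_000742.txt")
```

Python's zipfile also accepts zipfile.ZIP_BZIP2 as the compression method, but not every unzip tool can read such archives, so treat that as an option to test rather than a drop-in fix.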

OTHER TIPS

Bzip2 compresses in large blocks (up to 900 kB, which I believe is the default). One method that would dramatically speed up finding a file in the tar, at the cost of a worse compression ratio, would be to compress each file individually and then tar the results together. This is essentially what Zip-format files are (though they typically use Deflate, the algorithm behind zlib, rather than bzip2). Tar has no central index, but with an uncompressed outer tar you could cheaply skip from header to header and only have to decompress the specific file(s) you are looking for.
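
The "compress each file, then tar the results" idea could be sketched roughly as follows with Python's bz2 and tarfile modules. The ".bz2" member suffix, the helper names, and the sample data are assumptions made for illustration, not part of any standard layout.

```python
import bz2
import io
import tarfile


def build_tar_of_bz2(files):
    """Compress each text individually, then store the results in a plain tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:  # outer tar is NOT compressed
        for name, text in files.items():
            payload = bz2.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=name + ".bz2")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_one(tar_bytes, name):
    """Find one member by name and decompress only that member."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        # tarfile still walks the headers to find the name, but it seeks
        # past each member's data instead of decompressing it.
        member = tar.extractfile(name + ".bz2")
        return bz2.decompress(member.read()).decode("utf-8")


# Hypothetical sample data standing in for the real text files.
texts = {f"doc_{i:06d}.txt": f"contents of file {i}\n" for i in range(500)}
tar_bytes = build_tar_of_bz2(texts)
one_file = read_one(tar_bytes, "doc_000123.txt")
```

Expect the archive to be noticeably larger than a single tar.bz2 when the individual files are small, since each one is compressed without any shared context and tar adds a header per member.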

I don't think most tar programs offer much ability to organize files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries, though I've only used them once or twice). However, with a single tar.bz2 you'd still have the problem of having to decompress most of the data before you found what you were looking for.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow