Question

I'm looking for a general-purpose compression library that supports random access during decompression. I want to compress Wikipedia into a single compressed file, and at the same time be able to decompress/extract individual articles from it.

Of course, I could compress each article individually, but that won't give much of a compression ratio. I've heard that an LZO-compressed file consists of many chunks which can be decompressed separately, but I haven't found the API or documentation for that. I can also use the Z_FULL_FLUSH mode in zlib, but is there a better alternative?
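
Here is a minimal sketch of the Z_FULL_FLUSH idea I have in mind (in Python, using its built-in zlib bindings; the article texts are placeholders). Each full flush ends the output on a byte boundary and resets the compressor state, so a fresh decompressor can start at any recorded flush point:

import zlib

articles = [b"article one ...", b"article two ...", b"article three ..."]

comp = zlib.compressobj(9, zlib.DEFLATED, -15)   # -15 = raw deflate, no header
offsets, stream = [], b""
for text in articles:
    offsets.append(len(stream))                  # where this article starts
    stream += comp.compress(text)
    stream += comp.flush(zlib.Z_FULL_FLUSH)      # byte-aligned restart point
stream += comp.flush()                           # terminate the stream

# Random access: inflate from any flush point with a fresh decompressor.
d = zlib.decompressobj(-15)
data = d.decompress(stream[offsets[1]:])         # decodes article two onward
print(data.startswith(b"article two"))           # -> True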

Solution

xz-format files support an index, though by default the index is not useful. My compressor, pixz, creates files that do contain a useful index. You can use the functions in the liblzma library to find which block of xz data corresponds to which location in the uncompressed data.
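
As an illustration (a hedged sketch, not pixz itself): Python's built-in lzma module can read multi-block xz files, but its seek() is emulated by decompressing and discarding everything before the target, so true block-level random access still needs liblzma's index functions or a tool like pixz. The filename below is a placeholder.

import lzma

# Works, but seek() inflates and throws away everything before the
# target offset instead of consulting the xz block index.
with lzma.open("enwiki.xz", "rb") as f:
    f.seek(50_000_000)        # emulated: decompresses ~50 MB behind the scenes
    chunk = f.read(4096)      # 4 KiB from the middle of the archive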

OTHER TIPS

For seekable compression built on gzip, there are dictzip from the DICT server and sgzip from The Sleuth Kit.

Note that you can't write to either of these formats, but then seekable access is about reading anyway.
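
To make the mechanism concrete, here is a hedged sketch of how dictzip's random access works, going by the format described in its man page (the helper name is mine, and error handling is minimal): the gzip header's extra field carries an 'RA' subfield listing the compressed size of every chunk, and chunks end on full-flush boundaries, so any chunk can be inflated on its own.

import struct
import zlib

def dictzip_read_chunk(path, chunk_no):
    """Sketch of a dictzip chunk reader; not a hardened implementation."""
    with open(path, "rb") as f:
        head = f.read(10)                       # fixed gzip header
        assert head[:2] == b"\x1f\x8b", "not a gzip file"
        flg = head[3]
        assert flg & 4, "no extra field, so not dictzip"

        xlen, = struct.unpack("<H", f.read(2))
        extra = f.read(xlen)

        # Walk the extra-field subfields looking for 'RA'.
        chunk_sizes, i = None, 0
        while i + 4 <= len(extra):
            sub_id = extra[i:i + 2]
            sub_len, = struct.unpack("<H", extra[i + 2:i + 4])
            if sub_id == b"RA":
                ver, chlen, chcnt = struct.unpack("<3H", extra[i + 4:i + 10])
                chunk_sizes = struct.unpack(
                    "<%dH" % chcnt, extra[i + 10:i + 10 + 2 * chcnt])
            i += 4 + sub_len
        assert chunk_sizes, "no 'RA' subfield found"

        # Skip the optional FNAME / FCOMMENT / FHCRC header fields.
        for flag in (8, 16):                    # zero-terminated strings
            if flg & flag:
                while f.read(1) not in (b"\x00", b""):
                    pass
        if flg & 2:                             # 2-byte header CRC
            f.read(2)

        # Jump over the preceding chunks and inflate just the one we want.
        f.seek(sum(chunk_sizes[:chunk_no]), 1)
        raw = f.read(chunk_sizes[chunk_no])
        return zlib.decompressobj(-15).decompress(raw)  # raw deflate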

DotNetZip is a zip archive library for .NET.

Using DotNetZip, you can reference particular entries in the zip randomly, decompress them out of order, and get a stream that decompresses on the fly as it extracts an entry.

With those features, DotNetZip has been used to implement a Virtual Path Provider for ASP.NET that does exactly what you describe: it serves all the content for a particular website from a compressed ZIP file. It also works for websites with dynamic (ASP.NET) pages.

ASP.NET ZIP Virtual Path Provider, based on DotNetZip

The important code looks like this:

namespace Ionic.Zip.Web.VirtualPathProvider
{
    public class ZipFileVirtualPathProvider : System.Web.Hosting.VirtualPathProvider
    {
        ZipFile _zipFile;

        public ZipFileVirtualPathProvider (string zipFilename) : base () {
            _zipFile =  ZipFile.Read(zipFilename);
        }

        ~ZipFileVirtualPathProvider () { _zipFile.Dispose (); }

        public override bool FileExists (string virtualPath)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualPath, true);
            ZipEntry zipEntry = _zipFile[zipPath];

            if (zipEntry == null)
                return false;

            return !zipEntry.IsDirectory;
        }

        public override bool DirectoryExists (string virtualDir)
        {
            string zipPath = Util.ConvertVirtualPathToZipPath (virtualDir, false);
            ZipEntry zipEntry = _zipFile[zipPath];

            if (zipEntry == null)      // no such entry in the archive
                return false;

            return zipEntry.IsDirectory;
        }

        public override VirtualFile GetFile (string virtualPath)
        {
            return new ZipVirtualFile (virtualPath, _zipFile);
        }

        public override VirtualDirectory GetDirectory (string virtualDir)
        {
            return new ZipVirtualDirectory (virtualDir, _zipFile);
        }

        public override string GetFileHash(string virtualPath, System.Collections.IEnumerable virtualPathDependencies)
        {
            return null;
        }

        public override System.Web.Caching.CacheDependency GetCacheDependency(String virtualPath, System.Collections.IEnumerable virtualPathDependencies, DateTime utcStart)
        {
            return null;
        }
    }
}

And VirtualFile is defined like this:

namespace Ionic.Zip.Web.VirtualPathProvider
{
    class ZipVirtualFile : VirtualFile
    {
        ZipFile _zipFile;

        public ZipVirtualFile (String virtualPath, ZipFile zipFile) : base(virtualPath) {
            _zipFile = zipFile;
        }

        public override System.IO.Stream Open () 
        {
            ZipEntry entry = _zipFile[Util.ConvertVirtualPathToZipPath(base.VirtualPath,true)];
            return entry.OpenReader();
        }
    }
}

BGZF (Blocked GNU Zip Format) is the format used in genomics. http://biopython.org/DIST/docs/api/Bio.bgzf-module.html

It is part of the samtools C library and is really just a simple hack around gzip: the file is written as a series of small, independently compressed gzip blocks. You could re-implement it yourself if you don't want to use the samtools C implementation or the Picard Java implementation; Biopython provides a Python variant.
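
As a hedged sketch of the Biopython variant (API names per the Bio.bgzf documentation; the filename and article texts are placeholders): BGZF writes a series of gzip blocks of at most 64 KiB, and a 64-bit "virtual offset" (the compressed offset of a block, shifted left 16 bits, plus the offset within that block) addresses any byte directly.

from Bio import bgzf

# Write each article, then flush so the next one starts a fresh BGZF
# block; tell() just before each write records its virtual offset.
out = bgzf.BgzfWriter("articles.bgz", "wb")
voffsets = []
for text in [b"article one", b"article two", b"article three"]:
    voffsets.append(out.tell())
    out.write(text)
    out.flush()                    # end the block, creating a seek point
out.close()

# Random access: seek straight to a recorded virtual offset.
reader = bgzf.BgzfReader("articles.bgz", "rb")
reader.seek(voffsets[2])
print(reader.read(13))             # -> b"article three"
reader.close()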

You haven't specified your OS. Would it be possible to store your file in a compressed directory managed by the OS? Then you would get the "seekable" part as well as the compression. The OS handles the CPU overhead for you, though access times become unpredictable.

I'm using MS Windows Vista, unfortunately, and there Windows Explorer can browse into zip files as if they were normal folders. Presumably that still works on Windows 7 (which I'd like to be on). I think I've done the same with the corresponding utility on Ubuntu, but I'm not sure, and I could also test it on Mac OS X, I suppose.

If individual articles are too short to get a decent compression ratio, the next-simplest approach is to tar up a batch of Wikipedia articles -- say, 12 articles at a time, or however many articles it takes to fill up a megabyte. Then compress each batch independently.

In principle, that gives better compression than compressing each article individually, but worse compression than solid compression of all the articles together. Extracting article #12 from a compressed batch requires decompressing the entire batch (and then throwing the first 11 articles away), but that's still much, much faster than decompressing half of Wikipedia.
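
Here is a minimal sketch of that batch scheme in Python (the index layout and the choice of bz2 are mine, purely illustrative; simple concatenation stands in for tar):

import bz2

BATCH_TARGET = 1 << 20    # aim for about 1 MiB of raw text per batch

def build(articles):
    """articles: iterable of (title, bytes). Returns (batches, index),
    where index maps title -> (batch number, start, length) measured in
    uncompressed bytes within that batch."""
    batches, index = [], {}
    buf = bytearray()
    for title, text in articles:
        index[title] = (len(batches), len(buf), len(text))
        buf += text
        if len(buf) >= BATCH_TARGET:
            batches.append(bz2.compress(bytes(buf)))
            buf.clear()
    if buf:
        batches.append(bz2.compress(bytes(buf)))
    return batches, index

def extract(batches, index, title):
    batch_no, start, length = index[title]
    data = bz2.decompress(batches[batch_no])    # inflate the whole batch
    return data[start:start + length]           # then keep just one article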

Many compression programs break up the input stream into a sequence of "blocks", and compress each block from scratch, independently of the other blocks. You might as well pick a batch size about the size of a block -- larger batches won't get any better compression ratio, and will take longer to decompress.

I have experimented with several ways to make it easier to start decoding a compressed database in the middle. Alas, so far the "clever" techniques I've applied still have a worse compression ratio and take more operations to produce a decoded section than the much simpler "batch" approach.

For more sophisticated techniques, you might look at

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow