Question

I can successfully read a PDF file from a tar.gz archive, but I have a performance problem: it takes a long time to open a tar.gz archive that contains more than 1000 small PDF files, each 10 - 25 MB in size. The total size of the archive is 2 GB.

How can I improve the performance of reading a file from the compressed archive?

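import java.io.BufferedInputStream;
import java.io.FileInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

// tarName, fileName and out are defined elsewhere.
FileInputStream fin = new FileInputStream(tarName);
BufferedInputStream in = new BufferedInputStream(fin);
GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in);
// Wrap the gzip stream in a single TarArchiveInputStream.
TarArchiveInputStream tarIn = new TarArchiveInputStream(gzIn);
TarArchiveEntry entry = null;

byte[] buffer = new byte[8192];
int nrBytesRead;

// Entries can only be visited in order, decompressing everything before the match.
while ((entry = (TarArchiveEntry) tarIn.getNextEntry()) != null) {
    System.out.println("it finds a file " + entry.getName());
    if (entry.getName().equals(fileName)) {
        // Copy the matching entry's bytes to out; read() returns -1 at the end of the entry.
        while ((nrBytesRead = tarIn.read(buffer)) != -1) {
            out.write(buffer, 0, nrBytesRead);
        }
        break;
    }
}
tarIn.close();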

Solution

The tar.gz format was not designed for random access. It was intended for cases where the entire package is unpacked at once. That allows the series of files to be compressed as a single stream, which generally improves compression, especially for many small files. However, if you try to pull out just one file from somewhere in the middle, you first need to decompress all the files up to that point.

For random access to individual files, you should consider repackaging using the zip format. The compression won't be as good, but you can pluck out individual files very quickly. In Java, look at the java.util.zip.ZipFile class.
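
As a rough illustration of that suggestion, here is a minimal sketch using only the JDK's java.util.zip classes. The class name ZipRandomAccessDemo, the helper methods, and the entry names (doc0.pdf and so on) are made up for the example, and the sample archive holds short text instead of real PDFs. ZipFile.getEntry looks the name up in the archive's index, so the other entries do not have to be decompressed first.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class ZipRandomAccessDemo {

    // Copies one named entry from a zip archive to 'out'.
    // Returns false if the archive has no entry with that name.
    static boolean extractEntry(Path zipPath, String entryName, OutputStream out)
            throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            // Direct lookup by name; no scan over the preceding entries.
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                return false;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
            return true;
        }
    }

    // Builds a small sample archive in a temp file (hypothetical entry names and contents).
    static Path createSampleZip() throws IOException {
        Path zipPath = Files.createTempFile("sample", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            for (int i = 0; i < 3; i++) {
                zos.putNextEntry(new ZipEntry("doc" + i + ".pdf"));
                zos.write(("contents of document " + i).getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return zipPath;
    }

    public static void main(String[] args) throws IOException {
        Path zipPath = createSampleZip();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean found = extractEntry(zipPath, "doc1.pdf", out);
        System.out.println(found + ": " + new String(out.toByteArray(), StandardCharsets.UTF_8));
        Files.delete(zipPath);
    }
}
```

Since PDF files are usually compressed internally already, the loss in compression ratio from switching to zip may be small for this kind of archive, though that is worth measuring on the real data.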

Licensed under: CC-BY-SA with attribution