Question

I want to copy the contents of a .tar.gz file to 2 folders. It has around 20 files, and the total unzipped size will be >20 GB.
I used TrueZIP for this.

TFile archive = new TFile(absoluteZipName); // the .tar.gz archive
TFile[] archFiles = archive.listFiles();    // takes too much time
for (TFile t : archFiles) {
    String fileName = t.getName();
    if (fileName.endsWith(".dat")) {
        t.cp(new File(destination1, fileName));
    } else if (fileName.endsWith(".txt")) {
        t.cp(new File(destination2, fileName));
    }
}
It takes about 3 times as long as the tar xzf command (untar on Linux). Is there any way to optimize this code for faster copying? Memory is not an issue.

The following code allows fast copying. Thanks npe for the good advice.
(NB: I don't have the privilege to post an answer right now, which is why I'm editing the question itself.)

// uses org.apache.commons.compress.archivers.* and org.apache.commons.compress.utils.IOUtils
InputStream is = new FileInputStream(absoluteZipName);
ArchiveInputStream input = new ArchiveStreamFactory()
    .createArchiveInputStream(ArchiveStreamFactory.TAR, new GZIPInputStream(is));

ArchiveEntry entry;
while ((entry = input.getNextEntry()) != null) {
    // use ArchiveEntry#getName() to do the conditional stuff...
    OutputStream outputFileStream;
    if (entry.getName().endsWith(".dat")) {
        outputFileStream = new FileOutputStream(new File(destination1, entry.getName()));
    } else if (entry.getName().endsWith(".txt")) {
        outputFileStream = new FileOutputStream(new File(destination2, entry.getName()));
    } else {
        continue; // skip entries that are neither .dat nor .txt
    }
    try {
        IOUtils.copy(input, outputFileStream, 10485760); // 10 MiB copy buffer
    } finally {
        outputFileStream.close(); // avoid leaking file handles
    }
}
input.close();


Will threading the file copy reduce the time? With TrueZIP it didn't, as they already use threading. Anyway, I will try it tomorrow and let you know.

Solution 3

Thanks npe, this is the final version I ended up with; anyway, it takes less time than tar xzf. The final code snippet looks like this.

InputStream is = new FileInputStream(absoluteZipName);
ArchiveInputStream input = new ArchiveStreamFactory()
    .createArchiveInputStream(ArchiveStreamFactory.TAR, new GZIPInputStream(is));

ArchiveEntry entry;
while ((entry = input.getNextEntry()) != null) {
    // use ArchiveEntry#getName() to do the conditional stuff...
    OutputStream outputFileStream;
    if (entry.getName().endsWith(".dat")) {
        outputFileStream = new FileOutputStream(new File(destination1, entry.getName()));
    } else if (entry.getName().endsWith(".txt")) {
        outputFileStream = new FileOutputStream(new File(destination2, entry.getName()));
    } else {
        continue; // skip entries that are neither .dat nor .txt
    }
    try {
        IOUtils.copy(input, outputFileStream, 10485760); // 10 MiB copy buffer
    } finally {
        outputFileStream.close(); // avoid leaking file handles
    }
}
input.close();

I hope I can do some more optimizations; I will do that later. Thanks a lot.

OTHER TIPS

It seems that listFiles() decompresses your gzip file in order to scan through the tar file and get all the filenames, and then cp(File, File) scans it again to position the stream on a given file.

What I'd do is use Apache Commons Compress and do an iterator-like scan on the input stream, sort of like this:

InputStream is = new FileInputStream("/path/to/my/file");
ArchiveInputStream input = new ArchiveStreamFactory()
   .createArchiveInputStream(ArchiveStreamFactory.TAR, new GZIPInputStream(is));

ArchiveEntry entry;
while ((entry = input.getNextEntry()) != null) {

    // use ArchiveEntry#getName() to do the conditional stuff...

}

Read the javadoc for ArchiveInputStream#getNextEntry() and ArchiveEntry for more info.
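
As a small, hedged illustration of that loop body (not part of the original answer): ArchiveEntry#isDirectory() and ArchiveInputStream#canReadEntryData(ArchiveEntry) can be used to skip entries that should not be copied:

while ((entry = input.getNextEntry()) != null) {
    if (entry.isDirectory() || !input.canReadEntryData(entry)) {
        continue; // skip directories and entries this stream cannot read
    }
    // entry.getName() then drives the conditional copy, as in the snippets above
}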

The reason for the performance issue that you've witnessed is that the TAR file format lacks a central directory. But because TrueZIP is a virtual file system and it cannot predict the access pattern of the client application, it has to unzip the entire TAR file to a temporary directory upon first access. This is what happens on TFile.listFiles(). Then you copy the entries from the temporary directory to the target directories. So all in all each entry byte will be read or written four times.

To get best performance, you have two options:

(a) You could switch to the ZIP file format and stick with the TrueZIP File* API. ZIP files have a Central Directory, so reading them does not involve creating temporary files.
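
For illustration, here is a minimal sketch of option (a), assuming the data has been repackaged as a .zip archive and that the TrueZIP 7 File* API (de.schlichtherle.truezip.file.TFile) plus its ZIP driver are on the class path; the archive path is hypothetical and destination1/destination2 are the directories from the question:

TFile archive = new TFile("/path/to/archive.zip"); // hypothetical repackaged archive
for (TFile entry : archive.listFiles()) {          // no temporary extraction: ZIP has a central directory
    if (entry.getName().endsWith(".dat")) {
        entry.cp(new File(destination1, entry.getName()));
    } else if (entry.getName().endsWith(".txt")) {
        entry.cp(new File(destination2, entry.getName()));
    }
}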

(b) You could process the TAR.GZ file as a stream, as shown by npe. I would then combine this with a java.util.zip.GZIPInputStream because that implementation is based on fast C code. I would also use TrueZIP's Streams.copy(InputStream, OutputStream) method because it will use multithreading for really fast bulk copying.
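
Here is a minimal sketch of option (b), under the assumption that the bulk-copy helper is de.schlichtherle.truezip.io.Streams.copy(InputStream, OutputStream) (check the exact package and its close semantics for your TrueZIP version); the FilterInputStream wrapper is only there to keep the archive stream open in case Streams.copy closes its input:

InputStream is = new BufferedInputStream(new FileInputStream(absoluteZipName));
ArchiveInputStream input = new ArchiveStreamFactory()
    .createArchiveInputStream(ArchiveStreamFactory.TAR, new GZIPInputStream(is));

ArchiveEntry entry;
while ((entry = input.getNextEntry()) != null) {
    File target;
    if (entry.getName().endsWith(".dat")) {
        target = new File(destination1, entry.getName());
    } else if (entry.getName().endsWith(".txt")) {
        target = new File(destination2, entry.getName());
    } else {
        continue; // not interested in this entry
    }
    // Shield the archive stream so the bulk copy cannot close it between entries.
    InputStream shield = new FilterInputStream(input) {
        @Override
        public void close() { /* intentionally left open */ }
    };
    OutputStream out = new FileOutputStream(target);
    try {
        Streams.copy(shield, out); // assumed: de.schlichtherle.truezip.io.Streams
    } finally {
        out.close();
    }
}
input.close();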

Licensed under: CC-BY-SA with attribution