Question

I am trying to extract contents of a zip file of size ~500MB containing around 250K files.

Here's what I am trying to do:

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

import de.schlichtherle.truezip.file.TFile;
import de.schlichtherle.truezip.file.TFileInputStream;

public class ArchiveReaderExecutor {

    private final ExecutorService pool;

    public ArchiveReaderExecutor() {
        pool = Executors.newFixedThreadPool(8);
    }

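    /**
     * Lists every file in the archive and splits that list into fixed-size
     * batches so that each batch can be handed to a worker thread.
     *
     * @param archive the zip file to read
     * @return the archive's files, grouped into batches
     */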
    public List<List<TFile>> splitArchiveFile(final File archive) {
        final TFile tFile = new TFile(archive.getAbsolutePath());
        final ArrayList<TFile> individualFiles = new ArrayList<TFile>();
        recursivelyReadLeafnodes(tFile, individualFiles);
        final List<List<TFile>> returnList = new ArrayList<List<TFile>>();

        /*
         * Splitting the entire list into list of objects for batch processing
         */
        final int batchSize = 100;
        List<TFile> innerList = null;

        for (TFile splitFile : individualFiles) {
            // Start a new batch when there is none yet or the current one is full
            if (innerList == null || innerList.size() >= batchSize) {
                innerList = new ArrayList<TFile>();
                returnList.add(innerList);
            }
            innerList.add(splitFile);
        }
        return returnList;
    }

    public List<TFile> recursivelyReadLeafnodes(TFile inputTFile,
            ArrayList<TFile> individualFiles) {
        TFile[] tfiles = null;

        if (inputTFile.isArchive() || inputTFile.isDirectory()) {
            tfiles = inputTFile.listFiles();
        } else {
            // A plain file: wrap it in a one-element array
            tfiles = new TFile[] { inputTFile };
        }

        for (final TFile tFile : tfiles) {
            if (tFile.isFile() && !tFile.getName().startsWith(".")) {
                individualFiles.add(tFile);
            } else if (tFile.isDirectory()) {
                recursivelyReadLeafnodes(tFile, individualFiles);
            }
        }

        return individualFiles;
    }

    public void runExtraction() {

        File src = new File("Really_Big_File.zip");
        List<List<TFile>> files = splitArchiveFile(src);
        for (List<TFile> list : files) {
            pool.execute(new FileExtractorSavor(list));
        }
        pool.shutdown();

    }


    class FileExtractorSavor implements Runnable{
        List<TFile> files;
        public FileExtractorSavor(List<TFile> files) {
            this.files = files;
        }
        @Override
        public void run() {
            File file = null;
            TFileInputStream in = null;
            for (TFile tFile : files) {
                try {
                    in = new TFileInputStream(tFile);
                    // Resolve the target inside the output directory
                    file = new File("Target_Location", tFile.getName());
                    // Copy raw bytes so that binary entries are not corrupted
                    FileUtils.copyInputStreamToFile(in, file);
                } catch (IOException e) {
                    e.printStackTrace();
                } finally {
                    IOUtils.closeQuietly(in);
                }
            }

        }

    }

    public static void main(String[] args) {
        new ArchiveReaderExecutor().runExtraction();
    }
}

When I run this code, many of the pool threads end up in a waiting/blocked state. Here's the thread dump:

"pool-1-thread-7" prio=5 tid=7fd8093dd000 nid=0x11d3f3000 waiting for monitor entry [11d3f2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at de.schlichtherle.truezip.socket.ConcurrentInputShop$SynchronizedConcurrentInputStream.close(ConcurrentInputShop.java:223)
    - waiting to lock <785460200> (a de.schlichtherle.truezip.fs.archive.FsDefaultArchiveController$Input)
    at de.schlichtherle.truezip.io.DecoratingInputStream.close(DecoratingInputStream.java:79)
    at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:178)
    at ArchiveReaderExecutor$FileExtractorSavor.run(ArchiveReaderExecutor.java:136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - <79ed370e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"pool-1-thread-5" prio=5 tid=7fd8093db800 nid=0x11d1ed000 waiting for monitor entry [11d1ec000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at de.schlichtherle.truezip.socket.ConcurrentInputShop$SynchronizedConcurrentInputStream.close(ConcurrentInputShop.java:223)
    - waiting to lock <785460200> (a de.schlichtherle.truezip.fs.archive.FsDefaultArchiveController$Input)
    at de.schlichtherle.truezip.io.DecoratingInputStream.close(DecoratingInputStream.java:79)
    at org.apache.commons.io.IOUtils.closeQuietly(IOUtils.java:178)
    at ArchiveReaderExecutor$FileExtractorSavor.run(ArchiveReaderExecutor.java:136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)

   Locked ownable synchronizers:
    - <79ed46468> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

I also tried using:

TFile.cp_r(src, dst, TArchiveDetector.NULL, TArchiveDetector.NULL);

It took much longer as it was running on a single thread.

My question: what is the fastest way to extract the contents of a zip file in Java using TrueZIP?

Solution

There's nothing wrong here. TrueZIP/TrueVFS maintains a single file descriptor per mounted archive file. When multiple threads read the contents of the archive file concurrently, the TrueZIP/TrueVFS kernel serializes all access so that only one thread at a time uses the file descriptor and updates its position. All other threads are blocked in the meantime, which is what the thread dump shows: the pool threads are waiting to lock the same FsDefaultArchiveController$Input object.
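
To make the effect concrete, here is a toy model of that behavior. It is not TrueZIP code and does not use the TrueZIP API; `SharedInput` and `readAllConcurrently` are hypothetical names invented for this sketch, and a byte array stands in for the archive file. It assumes only what is described above: one shared read position guarded by one lock.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model (not TrueZIP code): one shared cursor guarded by one monitor. */
public class SerializedArchiveModel {

    /** Stands in for the single file descriptor of a mounted archive. */
    public static final class SharedInput {
        private final byte[] data;
        private int position; // shared cursor, like a file offset
        private final AtomicInteger inside = new AtomicInteger();
        private final AtomicInteger maxInside = new AtomicInteger();

        public SharedInput(byte[] data) {
            this.data = data;
        }

        /** Seeks and reads under one lock, so readers cannot interleave. */
        public synchronized byte[] read(int offset, int length) {
            int now = inside.incrementAndGet();
            maxInside.accumulateAndGet(now, Math::max);
            position = offset; // "seek"
            byte[] out = new byte[length];
            for (int i = 0; i < length; i++) {
                out[i] = data[position++]; // "read" advances the shared cursor
            }
            inside.decrementAndGet();
            return out;
        }

        /** Highest number of threads ever observed inside read() at once. */
        public int maxConcurrentReaders() {
            return maxInside.get();
        }
    }

    /**
     * Reads fixed-size "entries" from 8 pool threads, appends them to sink in
     * entry order, and returns the peak number of simultaneous readers.
     */
    public static int readAllConcurrently(SharedInput input, int entries,
            int entrySize, List<String> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < entries; i++) {
                final int offset = i * entrySize;
                Callable<String> task = () -> new String(
                        input.read(offset, entrySize), StandardCharsets.US_ASCII);
                futures.add(pool.submit(task));
            }
            for (Future<String> future : futures) {
                sink.add(future.get());
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return input.maxConcurrentReaders();
    }

    public static void main(String[] args) {
        byte[] data = "aaaabbbbccccdddd".getBytes(StandardCharsets.US_ASCII);
        List<String> entries = new ArrayList<>();
        int peak = readAllConcurrently(new SharedInput(data), 4, 4, entries);
        System.out.println("entries=" + entries + ", peak readers=" + peak);
    }
}
```

Every entry is read correctly, but the peak number of threads inside `read` never exceeds one, however large the pool is. In this model the extra threads only queue up on the lock, which mirrors the BLOCKED threads in the dump above.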

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow