Question

I have over 100,000 files, almost 4 GB in total. It is HTML, so it can be compressed by 70-80%. The files range from 200 KB to almost 10 MB.

I am developing an application that transforms the files from XML to HTML. At the end, the application archives the HTML directory into a ZIP file.

I have used a Maven plugin called "copy-maven-plugin". The documentation for this plugin is very good and it was easy to use. By default its archive functionality uses Ant zip, but you can switch it to TrueZip (for unpacking it is the opposite). Anyway, I tried to pack my monster folder both ways: the default Ant zip took 43 minutes and TrueZip 38 minutes. Both are way too much in my opinion.

Then I tried the same on the command line with "zip -r archive folder", and that took only 4 minutes. EDIT: I have not been able to get zip under 40 minutes lately; I think the 4-minute run might have produced a corrupt ZIP.

So I was thinking that Java might not be that good when it comes to processing this number of files.

Does anyone have experience with this kind of problem?

I am thinking of maybe implementing it myself; would changing the read buffer size help? I know you can control the chunk of data read by using ZipInputStream/ZipOutputStream with Zip4j to create/unzip the archive with your own buffer size, but I have not tried it. When it takes forever, I can't keep waiting to find out ;-)

As of last night, Maven calls exec on a zipIt.sh (zip -r ...) to do the work in a reasonable time, but I would like to give Java the benefit of the doubt.

Update 1: I have tested different approaches (all at the default compression level):

  1. Zip4j from Java. It took only 3 minutes, but the file was corrupt. It seems Zip4j does not handle this number of files.
  2. Ant zip (via a Maven plugin). Compression: around 980 MB. Slow speed: around 40 min
  3. tar + xz from the command line. Compression: 567 MB. Poor speed: 63 min
  4. zip from the command line. Compression: 981 MB. Speed: 40 min
  5. tar + bz2 from the command line. Compression: 602 MB. Speed: 13 min
  6. tar + gz from the command line. Compression: 864 MB. Speed: 5 min
  7. java.util.zip.ZipOutputStream. Compression: 986 MB. Speed: a blazing 4 min 18 sec

Both tar + bz2 and tar + gz seem to be good alternatives and give me the choice of whether compression or speed is most important.

I had not tested JDK 7's ZipOutputStream before, but it seems I might have solved it: I used a read buffer of 64 KB (64 * 1024) and it works like a charm. It seems I struck gold with Java after all :-)

This is my implementation

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
// IOUtils, StopWatch and logger come from Commons IO, Commons Lang and SLF4J (or similar);
// Config, Property and Zipper are my application's own classes.

public static void main(String [] args) {
    String outputFile = Config.getProperty(Property.ZIP_FILE);
    String folderToAdd = Config.getProperty(Property.HTML_FOLDER);
    Zipper.zip(outputFile, folderToAdd, 64*1024);
}

private static void zip(String zipFile, String sourceDirectory, int readChunk) {

    ZipOutputStream out = null;
    try {

        //create byte buffer
        byte[] buffer = new byte[readChunk];

        File dirObj = new File(sourceDirectory);
        out = new ZipOutputStream(new FileOutputStream(zipFile));
        logger.info("Creating zip {} with read buffer '{}'", zipFile, readChunk);
        addDir(dirObj, out, buffer);

    } catch (IOException ioe) {
        logger.error("IOException :" + ioe);
    } finally {
        IOUtils.closeQuietly(out);
    }

}

private static void addDir(File dirObj, ZipOutputStream out, byte [] readChunk) throws IOException {
    logger.debug("Zipping folder '{}'", dirObj.getName());
    StopWatch watch = new StopWatch();
    watch.start();

    File[] files = dirObj.listFiles();

    for (File file : files != null ? files : new File[0]) {
        if (file.isDirectory()) {
            addDir(file, out, readChunk);
            continue;
        }
        FileInputStream in = null;
        try {
            in = new FileInputStream(file);
            // note: each entry is stored under the file's absolute path
            out.putNextEntry(new ZipEntry(file.getAbsolutePath()));
            int len;
            while ((len = in.read(readChunk)) > 0) {
                out.write(readChunk, 0, len);
            }

        } finally {
            out.closeEntry();
            IOUtils.closeQuietly(in);
        }
    }
    watch.stop();
    logger.debug("Zipped folder {} in {} seconds.", dirObj.getName(), watch);
}

Solution

I doubt the problem is the number of files as such. You need to be able to manipulate the ZIP entries without unpacking and repacking all of them; that can make a significant difference, I would expect around 10x. It could be done in Java, but I suspect most libraries are not designed for it.
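
For what it's worth, a minimal sketch of what "manipulating entries without repacking" could look like in Java, assuming Apache Commons Compress (a library choice of mine, not something mentioned in the question): entries are copied in their still-compressed form, so they can be kept, dropped, or supplemented without re-deflating anything. File names are made up for illustration.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import org.apache.commons.compress.archivers.zip.ZipFile;

// Copies every entry of an existing archive into a new one without
// inflating/deflating the data.
public class RawZipCopy {
    public static void main(String[] args) throws IOException {
        try (ZipFile source = new ZipFile(new File("old.zip"));
             ZipArchiveOutputStream target =
                     new ZipArchiveOutputStream(new File("new.zip"))) {
            Enumeration<ZipArchiveEntry> entries = source.getEntries();
            while (entries.hasMoreElements()) {
                ZipArchiveEntry entry = entries.nextElement();
                // getRawInputStream returns the still-compressed bytes
                try (InputStream raw = source.getRawInputStream(entry)) {
                    target.addRawArchiveEntry(entry, raw);
                }
            }
            // new entries could be added here with
            // putArchiveEntry / write / closeArchiveEntry before closing
        }
    }
}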

What you can do is call zip from Java, if that appears to do what you want. A number of Maven plugins use command-line tools this way (especially those for version control).
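
A minimal sketch of shelling out to the zip command from Java; the archive name, folder, and working directory are hypothetical:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

// Invokes the external zip binary, so the archiving speed is that of the
// native tool rather than a Java library.
public class ZipExec {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("zip", "-r", "archive.zip", "html");
        pb.directory(new File("/path/to/output"));  // hypothetical working directory
        pb.redirectErrorStream(true);               // merge stderr into stdout
        Process process = pb.start();
        // drain the output so the process cannot block on a full pipe
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("zip exited with code " + exitCode);
        }
    }
}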

BTW, you may get better compression using something like tar + bz2. It compresses more because it compresses the whole archive rather than each file individually, but it also means you can't touch the archive without uncompressing/recompressing the whole thing (unlike JAR/ZIP, where you can).
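
If you want tar + bz2 without leaving Java, a sketch using Apache Commons Compress (again my assumption of a suitable library, not something the answer prescribes; the folder and output names are made up):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
import org.apache.commons.compress.utils.IOUtils;

// Writes the "html" folder into html.tar.bz2: the tar stream is wrapped in a
// bzip2 stream, so the whole archive is compressed as one block.
public class TarBz2 {
    public static void main(String[] args) throws IOException {
        File source = new File("html");  // hypothetical input folder
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                new BZip2CompressorOutputStream(
                        new BufferedOutputStream(new FileOutputStream("html.tar.bz2"))))) {
            tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU); // allow long paths
            addDir(tar, source, "");
        }
    }

    private static void addDir(TarArchiveOutputStream tar, File dir, String prefix) throws IOException {
        File[] files = dir.listFiles();
        for (File file : files != null ? files : new File[0]) {
            String name = prefix + file.getName();
            if (file.isDirectory()) {
                addDir(tar, file, name + "/");
            } else {
                tar.putArchiveEntry(new TarArchiveEntry(file, name));
                try (FileInputStream in = new FileInputStream(file)) {
                    IOUtils.copy(in, tar);  // stream the file contents into the entry
                }
                tar.closeArchiveEntry();
            }
        }
    }
}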

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow