Just for posterity's sake, after some testing back and forth, the answer I finally ended up using goes as follows (with each complete iteration starting from scratch, with closed files, in a while (true) loop):
Use DataInputStream.readFully to pull the entire (50 meg, in this case) zip file into a byte[].
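A minimal sketch of that first step, assuming the archive is small enough to index with an int (the class and method names here are made up for illustration):

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    class ZipSlurper {
        // Pull the whole archive into memory in one blocking read.
        // The int cast is fine here because the file is only ~50 MB.
        static byte[] readWholeFile(File zipFile) throws IOException {
            byte[] buffer = new byte[(int) zipFile.length()];
            try (DataInputStream in = new DataInputStream(new FileInputStream(zipFile))) {
                in.readFully(buffer); // readFully blocks until the array is filled
            }
            return buffer;
        }
    }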
Spawn worker threads (one per physical CPU core, 4 in my case), each of which takes that byte[] and creates its own ZipInputStream(ByteArrayInputStream). The first worker skips 0 entries, the second skips 1, the third skips 2, etc., so they're all offset from each other by one. The worker threads do not synchronize at all, so each has its own local copy of the zip file's metadata and what-not. This is thread-safe because the zip file data is read-only and the workers are not sharing decompressed data.
Each worker thread reads an entry and processes it, and then skips enough entries so that they are all again offset from each other by one. So the first thread reads entries 0, 4, 8..., the second reads 1, 5, 9..., and so forth.
All the workers are pulled back in with .join().
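Put together, the whole thing looks roughly like the sketch below. The class and method names (ParallelZipRead, ZipWorker, process) are invented for illustration, and the per-entry work is just a placeholder that drains the decompressed bytes; "skipping" an entry simply means calling getNextEntry() without reading that entry's data.

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ParallelZipRead {

        // Each worker owns its own ZipInputStream over the shared, read-only byte[],
        // so the zip metadata is parsed independently and no locking is needed.
        static class ZipWorker implements Runnable {
            private final byte[] zipBytes;
            private final int offset;  // index of the first entry this worker handles
            private final int stride;  // total number of workers

            ZipWorker(byte[] zipBytes, int offset, int stride) {
                this.zipBytes = zipBytes;
                this.offset = offset;
                this.stride = stride;
            }

            @Override
            public void run() {
                try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
                    int index = 0;
                    for (ZipEntry entry; (entry = zin.getNextEntry()) != null; index++) {
                        // "Skipping" just means advancing past entries that belong to
                        // other workers; only every stride-th entry is decompressed here.
                        if (index % stride == offset) {
                            process(entry, zin);
                        }
                        zin.closeEntry();
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            // Placeholder for the real per-entry work: drain the decompressed bytes.
            private void process(ZipEntry entry, ZipInputStream zin) throws IOException {
                byte[] buf = new byte[8192];
                while (zin.read(buf) != -1) { /* consume */ }
            }
        }

        public static void main(String[] args) throws Exception {
            // Step 1: one pass of pure IO to pull the whole archive into memory.
            File file = new File(args[0]);
            byte[] zipBytes = new byte[(int) file.length()];
            try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
                in.readFully(zipBytes);
            }

            // Step 2: one worker per core (availableProcessors() counts logical cores,
            // whereas I used one thread per physical core), each offset by one entry.
            int numThreads = Runtime.getRuntime().availableProcessors();
            Thread[] threads = new Thread[numThreads];
            for (int i = 0; i < numThreads; i++) {
                threads[i] = new Thread(new ZipWorker(zipBytes, i, numThreads));
                threads[i].start();
            }

            // Step 3: pull all the workers back in.
            for (Thread t : threads) {
                t.join();
            }
        }
    }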
My times were as follows:
Reading the zip file into the byte[] with no unzipping at all (just the IO) gives an average of 0.1 sec for every iteration.
Using a straight ZipFile directly on the underlying file as normal yields an initial spike of 0.5 sec, followed by an average of 0.26 sec for each iteration thereafter (starting fresh after closing the previous ZipFile).
Reading the zip file into a byte[] and creating a ZipInputStream(ByteArrayInputStream) from it, with no multithreading at all, results in an initial spike of 0.3 sec, followed by an average of 0.26 sec for each iteration thereafter, showing that disk caching was having an effect that made the random-access and initial-read approaches equivalent.
Reading the zip file into a byte[], spawning 4 worker threads with that byte[] as described above, and waiting for them to finish brought the time back down to an average of 0.1 sec for every iteration.
So, the verdict is: by this method I have successfully brought the processing of a moderately sized zip file on a moderately powerful computer down to the time it takes to simply read the file physically, with the additional decompression step no longer noticeable at all. Obviously this same method on a huge zip file with tens of thousands of entries would still yield a massive speedup.
It seems I wasn't trying to optimize away nothing, considering I reduced the processing time for my sample file (which is around the size of the biggest one I'll need to work with) to 38% of the simple single-threaded method.
Considering how incredibly well this hack job worked, imagine the possible speedup with a native Java zip-reader class actually designed to do this, without the built-in synchronization.