Just for posterity's sake, after some testing back and forth, the answer I finally ended up using goes as follows (with each complete iteration starting from scratch, with closed files, in a while (true) loop):
Use DataInputStream.readFully to pull the entire (50 meg, in this case) zip file into a byte[].
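A minimal sketch of that first step, assuming the archive is small enough to index with an int (the class and method names here are made up for illustration):

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    class ZipSlurper {
        // Pull the whole archive into memory in one blocking read.
        // The int cast is fine here because the file is only ~50 MB.
        static byte[] readWholeFile(File zipFile) throws IOException {
            byte[] buffer = new byte[(int) zipFile.length()];
            try (DataInputStream in = new DataInputStream(new FileInputStream(zipFile))) {
                in.readFully(buffer); // readFully blocks until the array is filled
            }
            return buffer;
        }
    }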
Spawn worker threads (one per physical CPU core, 4 in my case), each of which takes that byte[] and creates its own ZipInputStream(ByteArrayInputStream). The first worker skips 0 entries, the second skips 1, the third skips 2, etc., so they're all offset from each other by one. The worker threads do not synchronize at all, so each has its own local copy of the zip file's metadata and what-not. This is thread-safe because the zip file data is read-only and the workers are not sharing decompressed data.
Each worker thread reads an entry and processes it, and then skips enough entries so that they are all again offset from each other by one. So the first thread reads entries 0, 4, 8..., the second reads 1, 5, 9..., and so forth.
All the workers are pulled back in with .join().
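Put together, the whole thing looks roughly like the sketch below. The class and method names (ParallelZipRead, ZipWorker, process) are invented for illustration, and the per-entry work is just a placeholder that drains the decompressed bytes; "skipping" an entry simply means calling getNextEntry() without reading that entry's data.

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ParallelZipRead {

        // Each worker owns its own ZipInputStream over the shared, read-only byte[],
        // so the zip metadata is parsed independently and no locking is needed.
        static class ZipWorker implements Runnable {
            private final byte[] zipBytes;
            private final int offset;  // index of the first entry this worker handles
            private final int stride;  // total number of workers

            ZipWorker(byte[] zipBytes, int offset, int stride) {
                this.zipBytes = zipBytes;
                this.offset = offset;
                this.stride = stride;
            }

            @Override
            public void run() {
                try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
                    int index = 0;
                    for (ZipEntry entry; (entry = zin.getNextEntry()) != null; index++) {
                        // "Skipping" just means advancing past entries that belong to
                        // other workers; only every stride-th entry is decompressed here.
                        if (index % stride == offset) {
                            process(entry, zin);
                        }
                        zin.closeEntry();
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            // Placeholder for the real per-entry work: drain the decompressed bytes.
            private void process(ZipEntry entry, ZipInputStream zin) throws IOException {
                byte[] buf = new byte[8192];
                while (zin.read(buf) != -1) { /* consume */ }
            }
        }

        public static void main(String[] args) throws Exception {
            // Step 1: one pass of pure IO to pull the whole archive into memory.
            File file = new File(args[0]);
            byte[] zipBytes = new byte[(int) file.length()];
            try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
                in.readFully(zipBytes);
            }

            // Step 2: one worker per core (availableProcessors() counts logical cores,
            // whereas I used one thread per physical core), each offset by one entry.
            int numThreads = Runtime.getRuntime().availableProcessors();
            Thread[] threads = new Thread[numThreads];
            for (int i = 0; i < numThreads; i++) {
                threads[i] = new Thread(new ZipWorker(zipBytes, i, numThreads));
                threads[i].start();
            }

            // Step 3: pull all the workers back in.
            for (Thread t : threads) {
                t.join();
            }
        }
    }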
My times were as follows:
Reading the zip file into the byte[] with no unzipping at all (just the IO) gives an average of 0.1 sec for every iteration.
Using a straight ZipFile directly on the underlying file as normal yields an initial spike of 0.5 sec, followed by an average of 0.26 sec for each iteration thereafter (starting fresh after closing the previous ZipFile).
Reading the zip file into a byte[] and creating a ZipInputStream(ByteArrayInputStream) from it, with no multithreading at all, results in an initial spike of 0.3 sec, followed by an average of 0.26 sec for each iteration thereafter, showing that disk caching was having an effect that made the random-access and initial-read approaches equivalent.
Reading the zip file into a byte[], spawning 4 worker threads with that byte[] as described above, and waiting for them to finish brought the time back down to an average of 0.1 sec for every iteration.
So, the verdict is: by this method I have successfully brought the processing of a moderately sized zip file on a moderately powerful computer down to the time it takes to simply read the file physically, with the additional decompression step no longer noticeable at all. Obviously this same method on a huge zip file with tens of thousands of entries would still yield a massive speedup.
It seems I wasn't trying to optimize away nothing, considering I reduced the processing time for my sample file (which is around the size of the biggest one I'll need to work with) to 38% of the simple single-threaded method.
Considering how incredibly well this hack job worked, imagine the possible speedup with a native Java zip-reader class actually designed to do this, without the built-in synchronization.