Question

I'm using the Apache Commons Compress library to iterate over .tar.gz files. When I iterate over a tar file using getNextTarEntry(), can I always assume that each TarArchiveEntry is a child of the most recent entry that was a directory? I'm having trouble explaining this in plain English, so here is a code sample:

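import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

// Declared outside the loop so it persists from one entry to the next
TarArchiveEntry currentDirEntry = null;

try (FileInputStream fileInputStream = new FileInputStream(tarFile);
     GZIPInputStream gzipInputStream = new GZIPInputStream(fileInputStream);
     TarArchiveInputStream tarArchiveInputStream = new TarArchiveInputStream(gzipInputStream)) {

    TarArchiveEntry tarArchiveEntry;

    while (null != (tarArchiveEntry = tarArchiveInputStream.getNextTarEntry())) {
        if (tarArchiveEntry.isDirectory()) {
            // Remember the most recently seen directory entry
            currentDirEntry = tarArchiveEntry;
        } else {
            // Is tarArchiveEntry always a "child" of currentDirEntry?
        }
    }
}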

My problem is that I'm dealing with huge .tar.gz files (several GB, containing more than 100k files), and I don't want to parse the parent directory name for every single file (the directory names contain important information). I'd like to parse a directory name once and assume that all following entries are children of that directory. When I hit the next directory entry, the process starts over.

I can't settle this by experimenting myself, since I'm not sure what affects the entry order when a .tar.gz file is created. But since the tar format doesn't contain an index (as far as I know), it would make sense for directory entries to be listed before their contents.

Any help appreciated.

Solution

Because tar archives have no index, commons-compress cannot tell whether another file belonging to the most recently seen directory will occur later without reading the rest of the archive. Your question is therefore really about the behavior of the program that created the archive, not about the library you use to read it.

In general, there is no restriction on the order of entries in a tar file, or even on their uniqueness (later entries may overwrite earlier ones). My command-line tar packs files into the archive in the order they are passed on the command line, so I can alternate directories, as in a/foo b/bar a/baz b/quux, and that is the order in which they are stored. I might do this, for example, to keep similar files near each other in the archive, for better compression with dictionary-based (sliding window) algorithms like gzip.
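If you want to find out whether a particular archive happens to follow the grouping you are hoping for, one option is to check the sequence of entry names in a first pass. The following is only a minimal sketch: `TarOrderCheck`, `parentOf`, and `filesAreGroupedByDirectory` are hypothetical names, not part of Commons Compress, and the code assumes that directory entries have names ending in `/`. It works on plain strings such as those returned by `TarArchiveEntry.getName()`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Commons Compress.
class TarOrderCheck {

    // Returns the parent directory part of an entry name, or "" for top-level entries.
    static String parentOf(String entryName) {
        // Assumption: directory entries end with '/', so strip that first.
        String name = entryName.endsWith("/")
                ? entryName.substring(0, entryName.length() - 1)
                : entryName;
        int slash = name.lastIndexOf('/');
        return slash < 0 ? "" : name.substring(0, slash);
    }

    // True if the files of every parent directory form one uninterrupted run.
    static boolean filesAreGroupedByDirectory(List<String> entryNames) {
        Set<String> finished = new HashSet<>(); // parents whose run has already ended
        String current = null;                  // parent of the run in progress
        for (String name : entryNames) {
            if (name.endsWith("/")) {
                continue; // skip the directory entries themselves
            }
            String parent = parentOf(name);
            if (!parent.equals(current)) {
                if (current != null) {
                    finished.add(current);
                }
                if (finished.contains(parent)) {
                    return false; // this directory's files were interrupted earlier
                }
                current = parent;
            }
        }
        return true;
    }
}
```

For the alternating order described above (a/foo, b/bar, a/baz, b/quux) the check returns false, because the files under a are split by a file under b.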

You can assume that all files in a directory are listed contiguously in a tar archive only if you have specific knowledge of the archiver that created the files you are processing.
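If you lack that knowledge, a way to still parse each directory name only once is to cache the parsed result keyed by the parent path, so that the entry order no longer matters. This is a sketch under assumptions, not a definitive implementation: `DirInfoCache` and `forEntry` are hypothetical names, the parser function stands in for whatever parsing you do on the directory name, and the cache assumes the number of distinct directories is small enough to hold in memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: caches the result of parsing each parent directory name.
class DirInfoCache<T> {
    private final Map<String, T> cache = new HashMap<>();
    private final Function<String, T> parser; // your own directory-name parsing logic

    DirInfoCache(Function<String, T> parser) {
        this.parser = parser;
    }

    // Returns the parsed info for the parent directory of a file entry name.
    // Intended for file entries only; top-level files map to the parent "".
    T forEntry(String entryName) {
        int slash = entryName.lastIndexOf('/');
        String parent = slash < 0 ? "" : entryName.substring(0, slash);
        // The parser runs only the first time a given parent path is seen.
        return cache.computeIfAbsent(parent, parser);
    }
}
```

Inside the loop from the question, the non-directory branch could then call something like `cache.forEntry(tarArchiveEntry.getName())`. Each file still costs one substring and one hash lookup, but the potentially expensive parsing runs once per directory, whether or not the archive keeps a directory's files together.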

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow