Python tarfile size

https://stackoverflow.com/questions/10028435

29-05-2021
Question

I can calculate the size of the files in a tarfile in this way:

import tarfile

# Sum the uncompressed sizes of all members in the archive
with tarfile.open(name='my.tgz', mode='r') as tf:
    total = sum(member.size for member in tf.getmembers())

However, the total returned is the sum of the uncompressed member sizes, not the compressed file size (at least, that is what my tests suggest). Is there a way to get the compressed size of the whole tar file without resorting to something like os.path.getsize?

Solution

No.

A tar.gz file is a plain tar archive that has been compressed as a whole with gzip; to read it, the file is piped through gzip to recover the plain tar archive. tar(1) itself never sees the compressed data, so it has no notion of compressed sizes[*].
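To make this concrete, here is a minimal sketch that builds a small gzip-compressed tar archive in a temporary directory and compares the sum of the member sizes with the size of the file on disk. The helper name archive_sizes, the file names, and the sample payload are made up for illustration. It still asks the operating system for the file size (via os.fstat on the open handle), because the compressed size belongs to the outer file and not to anything tarfile reports per member.

```python
import os
import tarfile
import tempfile


def archive_sizes(path):
    """Return (sum of uncompressed member sizes, size of the archive file on disk)."""
    # Open the underlying file ourselves so the OS can report its size
    # through the same handle that tarfile reads from.
    with open(path, 'rb') as raw:
        compressed = os.fstat(raw.fileno()).st_size
        with tarfile.open(fileobj=raw, mode='r:gz') as tf:
            uncompressed = sum(m.size for m in tf.getmembers())
    return uncompressed, compressed


# Build a throwaway archive with highly compressible content (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    payload = os.path.join(tmp, 'payload.txt')
    with open(payload, 'wb') as f:
        f.write(b'a' * 100_000)
    archive = os.path.join(tmp, 'my.tgz')
    with tarfile.open(archive, 'w:gz') as tf:
        tf.add(payload, arcname='payload.txt')
    uncompressed, compressed = archive_sizes(archive)
```

Here uncompressed is the payload size as recorded in the tar header, while compressed is whatever gzip produced for the whole stream, headers and padding included.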

This is unlike archive formats such as ZIP, which compress each member themselves and record its compressed size.
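For contrast, a short sketch with the standard zipfile module, which exposes both sizes for every member through ZipInfo.file_size and ZipInfo.compress_size. The member name and payload are invented for the example, and the archive is kept in memory only to keep the snippet self-contained.

```python
import io
import zipfile

# Write a single deflate-compressed member into an in-memory ZIP archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('payload.txt', b'a' * 100_000)

# Read it back: the ZIP directory stores both sizes per member.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo('payload.txt')
    sizes = (info.file_size, info.compress_size)  # (uncompressed, compressed)
```

tarfile.TarInfo has no counterpart to compress_size, because compression happens outside the tar format.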

The advantage of the tar approach is that you can use any compression you like. If a better compressor comes along, you can easily repack your archives. Also, since everything is put into one big stream of data, the compression ratio is usually somewhat better, and metadata such as file names is compressed as well.
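As a rough illustration of that flexibility, the sketch below packs the same data with several of the write modes that Python's tarfile module accepts ('w' for no compression, 'w:gz', 'w:bz2', 'w:xz'). The pack helper, the member name, and the sample data are hypothetical, and the bz2 and xz modes assume the interpreter was built with those modules.

```python
import io
import tarfile


def pack(data, mode):
    """Write one in-memory member named 'payload.txt' using the given tarfile write mode."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tf:
        info = tarfile.TarInfo(name='payload.txt')
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


data = b'hello tar\n' * 10_000
# Same tar content, different outer compression.
archives = {mode: pack(data, mode) for mode in ('w', 'w:gz', 'w:bz2', 'w:xz')}
```

Each entry of archives contains the same tar stream once decompressed; only the outer wrapper differs, which is why repacking with a different compressor is straightforward.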

The disadvantage is that you cannot jump straight to an individual item: the compressed stream has to be read and decompressed from the start until the item you want is reached.

[*]: The first implementations of tar(1) had no -z option; it was added later when people started to use gzip a lot. In the early days, the standard approach was to run the archive through compress(1), producing tar.Z files.

Licensed under: CC-BY-SA with attribution