Question

I have a directory I’m archiving:

$ du -sh oldcode
1400848
$ tar cf oldcode.tar oldcode

So the directory is about 1.4 GB. The tar file is significantly smaller, though:

$ ls -l oldcode.tar
-rw-r--r-- 1 ieure ieure 940339200 2002-01-30 10:33 oldcode.tar

Only 897 MB. It’s not compressed in any way:

$ file oldcode.tar
oldcode.tar: POSIX tar archive

Why is the tar file smaller than its contents?


Solution

You get a difference because of the way the filesystem works.

In a nutshell, your disk is divided into clusters (also called blocks). Each cluster has a fixed size of, let's say, 4 kilobytes. If you store a 1 KB file in such a cluster, the remaining 3 KB go unused. The exact details vary with the kind of filesystem you use, but most filesystems work that way.

3 KB of wasted space is not much for a single file, but if you have lots of very small files, the waste can become a significant part of the disk usage.

Inside the tar archive, the files are not stored in clusters but one after another, each padded only to tar's much smaller 512-byte records. That's where the difference comes from.
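
To put rough numbers on this, here is a small arithmetic sketch. It assumes a 4 KiB filesystem block size and the standard tar layout of one 512-byte header per file with the data padded to a multiple of 512 bytes. The file count and file size are made up, and the calculation ignores directory entries and tar's end-of-archive padding.

```shell
#!/bin/sh
# Round a byte count up to the next multiple of a block size.
# Usage: round_up BYTES BLOCK_SIZE
round_up() {
    echo $(( ( $1 + $2 - 1 ) / $2 * $2 ))
}

# Hypothetical workload: 1000 files of 1000 bytes each.
files=1000
size=1000

logical=$(( files * size ))
# Assumed 4 KiB filesystem blocks: each file occupies at least one block.
on_disk=$(( files * $(round_up "$size" 4096) ))
# In a tar archive: 512-byte header plus data padded to 512 bytes.
in_tar=$(( files * ( 512 + $(round_up "$size" 512) ) ))

echo "logical bytes:   $logical"
echo "on disk (4 KiB): $on_disk"
echo "inside tar:      $in_tar"
```

Under these assumptions the files take about 4.1 MB on disk but only about 1.5 MB inside the archive, even though nothing is compressed.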

OTHER TIPS

Without knowing which tar or which sort of Unix system you're using, here's my guess: oldcode contains numerous small files, which on their own use disk space inefficiently, since disk space is allocated in blocks rather than byte by byte. In the tar file, they're concatenated and make much better use of the disk space they're assigned.

This has something to do with the block size of your filesystem. man 1 du on Mac OS X 10.5.6 states:

The du utility displays the file system block usage for each file argument and for each directory in the file hierarchy rooted in each directory argument. If no file is specified, the block usage of the hierarchy rooted in the current directory is displayed.

[mirko@borg foo]$ ls -la
total 0
drwxr-xr-x   2 mirko  wheel   68 Jan 30 21:20 .
drwxrwxrwt  10 root   wheel  340 Jan 30 21:16 ..
[mirko@borg foo]$ du -sh
0B  .
[mirko@borg foo]$ touch foo
[mirko@borg foo]$ ls -la
total 0
drwxr-xr-x   3 mirko  wheel  102 Jan 30 21:20 .
drwxrwxrwt  10 root   wheel  340 Jan 30 21:16 ..
-rw-r--r--   1 mirko  wheel    0 Jan 30 21:20 foo
[mirko@borg foo]$ du -sh
0B  .
[mirko@borg foo]$ echo 1 > foo
[mirko@borg foo]$ ls -la
total 8
drwxr-xr-x   3 mirko  wheel  102 Jan 30 21:20 .
drwxrwxrwt  10 root   wheel  340 Jan 30 21:16 ..
-rw-r--r--   1 mirko  wheel    2 Jan 30 21:20 foo
[mirko@borg foo]$ du -sh
4.0K    .

As you can see, even a 2-byte file takes up a whole 4 KB block. Some filesystems avoid this waste of space by means of block suballocation.

There are two possibilities.

Small files

Most likely, it isn't smaller than its contents. As Nils Pipenbrinck wrote, du displays the amount of space the filesystem allocates, which, since files are stored in filesystem blocks, is more than the logical size of the files.

To view the logical size of the file, use du --apparent-size. In this case, the result should be smaller than the tar file.
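
One way to check this on a throwaway directory is sketched below. It assumes GNU du (for --apparent-size and -B1) and a tar on the PATH. The directory name smallfiles and the file names are made up for the demonstration, and the allocated figure depends on your filesystem.

```shell
#!/bin/sh
# Assumes GNU du and a tar on PATH; all names here are hypothetical.
work=$(mktemp -d)
mkdir "$work/smallfiles"

# Create 200 files of 100 bytes each.
i=0
while [ "$i" -lt 200 ]; do
    head -c 100 /dev/zero > "$work/smallfiles/f$i"
    i=$(( i + 1 ))
done

tar cf "$work/smallfiles.tar" -C "$work" smallfiles

# Space the filesystem allocated, in bytes.
allocated=$(du -s -B1 "$work/smallfiles" | cut -f1)
# Logical size of the contents, in bytes.
apparent=$(du -s -B1 --apparent-size "$work/smallfiles" | cut -f1)
# Size of the archive, in bytes.
tar_size=$(wc -c < "$work/smallfiles.tar" | tr -d ' ')

echo "allocated: $allocated"
echo "apparent:  $apparent"
echo "tar file:  $tar_size"

rm -rf "$work"
```

On a filesystem with 4 KiB blocks, the allocated figure would typically be the largest of the three and the apparent size the smallest, with the tar file in between.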

Sparse files

Tar files can store sparse files. If the tarball was created using --sparse, the holes in the sparse files will be recorded, so the tarball could be smaller than the logical size of the files.

If the sparseness information in your extracted copy was somehow lost (e.g. if you extracted the tarball onto a filesystem that doesn't support sparse files, or if it was zipped and then unzipped), then du will report the expanded size.
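
The sparse-file case can be reproduced with a short experiment. This is only a sketch: it assumes GNU coreutils (truncate, du --apparent-size) and GNU tar (--sparse), the file name hole.bin is made up, and the allocated size reported depends on whether the filesystem supports holes.

```shell
#!/bin/sh
# Assumes GNU coreutils and GNU tar; hole.bin is a hypothetical file name.
work=$(mktemp -d)

# Create a 10 MiB file that consists of a single hole (no data written).
truncate -s 10M "$work/hole.bin"

apparent=$(du -B1 --apparent-size "$work/hole.bin" | cut -f1)
allocated=$(du -B1 "$work/hole.bin" | cut -f1)

# Archive it twice: once plainly, once recording the holes.
tar -c -f "$work/plain.tar" -C "$work" hole.bin
tar -c --sparse -f "$work/sparse.tar" -C "$work" hole.bin

plain_size=$(wc -c < "$work/plain.tar" | tr -d ' ')
sparse_size=$(wc -c < "$work/sparse.tar" | tr -d ' ')

echo "apparent size:  $apparent"
echo "allocated size: $allocated"
echo "plain tar:      $plain_size"
echo "sparse tar:     $sparse_size"

rm -rf "$work"
```

The archive made with --sparse should come out far smaller than the file's logical size, while the plain archive stores all 10 MiB of zeros.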

du counts disk blocks, not file size duder.

Licensed under: CC-BY-SA with attribution