When executing "tar" on a directory with over a billion files, the process stayed in D status

StackOverflow https://stackoverflow.com/questions/14068720

Question

I was doing some experiments to learn more about Linux process states.

So, there's a directory (named big_dir) with over a billion files in it (it contains many nested sub-directories). I ran tar -cv big_dir | ssh anotherServer "tar -xv -C big_dir" and saw in top that the tar process stays in the D state. Meanwhile, the tar command keeps printing the paths of the files.

I know the process was in the D state because it was doing disk I/O, but why didn't its state keep switching between D and R? Printing the file names under the directory must have used some CPU time, mustn't it? Otherwise how could the tar command know that it should print something?

If I run dd if=/dev/zero of=/dev/null, top shows the dd process staying in the R state. But why wasn't it in the D state? Wasn't it doing I/O all the time?


Solution

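/dev/zero and /dev/null are pseudo-devices, so there's no physical device behind them. Reads and writes on them complete inside the kernel without waiting on hardware, so dd never has to block in uninterruptible sleep (D).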

If I do

dd if=/dev/zero of=/tmp/zeroes

then top does show me dd in the D state. However, it also spends a lot of its time in R (on the CPU). top simply samples the process table, so you may need to watch it for some time in order to see transient states.
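To catch those transient states more reliably than by watching top, you can sample the process yourself at a much higher rate. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function names, the sample count, and the 1 ms interval are illustrative choices, not anything top itself does.

```python
import time
from collections import Counter


def parse_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split after the last ')' rather than on whitespace.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def sample_states(pid: int, samples: int = 1000, interval: float = 0.001) -> Counter:
    """Poll /proc/<pid>/stat and count how often each state is observed."""
    counts = Counter()
    for _ in range(samples):
        try:
            with open(f"/proc/{pid}/stat") as f:
                counts[parse_state(f.read())] += 1
        except FileNotFoundError:
            # The process exited while we were sampling.
            break
        time.sleep(interval)
    return counts
```

Calling sample_states with the PID of the running dd and printing the result gives a tally of the states seen across all samples, which makes a process that is mostly in D but occasionally in R much easier to spot than a single refresh of top does.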

I suspect that for your tar example above, the time spent writing to stdout is negligible compared to the disk time. Note also that writing to stdout involves the windowing system drawing the output, and while it's doing that the process will be sleeping. For example, I'm running yes right now, and the majority of the work is being performed by my X server. The yes process is sleeping for most of the time I'm watching it (via top).

Other answers

I'm sure your tar process sometimes goes to R, but probably only for a very short time, because it doesn't do that much, particularly since you are sending the data over a network. Unless that's a 10Gb/s network card (and everything else on the way to "anotherServer" really sustains 1GB/s), the network will be the slowest part of the chain. ssh itself adds a little overhead as it encrypts the data.

It probably takes tar a few microseconds to ask the disk for some data, and a few milliseconds for the disk to move its head and read the actual data. So tar spends about 0.1% of its time in "R" and the rest in "D".
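The 0.1% figure can be checked with a quick back-of-the-envelope calculation. This sketch assumes 5 microseconds of CPU work and 5 milliseconds of disk wait per request, illustrative stand-ins for "a few" that were not measured, and it treats each observation by a tool like top as an independent sample.

```python
def runnable_fraction(cpu_seconds: float, wait_seconds: float) -> float:
    """Fraction of wall-clock time the process spends runnable (R)."""
    return cpu_seconds / (cpu_seconds + wait_seconds)


def chance_of_seeing_r(fraction: float, samples: int) -> float:
    """Probability that at least one of `samples` independent looks finds R."""
    return 1.0 - (1.0 - fraction) ** samples


# Assumed per-request costs: 5 us on the CPU, 5 ms waiting for the disk.
frac = runnable_fraction(5e-6, 5e-3)
print(f"time in R: {frac:.4%}")
print(f"chance of seeing R in 100 samples: {chance_of_seeing_r(frac, 100):.1%}")
```

Under these assumptions, even a hundred refreshes of top would usually show nothing but D, which matches what the question describes.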

Licensed under: CC-BY-SA with attribution