What is the fastest / easiest way to count a large number of files in a directory (in Linux)?

https://stackoverflow.com/questions/6083006

08-09-2020

Question

I had a directory with a large number of files. Every time I tried to list the files in it, the listing either failed or came back only after a significant delay. I tried the ls command on the Linux command line, and the web interface from my hosting provider did not help either.

The problem is that when I just run ls, it takes a significant amount of time to even start displaying anything. Thus, ls | wc -l would not help either.

After some research I came up with this code (in this example it counts the number of new emails on some server):

from os import walk
print(sum(len(files) for (root, dirs, files) in walk('/home/myname/Maildir/new')))

The above code is written in Python. I used Python's command-line tool and it worked pretty fast (it returned the result instantly).

I am interested in the answer to the following question: is it possible to count files in a directory (without subdirectories) faster? What is the fastest way to do that?

Solution

ls does a stat(2) call for every file. Other tools, like find(1) and the shell wildcard expansion, may avoid this call and just do readdir. One shell command combination that might work is find dir -maxdepth 1 | wc -l, but it also counts the directory itself and miscounts any filename with a newline in it.
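
Both caveats of the find | wc -l pipeline (the directory itself is listed, and a newline inside a filename adds an extra line) can be sidestepped by asking find for NUL-terminated names and counting the terminators. This is only a minimal sketch, assuming a find that supports -mindepth, -maxdepth and -print0 (GNU and BSD find do); the function name count_entries_with_find is made up for illustration.

```python
import subprocess

def count_entries_with_find(directory):
    """Count entries directly inside `directory` using find(1).

    -mindepth 1 leaves out the directory itself; -print0 ends every name
    with a NUL byte, which cannot occur inside a filename, so names that
    contain newlines are still counted exactly once.
    """
    result = subprocess.run(
        ["find", directory, "-mindepth", "1", "-maxdepth", "1", "-print0"],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if find reports a failure
    )
    return result.stdout.count(b"\0")
```

Note that this buffers all names in memory before counting, so it trades a little memory for correctness.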

From Python, the straightforward way to get just these names is os.listdir(directory). Unlike os.walk and os.path.walk, it does not need to recurse, check file types, or make further Python function calls.
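
As a rough illustration of the os.listdir approach, the sketch below counts every name in one directory without recursing. The helper name count_entries is hypothetical.

```python
import os

def count_entries(directory):
    """Return the number of names in `directory` (files, subdirectories,
    symlinks and hidden entries alike), without descending into it."""
    # os.listdir only reads the directory; it never returns the special
    # entries "." and "..", so no adjustment is needed.
    return len(os.listdir(directory))
```

Calling count_entries('/home/myname/Maildir/new') would then cover the mail example from the question.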

Addendum: It seems ls doesn't always stat. At least on my GNU system, it can do only a getdents call when further information (such as which names are directories) is not requested. getdents is the underlying system call used to implement readdir in GNU/Linux.

Addendum 2: One reason for a delay before ls outputs results is that it sorts and tabulates. ls -U1 may avoid this.
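
If only regular files should be counted, Python 3.6+ offers os.scandir, which yields entries unsorted and exposes the file-type information that the directory read already returns. The following is a sketch under that assumption; the function name count_regular_files is invented here, and on filesystems that do not report the type in the directory entry Python has to fall back to a stat call per file.

```python
import os

def count_regular_files(directory):
    """Count regular files directly inside `directory`, skipping
    subdirectories and symlinks, without building a list of names."""
    count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # is_file() can usually answer from the directory entry itself;
            # follow_symlinks=False means symlinks are not counted as files.
            if entry.is_file(follow_symlinks=False):
                count += 1
    return count
```

Hidden files are included in this count; filtering on entry.name would be the place to change that.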

Other tips

Total number of files in the given directory

find . -maxdepth 1 -type f | wc -l

Total number of files in the given directory and all subdirectories under it

find . -type f | wc -l

For more details, drop into a terminal and run man find.

This should be pretty fast in Python:

from os import listdir
from os.path import isfile, join
directory = '/home/myname/Maildir/new'
print(sum(1 for entry in listdir(directory) if isfile(join(directory, entry))))

I think ls is spending most of its time before displaying the first line because it has to sort the entries, so ls -U should display the first line much faster (though it may not be that much better in total).

The fastest way would be to avoid all the overhead of interpreted languages and write some code that directly addresses your problem. Doing so is difficult to do in a portable way, but pretty straightforward. At the moment I'm on an OS X box, but converting the following to Linux should be extremely straightforward. (I opted to ignore hidden files and only count regular files...modify as necessary or add command line switches to get the functionality you want.)

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* Count the non-hidden regular files directly inside the given directory. */
int
main( int argc, char **argv )
{
    DIR *d;
    struct dirent *f;
    int count = 0;
    char *path = argv[ 1 ];

    if( path == NULL ) {
        fprintf( stderr, "usage: %s path\n", argv[ 0 ]);
        exit( EXIT_FAILURE );
    }
    d = opendir( path );
    if( d == NULL ) { perror( path );exit( EXIT_FAILURE ); }
    while( ( f = readdir( d ) ) != NULL ) {
        /* d_type comes from the directory entry, so no stat() is needed;
           some filesystems report DT_UNKNOWN, and such entries are skipped. */
        if( f->d_name[ 0 ] != '.'  &&  f->d_type == DT_REG )
            count += 1;
    }
    closedir( d );
    printf( "%d\n", count );
    return EXIT_SUCCESS;
}

I'm not sure about speed, but if you want to just use shell builtins this should work:

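#!/bin/sh
# Counts non-hidden entries (files and directories) matched by the glob.
COUNT=0
for file in /path/to/directory/*
do
    # In an empty directory the pattern stays unexpanded; skip that case.
    [ -e "$file" ] || [ -L "$file" ] || continue
    COUNT=$((COUNT+1))
done
echo "$COUNT"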

My use case is a Linux SBC (Banana Pi) counting files in a directory on a FAT32 USB stick. In a shell, doing

ls -U {dir} | wc -l

takes 6.4 s with 32k files in there (32k = max files/dir on FAT32). From Python, with d set to the same directory, doing

import os, time
t=time.time() ; print(len(os.listdir(d))) ; print(time.time()-t)

takes only 0.874 s(!). I can't see anything else in Python being quicker than that.
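
To repeat this kind of measurement on another machine, a small timing helper along these lines may be enough. It is only a sketch: time_count is a made-up name, the numbers depend heavily on the filesystem and on whether the directory is already cached, and nothing here reproduces the figures above.

```python
import os
import time

def time_count(directory):
    """Return (entry_count, elapsed_seconds) for one os.listdir call."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    count = len(os.listdir(directory))
    elapsed = time.perf_counter() - start
    return count, elapsed
```

A call such as count, seconds = time_count(d) gives both values at once; running it twice in a row shows how much of the first run was spent reading the directory from the device rather than from the cache.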

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow