Question

What is a good limit to use on the quantity of files in a directory, and why?

EDIT: Why shouldn't someone create a system that puts hundreds of thousands of files in the same directory?

Why I ask:

Someone set up a system that dumps files into a folder, each named with a human-readable date-time.

My task is to create a system that retrieves the files for a selected time period.

Normally this wouldn't be a problem, but the folder has 500,000 files in it and is still growing, and my system is expected to retrieve them in real time.

Parsing 500,000 files takes too long, so I think it is the responsibility of the person who built the system that puts the files on the FTP server to create a directory structure, such as a subfolder for each day.


Solution

It depends upon the actual system (operating system, computer hardware, file system). Some [old Linux] file systems behaved badly (access time linear in the number of directory entries). So it is generally preferable to keep directories small, a few thousand entries at most (this also keeps the shell happier: you might want to ls the directory). It is usually inefficient to have millions of very small files (e.g. a hundred bytes each), because each file typically occupies at least one file-system block.

So I would suggest having files like dir001/file001.txt .... dir123/file345.txt ....
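For illustration, a small helper along those lines might look like the following (a minimal Python sketch; the bucket size of 1000, the dirNNN naming and the example paths are assumptions, not something the original system prescribes):

# Spread incoming files across numbered subdirectories so that no single
# directory grows past a few thousand entries.
import os

BUCKET_SIZE = 1000          # assumed cap per directory

def bucketed_path(root, file_index, filename):
    # Files 0..999 land in dir000, 1000..1999 in dir001, and so on.
    bucket = "dir%03d" % (file_index // BUCKET_SIZE)
    directory = os.path.join(root, bucket)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

# Usage: the 123456th file ends up under root/dir123/
path = bucketed_path("/data/files", 123456, "file456.txt")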

Consider also using some other way of storing the data: an indexed data file such as GDBM, an SQLite, PostgreSQL or MongoDB database, etc. You might even mix the approaches: use SQLite for some metadata about your files, and keep the files themselves in many directories. You might also take a segregated approach: handle small and large contents differently (put the small contents in SQLite or GDBM, and the large contents in files).
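If you went the mixed route, it could look roughly like this (a minimal Python/SQLite sketch; the file_index table, its columns and the database filename are invented for the example):

# SQLite holds the metadata (name, timestamp, relative path),
# while the payloads stay as ordinary files spread over many directories.
import sqlite3

conn = sqlite3.connect("files.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_index (
        name     TEXT PRIMARY KEY,
        created  TEXT,            -- ISO-8601 timestamp
        rel_path TEXT             -- e.g. 'dir123/file345.txt'
    )
""")

def register(name, created, rel_path):
    # Record where a newly received file was stored.
    conn.execute("INSERT OR REPLACE INTO file_index VALUES (?, ?, ?)",
                 (name, created, rel_path))
    conn.commit()

def files_between(start, end):
    # Return the relative paths of files created in [start, end],
    # without ever listing the directories themselves.
    cur = conn.execute(
        "SELECT rel_path FROM file_index WHERE created BETWEEN ? AND ?",
        (start, end))
    return [row[0] for row in cur]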

Other tips

Stock answer: "It depends".

Trying to do anything of any size or complexity using the file system as a "database" is going to be tricky, and trying to do it "in real time" is going to be downright difficult.
The problem is that you tend to have lots of files, which means that you have to operate on "a" file many, many times. Doing "something" once takes an amount of time; doing that "something" many times takes many times longer; it doesn't scale well.

That said, you can offset this overhead by organising your data sensibly. You say you're working with date ranges, so arrange your files into a directory structure that supports this; an obvious choice might be:

root/
  2014/
    01/ 
      01/
        2014-01-01-00-00-00.dat 
        2014-01-01-00-00-01.dat 
      02/
        2014-01-02-00-00-00.dat 
        2014-01-02-00-00-01.dat 

Now, retrieving a day's worth of files is relatively straightforward (and, therefore, quick).
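A retrieval routine then only ever has to list one small directory per day (a Python sketch assuming the root/YYYY/MM/DD layout and the .dat extension shown above):

# Fetch one day's files from the date-based layout; only that day's small
# directory is read, never the full half-million files.
import os
from datetime import date

def files_for_day(root, day):
    directory = os.path.join(root, "%04d" % day.year,
                             "%02d" % day.month, "%02d" % day.day)
    if not os.path.isdir(directory):
        return []
    return sorted(os.path.join(directory, f)
                  for f in os.listdir(directory) if f.endswith(".dat"))

# Usage:
print(files_for_day("root", date(2014, 1, 1)))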

Of course, depending on what you're doing, loading the whole thing into a database might be a better solution ... YMMV

Licensed under: CC-BY-SA with attribution