Filesystem seek performance with lots of tiny files

https://stackoverflow.com/questions/432603

10-07-2019

Question

I'm looking to build a server with lots of tiny files delivered by an XML API. It won't be doing a whole lot of iterating over directories or blocks of sequential files--we're talking lots and lots of seeks for discontinuous data.

Will seek time on BSD UFS degrade over time for requests for individual files? I understand that the filesystem's inode limit is based on the size of the partition/slice, but the hard drive has to step through the inode table for every file request before it can discover the location of the data. What filesystem yields the best performance for seek time?

The alternative is to set up 2-4GB "blob" files and have a separate system, within the software, for seeking to a file contained in them. The software's "inode table" could be optimized for delivery based on the currently logged-in user, etc. These "inode tables" would likely be cached in RAM and would only cover the users currently logged in, so that fewer resources are wasted.
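To make the "blob" idea concrete, here is a minimal sketch, assuming an append-only blob file and an index held entirely in RAM. The class name `BlobStore` and its methods are hypothetical, invented for illustration. The index is not persisted here, so a real system would also need to write it out or rebuild it on startup.

```python
import os
from typing import Dict, Tuple


class BlobStore:
    """Append-only blob file plus an in-memory index (hypothetical sketch)."""

    def __init__(self, path: str) -> None:
        # "a+b" creates the file if needed; every write lands at the end.
        self._file = open(path, "a+b")
        # name -> (offset, length); plays the role of the software "inode table".
        self._index: Dict[str, Tuple[int, int]] = {}

    def put(self, name: str, data: bytes) -> None:
        # Record where the payload starts before appending it.
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(data)
        self._file.flush()
        self._index[name] = (offset, len(data))

    def get(self, name: str) -> bytes:
        # One dictionary lookup, then a single seek and read in the blob.
        offset, length = self._index[name]
        self._file.seek(offset)
        return self._file.read(length)

    def close(self) -> None:
        self._file.close()
```

A lookup then costs one in-memory dictionary access plus one seek in the blob, instead of a directory lookup and an inode read per file. The trade-off is that deletion, compaction, and crash recovery become the application's job.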

Where do these two solutions rate on a scalability and maintenance standpoint? What sort of performance gains, if any, could I expect by using the second solution?

Solution

The most obvious and time-proven mitigation technique is to use a good hierarchical design for directories (and pathname search strategies), and have more directories with fewer files in each.
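One common way to get "more directories with fewer files in each" is to derive the directory path from a hash of the file name. The sketch below assumes two directory levels of two hex characters each (up to 65,536 leaf directories). The function names `sharded_path` and `store` are hypothetical, and the choice of SHA-256 is an assumption; any evenly distributed hash would do.

```python
import hashlib
from pathlib import Path


def sharded_path(root: Path, name: str, levels: int = 2, width: int = 2) -> Path:
    """Map a file name to root/ab/cd/name, using a hex digest of the name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # Slice the digest into `levels` chunks of `width` hex characters each.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, name)


def store(root: Path, name: str, data: bytes) -> Path:
    """Write data under its sharded path, creating directories as needed."""
    path = sharded_path(root, name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

Because the path is computed from the name alone, a reader can find any file without listing a directory. With ten million files spread evenly, each leaf directory would hold roughly 150 entries.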

OTHER TIPS

For recent FreeBSD versions with dirhash and softupdates, I have seen no problems with a few tens of thousands of files per directory. You probably don't want to go north of 500,000 files or so. For example, deleting a directory with 2,500,000 files took me three days.

I'm not sure I understand your question correctly, but if you want to seek over lots of files, why not use a partitioned MySQL table laid out on a RAID0 or VFS filesystem?

Edit: as far as I know, lots of files in one folder will degrade the speed of any FS, as it has to maintain bigger lists of files, permissions and names. A database is designed to keep lists of data in memory and to seek through them in a very optimized way.

More details of your situation would be helpful. Do the files already exist, or would they be created by your application? If you need a way to store arbitrary data without the structure of a relational database, have you looked at object databases?

Another option, if your objects should or can be accessed via HTTP, is to use a Varnish cache in front of a small web server. Initially objects would be stored on disk, but Varnish would store and serve objects from memory after the first access to a given object.
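The serve-from-memory-after-first-access behaviour can be illustrated in-process. The sketch below is not Varnish and does not use its configuration or API. It is a hypothetical `ReadThroughCache` class with a simple least-recently-used eviction policy, assumed here only to show the idea.

```python
from collections import OrderedDict
from pathlib import Path


class ReadThroughCache:
    """Serve files from memory after the first disk read (illustrative only)."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._max = max_entries
        self.disk_reads = 0  # counts cache misses, for inspection

    def get(self, path: str) -> bytes:
        if path in self._entries:
            # Hit: mark as most recently used and skip the disk entirely.
            self._entries.move_to_end(path)
            return self._entries[path]
        # Miss: read from disk once, then keep the bytes in memory.
        data = Path(path).read_bytes()
        self.disk_reads += 1
        self._entries[path] = data
        if len(self._entries) > self._max:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return data
```

Repeated requests for a hot object then never touch the filesystem, so seek time only matters for the first access and for objects that have been evicted.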

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow