Cache directory structure
03-07-2019
Question
I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. Another example: for storing files, say our file is named IMG_PARTY.JPG, a common way is to put it in a directory named:
files/i/m/IMG_PARTY.JPG
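The fan-out scheme above can be sketched in shell; the `files` root and two-level depth match the example, everything else is illustrative:

```shell
# Build a fan-out path from the first two characters of the
# filename, lowercased: IMG_PARTY.JPG -> files/i/m/IMG_PARTY.JPG
f=IMG_PARTY.JPG
first=$(printf '%s' "$f" | cut -c1 | tr '[:upper:]' '[:lower:]')
second=$(printf '%s' "$f" | cut -c2 | tr '[:upper:]' '[:lower:]')
path="files/$first/$second/$f"
echo "$path"    # files/i/m/IMG_PARTY.JPG
```

With 26 letters (plus digits) per level, two levels already spread files across hundreds of buckets, keeping each directory small.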
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems doing linear lookups find files faster when there are fewer of them in a directory; such a structure spreads files thin.
To not mess up *nix utilities like rm, which take a finite number of arguments - deleting a large number of files at once tends to be hacky (having to pass them through find, etc.)
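To make that limit concrete: `rm cache/*` relies on shell glob expansion, which can fail with "Argument list too long" once the expanded list exceeds the kernel's ARG_MAX, whereas `find -delete` unlinks entries itself and never builds a giant argument list. A small sketch (the demo_cache directory and file names are made up for the illustration):

```shell
# Create a throwaway directory with a few files, then delete them
# without ever passing the full file list on a command line.
mkdir -p demo_cache
touch demo_cache/a.tmp demo_cache/b.tmp demo_cache/c.tmp
find demo_cache -type f -delete   # scales to millions of entries
rmdir demo_cache                  # succeeds only because it is now empty
```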
What's the real reason? What is a "good" cache directory structure and why?
Solution
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
OTHER TIPS
Just use dates, since you will remove by date. :)
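A minimal sketch of that date-bucketed layout, assuming expiry is done by age with find (the cache path, file name, and 7-day cutoff are hypothetical):

```shell
# Store today's entries under cache/YYYY/MM/DD so that expiring
# old entries is just deleting whole day directories.
dir="cache/$(date +%Y/%m/%d)"
mkdir -p "$dir"
touch "$dir/result.bin"

# Expire: remove day directories (depth 3) older than 7 days.
find cache -mindepth 3 -maxdepth 3 -type d -mtime +7 -exec rm -r {} +
```

The appeal is that expiry never has to stat individual cache entries; it removes entire day buckets in one go.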
If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear directory structures.
So even if the FS can cope with incredibly large directory sizes, there are good reasons not to have large flat structures (they're also a pig to back up).
I've benchmarked GFS2 (clustered) with 32,000 files, either in a single directory or arranged in a tree structure - recursive listings of the tree were around 300 times faster than listing the flat structure (which could take up to 10 minutes).
EXT4 showed similar ratios, but as the end point was only a couple of seconds, most people wouldn't notice.