Question

I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:

cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...

You get the idea. Another example, for storing files: say our file is named IMG_PARTY.JPG. A common way is to put it in a directory named after its first letters:

files/i/m/IMG_PARTY.JPG
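As a minimal sketch of that layout, the path can be derived from the first characters of the file name. The function name `sharded_path`, the `depth` parameter, and the choice to lowercase the prefix are assumptions made for illustration, not part of any particular tool:

```python
from pathlib import PurePosixPath


def sharded_path(root: str, filename: str, depth: int = 2) -> PurePosixPath:
    """Return root/<c1>/<c2>/.../filename, one directory level per leading character.

    Assumption: the prefix is lowercased so that IMG_... and img_... share a shard.
    """
    if len(filename) < depth:
        raise ValueError(f"file name needs at least {depth} characters")
    prefix = filename[:depth].lower()
    # Each leading character becomes its own directory level.
    return PurePosixPath(root, *prefix, filename)


print(sharded_path("files", "IMG_PARTY.JPG"))  # files/i/m/IMG_PARTY.JPG
```

With `depth=1` the same call would give `files/i/IMG_PARTY.JPG`; how many levels are worth having depends on how many files are expected.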

Some thoughts come to mind, but I'd like to know the real reasons for this.

  • Filesystems doing linear lookups find files faster when there are fewer of them in a directory. Such a structure spreads files thin.

  • To avoid tripping up *nix utilities like rm: the argument list a command can receive is limited in size, so deleting a large number of files at once tends to be hacky (having to pass them through find, etc.)
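For the second point, one way around the argument-list limit is to delete from inside a program instead of expanding a shell glob. This is only a sketch; the function name `remove_files` is made up for the example, and it handles only regular files directly inside one directory:

```python
import os


def remove_files(directory: str) -> int:
    """Delete every regular file directly inside `directory`; return the count.

    No command line is built, so the argument-list limit never applies.
    """
    # Collect the paths first so the directory is not modified while it is being read.
    with os.scandir(directory) as entries:
        paths = [e.path for e in entries if e.is_file(follow_symlinks=False)]
    for path in paths:
        os.unlink(path)
    return len(paths)
```

Subdirectories and symlinks are left alone here; a real cache cleaner would probably also want to handle files that disappear between listing and deletion.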

What's the real reason? What is a "good" cache directory structure and why?

Solution

Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.

However, even today, with B-tree-based directories, a very large directory is hard to deal with: it takes forever and a day just to get a listing of all the files, never mind finding the right one.
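A common variation, which the answer above does not prescribe, is to shard on a hash of the cache key rather than on the file name, so that entries spread evenly no matter how the names are distributed. The sketch below assumes SHA-256 and two levels of two hex characters each; `cache_path`, `levels` and `width` are hypothetical names:

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(root: str, key: str, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Map a cache key to root/<h0h1>/<h2h3>/<full hex digest>."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Consecutive slices of the digest become the directory levels.
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *shards, digest)
```

With these defaults there are 256 directories per level, so even a few million entries leave each leaf directory with only a modest number of files.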

OTHER TIPS

Just use dates for the directory names, since you will remove entries by date anyway. :)
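A rough sketch of a date-based layout, assuming one directory per day under year and month levels. The names `dated_path` and `is_expired` and the `max_age_days` parameter are invented for the example:

```python
from datetime import date
from pathlib import PurePosixPath


def dated_path(root: str, filename: str, day: date) -> PurePosixPath:
    """Return root/YYYY/MM/DD/filename for the given day."""
    return PurePosixPath(root, f"{day:%Y}", f"{day:%m}", f"{day:%d}", filename)


def is_expired(day: date, today: date, max_age_days: int) -> bool:
    """True if a day directory is older than max_age_days and can be removed whole."""
    return (today - day).days > max_age_days


print(dated_path("cache", "a.bin", date(2024, 3, 5)))  # cache/2024/03/05/a.bin
```

Expiry then becomes a matter of removing whole day directories rather than inspecting individual files.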

If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear structures.

So even if the FS is capable of coping with incredibly large directories, there are good reasons not to have large flat structures (they're also a pig to back up).
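To make the stat() cost concrete, here is a small Python sketch contrasting a names-only listing with one that also fetches sizes. `os.scandir` is a real standard-library call; the two wrapper functions are illustrative only:

```python
import os


def list_names(directory: str) -> list[str]:
    """Names only: reads the directory entries without stat()ing each file."""
    with os.scandir(directory) as entries:
        return [entry.name for entry in entries]


def list_with_sizes(directory: str) -> list[tuple[str, int]]:
    """Names plus sizes: entry.stat() costs an extra system call per file on Unix."""
    with os.scandir(directory) as entries:
        return [(entry.name, entry.stat().st_size) for entry in entries]
```

The second function is closer to what ls -l has to do, and its cost grows with the number of entries regardless of how the filesystem indexes the directory.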

I've benchmarked GFS2 (clustered) with 32,000 files, either all in one directory or arranged in a tree structure. Recursive listings of the tree were around 300 times faster than listing the flat directory, which could take up to 10 minutes.

EXT4 showed similar ratios, but as the worst case was only a couple of seconds, most people wouldn't notice.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow