Question

I have 15 million simple key/value records. The keys are all single words; the values range in size from a few bytes to 10MB each.

Random keys will need to be frequently accessed.

I'm thinking that it would be much more efficient to just store these as files in a directory instead of in a database. So instead of having a massive table with all of these entries, all I need is a directory with the filename as the key and the value inside the file.

This means that if I want the value for key azpdk I just need to call file_get_contents('/my/directory/azpdk') in PHP instead of troubling MySQL with such a request.

In my head this makes sense and I expect it to be more efficient to use the filesystem instead of a database for this. Am I correct in this assumption? Will this still be fast and efficient with 15 million files in one directory?

FYI the filesystem is xfs.

Solution

There are a few reasons you probably want to look at a database (not necessarily MySQL) rather than the file system for this sort of thing:

More files in one directory slow things down

Although XFS is supposed to be very clever about allocating resources, most filesystems suffer degrading performance as the number of files in a single directory grows, and working with such a directory on the command line becomes a headache too. Looking at this XFS datasheet (http://oss.sgi.com/projects/xfs/datasheet.pdf), there's a graph of lookup performance that only goes up to 50k files per directory, and it's already heading downward at that point.

Overhead

There is a certain amount of filesystem overhead per file (an inode plus metadata, and space allocated in whole blocks). If you have many small files, you may find that the final store bloats well beyond the size of the data itself.

Key cleaning

Are all your words safe to put in a filename? Are you sure? A slash or two in there is really going to ruin your day.
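
To make that concrete, here's a hypothetical key (not from the question) that would turn a lookup into a path traversal:

    <?php
    // Hypothetical bad key: slashes in the key become path components.
    $key = '../../../etc/passwd';
    $path = '/my/directory/' . $key;
    // file_get_contents($path) would now read /etc/passwd, not one of your values.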

NoSQL might be a good option

Something like MongoDB or Redis might be a good option for this. MongoDB can store single documents of up to 16MB and isn't much harder to use than putting things on the filesystem. With values up to 10MB, you might be getting a bit too close to that limit for comfort, but there are other options.

The nice thing about this is that lookup performance is likely to be pretty good out of the box, and if you later find it isn't, you can scale it out by creating a cluster etc. Any system like this will also do a good job of managing the files on disk intelligently for good performance.
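
As a rough sketch of how little code such a lookup needs, here's what it might look like with Redis via the phpredis extension (the connection details and sample value are assumptions, not part of the question):

    <?php
    // Rough sketch using the phpredis extension; assumes a Redis server on localhost.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // One key/value pair, analogous to one file per key.
    $redis->set('azpdk', 'some value');

    $value = $redis->get('azpdk'); // returns false if the key doesn't exist

Redis strings can hold up to 512MB each, so 10MB values are well within its limits.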

If you are going to use the disk

Consider taking an MD5 hash of the key you want to store and basing your filename on it. For example, the MD5 of azpdk is:

1c58fb66d5a4d6a1ebe5ec9e217fbbf9

You could use this to create a filename e.g.:

my_directory/1c5/8fb/66d5a4d6a1ebe5ec9e217fbbf9

This has a few nice features:

  • The hash takes care of scary characters
  • The directories spread out the data, so no directory has more than 4096 entries
  • This means the lookup performance should be relatively decent
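
If you do go this route, a minimal PHP sketch of the layout might look like the following (the helper names are mine, the /my/directory root is just the path from the question, and the 3/3 prefix split matches the example above):

    <?php
    // Minimal sketch of the hashed-path layout described above.
    // key_to_path(), put() and get() are illustrative names, not a standard API.

    function key_to_path(string $key, string $root = '/my/directory'): string
    {
        $hash = md5($key); // e.g. azpdk -> 1c58fb66d5a4d6a1ebe5ec9e217fbbf9
        return sprintf('%s/%s/%s/%s',
            $root,
            substr($hash, 0, 3), // first level: at most 4096 (16^3) entries
            substr($hash, 3, 3), // second level: at most 4096 entries
            substr($hash, 6)     // remaining 26 hex characters as the filename
        );
    }

    function put(string $key, string $value): void
    {
        $path = key_to_path($key);
        if (!is_dir(dirname($path))) {
            mkdir(dirname($path), 0755, true); // create prefix directories on demand
        }
        file_put_contents($path, $value);
    }

    function get(string $key): ?string
    {
        $path = key_to_path($key);
        return is_file($path) ? file_get_contents($path) : null;
    }

    put('azpdk', 'some value');
    echo get('azpdk'); // some value

One caveat: the original keys can't be recovered from the hashes, so this only works if you always look values up by key; if you ever need to enumerate the keys, store them somewhere as well.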

Hope that helps.

OTHER TIPS

I worked in a genomics research centre, where the bioinformaticians were not especially experienced programmers.

Rather than use a database, a few of them would generate millions of small files, until the filesystem ground to a halt.

Licensed under: CC-BY-SA with attribution