Question

I have 15 million simple key/value records. The keys are all single words; the values range in size from a few bytes to 10MB each.

Random keys will need to be frequently accessed.

I'm thinking that it would be much more efficient to just store these as files in a directory instead of in a database. So instead of having a massive table with all of these entries, all I need is a directory with the filename as the key and the value inside the file.

This means that if I want the value for key azpdk I just need to call file_get_contents('/my/directory/azpdk') in PHP instead of troubling MySQL with such a request.

In my head this makes sense and I expect it to be more efficient to use the filesystem instead of a database for this. Am I correct in this assumption? Will this still be fast and efficient with 15 million files in one directory?

FYI the filesystem is xfs.

Solution

There are a few reasons you probably want to look at a database (not necessarily MySQL) rather than the file system for this sort of thing:

More files in one directory slow things down

Although XFS is supposed to be very clever about allocating resources, most filesystems suffer degrading performance as the number of files in a single directory grows, and working with such a directory on the command line becomes a headache too. Looking at this XFS datasheet (http://oss.sgi.com/projects/xfs/datasheet.pdf), there's a graph of lookup performance that only goes up to 50k files per directory, and it's already heading downward at that point.

Overhead

There is a certain amount of filesystem overhead per file (an inode plus metadata, and space allocated in whole blocks). If you have many small files, you may find that the final store bloats well beyond the size of the data itself.

Key cleaning

Are all your words safe to put in a filename? Are you sure? A slash or two in there is really going to ruin your day.
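
To make that concrete, here's a hypothetical key (not from the question) that would turn a lookup into a path traversal:

    <?php
    // Hypothetical bad key: slashes in the key become path components.
    $key = '../../../etc/passwd';
    $path = '/my/directory/' . $key;
    // file_get_contents($path) would now read /etc/passwd, not one of your values.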

NoSQL might be a good option

Something like MongoDB or Redis might be a good option for this. MongoDB can store single documents of up to 16MB and isn't much harder to use than putting things on the filesystem. With values up to 10MB, you might be getting a bit too close to that limit for comfort, but there are other options.

The nice thing about this is that lookup performance is likely to be pretty good out of the box, and if you later find it isn't, you can scale it out by creating a cluster etc. Any system like this will also do a good job of managing the files on disk intelligently for good performance.
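
As a rough sketch of how little code such a lookup needs, here's what it might look like with Redis via the phpredis extension (the connection details and sample value are assumptions, not part of the question):

    <?php
    // Rough sketch using the phpredis extension; assumes a Redis server on localhost.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // One key/value pair, analogous to one file per key.
    $redis->set('azpdk', 'some value');

    $value = $redis->get('azpdk'); // returns false if the key doesn't exist

Redis strings can hold up to 512MB each, so 10MB values are well within its limits.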

If you are going to use the disk

Consider taking an MD5 hash of the key you want to store and basing your filename on it. For example, the MD5 of azpdk is:

1c58fb66d5a4d6a1ebe5ec9e217fbbf9

You could use this to create a filename e.g.:

my_directory/1c5/8fb/66d5a4d6a1ebe5ec9e217fbbf9

This has a few nice features:

  • The hash takes care of scary characters
  • The directories spread out the data, so no directory has more than 4096 entries
  • This means the lookup performance should be relatively decent
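
If you do go this route, a minimal PHP sketch of the layout might look like the following (the helper names are mine, the /my/directory root is just the path from the question, and the 3/3 prefix split matches the example above):

    <?php
    // Minimal sketch of the hashed-path layout described above.
    // key_to_path(), put() and get() are illustrative names, not a standard API.

    function key_to_path(string $key, string $root = '/my/directory'): string
    {
        $hash = md5($key); // e.g. azpdk -> 1c58fb66d5a4d6a1ebe5ec9e217fbbf9
        return sprintf('%s/%s/%s/%s',
            $root,
            substr($hash, 0, 3), // first level: at most 4096 (16^3) entries
            substr($hash, 3, 3), // second level: at most 4096 entries
            substr($hash, 6)     // remaining 26 hex characters as the filename
        );
    }

    function put(string $key, string $value): void
    {
        $path = key_to_path($key);
        if (!is_dir(dirname($path))) {
            mkdir(dirname($path), 0755, true); // create prefix directories on demand
        }
        file_put_contents($path, $value);
    }

    function get(string $key): ?string
    {
        $path = key_to_path($key);
        return is_file($path) ? file_get_contents($path) : null;
    }

    put('azpdk', 'some value');
    echo get('azpdk'); // some value

One caveat: the original keys can't be recovered from the hashes, so this only works if you always look values up by key; if you ever need to enumerate the keys, store them somewhere as well.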

Hope that helps.

OTHER TIPS

I worked in a genomics research centre, where the bioinformaticians were not especially experienced programmers.

Rather than use a database, a few of them would generate millions of small files, until the filesystem ground to a halt.

Licensed under: CC-BY-SA with attribution