Question

I'm working on a system that will need to store a lot of documents (PDFs, Word files, etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that users can open or download them.

I was thinking about several possibilities:

  • file system - probably not that good an idea for storing 1m documents
  • SQL database - but I won't need most of its relational features, as I only need to store the binary document and its id, so this might not be the fastest solution
  • NoSQL database - I don't have any experience with them, so I'm not sure if they are any good either; there are also many of them, so I don't know which one to pick

The storage I'm looking for should be:

  • fast
  • scalable
  • open-source (not crucial but nice to have)

Can you recommend what the best way of storing those files would be, in your opinion?

Solution

A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.

OTHER TIPS

You can follow Facebook's example, since it stores a very large number of files (15 billion photos):

  • They initially started with an NFS share served by commercial storage appliances.
  • Then they moved to their own implementation of an HTTP file server, called Haystack.

Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919

Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This could be a bit counterintuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS shares (like NetApp), you will likely need to keep files in multiple folders.

You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group. For example, we use integers for ids, so the file with id 1234567891 (zero-padded to 12 digits) is stored as storage/0012/3456/7891.
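The folder scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `sharded_path` and the defaults (12 digits, groups of 4, a root called `storage`) are assumptions chosen to reproduce the example path, not a description of any particular system.

```python
from pathlib import PurePosixPath


def sharded_path(doc_id: int, root: str = "storage",
                 width: int = 12, group: int = 4) -> PurePosixPath:
    """Map an integer id to a nested path such as storage/0012/3456/7891.

    The id is zero-padded to `width` digits and cut into chunks of
    `group` characters; every chunk becomes one path component.
    """
    if doc_id < 0:
        raise ValueError("id must be non-negative")
    digits = str(doc_id).zfill(width)
    if len(digits) > width:
        raise ValueError("id does not fit into the configured width")
    parts = [digits[i:i + group] for i in range(0, width, group)]
    return PurePosixPath(root, *parts)


print(sharded_path(1234567891))  # storage/0012/3456/7891
```

With these defaults each folder holds at most 10,000 entries, which keeps individual directories small no matter how many files are stored in total.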

Hope that helps.

In my opinion...

I would store the files compressed on disk (file system) and use a database to keep track of them.

And possibly use SQLite if this is its only job.
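A minimal sketch of that idea, assuming gzip for compression and a single SQLite table as the index. The table layout and the helper names `open_index`, `store` and `load` are hypothetical and only meant to show the shape of the approach.

```python
import gzip
import sqlite3
from pathlib import Path


def open_index(db_path: str) -> sqlite3.Connection:
    """Open (or create) the SQLite index that tracks stored files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def store(conn: sqlite3.Connection, root: Path, name: str, data: bytes) -> int:
    """Write `data` gzip-compressed under `root` and record it in the index."""
    # Insert first so SQLite assigns the id that names the file on disk.
    cur = conn.execute(
        "INSERT INTO documents (name, path) VALUES (?, '')", (name,)
    )
    doc_id = cur.lastrowid
    target = root / f"{doc_id}.gz"
    target.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(target, "wb") as fh:
        fh.write(data)
    conn.execute(
        "UPDATE documents SET path = ? WHERE id = ?", (str(target), doc_id)
    )
    conn.commit()
    return doc_id


def load(conn: sqlite3.Connection, doc_id: int) -> bytes:
    """Look up the path for `doc_id` and return the decompressed content."""
    row = conn.execute(
        "SELECT path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    with gzip.open(row[0], "rb") as fh:
        return fh.read()
```

The database only holds the id, the original name and the path, so it stays small; the documents themselves never pass through SQL. For a large collection, the flat `{id}.gz` layout used here could be swapped for nested folders.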

File system: looking at the big picture, a DBMS ends up using the file system anyway. And the file system is dedicated to storing files, so you benefit from its optimizations (as LukeH mentioned).

Licensed under: CC-BY-SA with attribution