Question

Please consider the following:

I am storing around 1.2 million TIF files, ranging from 40 KB to 120 KB in size.

These documents are stored on a Windows server with an NTFS file system.

The documents are stored using the following variables:

  • client
  • document type
  • image folder
  • actual image

See below:

C:\<client_id>\<doc_type_id>\image001\1.TIF

Example

C:\1\3\image001\1.TIF

It is a PHP-hosted system.

The performance is acceptable at this stage, but I want to know what the best strategy is going forward, considering that the number of customers and documents is going to increase dramatically.

I am looking at replacing the complete storage with Jackrabbit CMS.

Would this be the way to go? Or is storing the documents in a format like:

  • Customer
  • Document type
  • Julian date (day of the year the document was imported)
  • Current user
  • 6-digit unique code

Example

C:\1\1\167\2\453257\image001\image.TIF

going to be just as efficient?

Please take all other considerations of CMS vs. file system out of the picture, e.g. versioning and data backup.

Thanks.

Solution

Your question is very similar to this one. Is your load primarily reading your images or writing them? If it's read scalability you need, that post describes memcached, which is probably all you need. Jackrabbit has loads more features, but it is geared more toward hierarchical text storage, and I'm not sure it will perform any better on your images. Also, if you do choose Jackrabbit, make sure your content hierarchy is deep enough for Jackrabbit to stay efficient: any parent with 10,000 or more children is going to have sub-par performance.
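
To make the read-caching idea concrete, here is a minimal sketch of the read-through pattern. It is only an illustration: a plain in-process dict stands in for memcached, and the names `ReadThroughCache` and `load` are hypothetical, not part of any memcached client library.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve repeated reads from memory; hit the disk only on a miss.

    ASSUMPTION: the dict below is a stand-in for a memcached client.
    A real deployment would call the client's get/set instead.
    """

    def __init__(self, load: Callable[[str], bytes]) -> None:
        self._load = load                    # slow path, e.g. reading the TIF from disk
        self._store: Dict[str, bytes] = {}   # stand-in for the cache server
        self.misses = 0

    def get(self, path: str) -> bytes:
        data = self._store.get(path)
        if data is None:
            # Cache miss: load once from the backing store and remember it.
            self.misses += 1
            data = self._load(path)
            self._store[path] = data
        return data


def read_file(path: str) -> bytes:
    """Hypothetical loader that reads an image from the file system."""
    with open(path, "rb") as handle:
        return handle.read()
```

Usage would be along the lines of `cache = ReadThroughCache(read_file)` followed by `cache.get(some_path)`; only the first request for a given path touches the disk.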

OTHER TIPS

Honestly? I don't think it matters until you get to a certain size (and I can't, for the life of me, remember that size...). The thing is to find a method and then stick with it, hopefully in such a way that you never need to touch it again. My own advice, without anything as convincing as evidence to support it, is something akin to your own suggestion:

c:\<customer_id>\<document_year>\<document_month>\<document_day>\actual_file.tif
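
As a rough sketch of how such a path could be built, assuming zero-padded month and day folders (my assumption, the layout above does not specify padding) and a hypothetical helper name `dated_path`:

```python
from datetime import date
from pathlib import PureWindowsPath


def dated_path(root: str, customer_id: int, imported_on: date,
               filename: str) -> PureWindowsPath:
    """Build root/customer_id/year/month/day/filename as a Windows path.

    ASSUMPTION: month and day are zero-padded so folders sort
    chronologically in a directory listing.
    """
    return PureWindowsPath(
        root,
        str(customer_id),
        f"{imported_on.year:04d}",
        f"{imported_on.month:02d}",
        f"{imported_on.day:02d}",
        filename,
    )


# Example call; the customer id and date are made up for illustration.
example = dated_path("C:\\", 1, date(2010, 6, 16), "actual_file.tif")
```

PureWindowsPath only manipulates the path string, so this runs on any operating system and never touches the disk.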

I'd also raise the suggestion that, depending on your server setup, it might be worth giving each customer (depending on the amount of data or account type) their own drive/partition.

Bear in mind that, without some sort of user-control or permissions system, file paths could be predictably guessed and browsed (as if you didn't know this already... I know, I'm sorry). The fact that you raised the bullet point of a 'six digit unique code' suggests that you don't need paths in a common format, but I would suggest that a common format (whichever one you end up choosing) would be a better idea.

Back in my Windows days I sorted my own directories around the file's primary relation, which would be considered a 'tag' nowadays (c:\documents and settings\university\year1\module21\assignment1.doc, for example). This made it easier to find things later. Your customers appear to have their directory structure enforced by you, but finding things that they did last week is easier if they only have to traverse the date. Remembering where they put something last week when they get to folders named with six-digit unique numbers is going to be, well, difficult. At best.

The storage strategy you proposed would need to be revisited if you intend to move your content to different machines (SAN/NAS). To do this, you would need to strip all the customer data from the path and instead create a hash, which you then save in the database to link to the file you are accessing. This leaves you with a folder structure something like this:

NAS1/00/01/86/63/54/89/image01/image.tiff
NAS2/00/02/46/62/22/11/image02/image.tiff
...
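
A minimal sketch of deriving such a path follows. Several things here are my assumptions rather than part of the scheme above: the hash is SHA-256, the folder names are hex pairs (the example above shows decimal pairs), and `hashed_path` and `document_key` are hypothetical names.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_path(volume: str, document_key: str, filename: str,
                levels: int = 6) -> PurePosixPath:
    """Map a document key to volume/xx/xx/.../filename.

    ASSUMPTION: SHA-256 and two-hex-character folder names; any stable
    hash and any fixed split would serve the same purpose.
    """
    digest = hashlib.sha256(document_key.encode("utf-8")).hexdigest()
    # Take the first `levels` two-character pairs as nested folder names.
    parts = [digest[i:i + 2] for i in range(0, levels * 2, 2)]
    return PurePosixPath(volume, *parts, filename)


# The key format is made up for illustration; no customer data
# appears in the resulting path.
relative = hashed_path("NAS1", "client-1/doc-42", "image.tiff")
```

The resulting path is what gets saved in the database next to the document record, so a file can later move to another volume by updating that one row.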

I would also recommend you take a gander at MogileFS. All you need to do to speed it up is to add some sort of a proxy in front of it and all should be well.

And like Dave mentioned, make sure you don't have too many children in one folder. Things tend to get quite sluggish around 10,000.
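
One way to keep every folder under that limit is to split a running document number into fixed-width buckets. This is only a sketch under my own assumptions: a fan-out of 1,000 entries per folder, three levels, and a hypothetical helper name `bucketed_parts`.

```python
from typing import List


def bucketed_parts(doc_number: int, fanout: int = 1000,
                   depth: int = 3) -> List[str]:
    """Split a document number into `depth` folder/file name parts.

    Each part is a zero-padded value below `fanout`, so no directory
    ever holds more than `fanout` children.
    ASSUMPTION: fanout=1000 and depth=3 are illustrative defaults,
    giving room for one billion documents.
    """
    if not 0 <= doc_number < fanout ** depth:
        raise ValueError("doc_number does not fit in the chosen layout")
    width = len(str(fanout - 1))
    parts = []
    for _ in range(depth):
        # Peel off the least significant bucket each round.
        doc_number, remainder = divmod(doc_number, fanout)
        parts.append(f"{remainder:0{width}d}")
    return parts[::-1]


# Document 1,234,567 lands in folder 001, subfolder 234, as entry 567.
parts = bucketed_parts(1234567)
```

With these defaults the 1.2 million files from the question would spread over roughly 1,200 second-level folders of at most 1,000 files each.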

Licensed under: CC-BY-SA with attribution