Question

Disclaimer: using cloud services like Amazon S3 or Azure Blob Storage isn't an option at all.

Goal: hosting millions (*) of images and video files on Windows Servers. I am aware of the limitations of NTFS in that context, so I gave MongoDB with GridFS (and its 2 GB containers) a try, which worked well but was a bit slow (I have not figured out why yet).

My questions:

  1. Are there any real-world reports regarding the use of MongoDB/GridFS with large numbers of files?
  2. Is there any other known option that is reliable, easily configurable, and horizontally scalable?

I know my scenario is described very vaguely, but I don't have any real data yet, so please don't blame me ;-).

(*) probably only tens of thousands to hundreds of thousands, but hopefully someday millions ...

Thanks!


Solution 2

Since I have no experience with GridFS, I'll just describe something I saw a couple of years ago in a fairly large system (250+ million documents, 10 KB to hundreds of MB in size).

Document retrieval was initiated by a host system (probably your core application) that knew only a repository name and a token for the document.

The document storage itself consisted of a web server, a database, and a (quite sophisticated) filesystem (a SAN with SATA, SCSI, and tape).

The web server received a request for a document in a certain repository, fetched the metadata from a database (repo name, token -> folder name, file name), fetched the file from disk, and sent it out over the wire. No database-integrated file streams or the like were used. This concept was very fast, simple, and sturdy. We once ran a comparison against database storage (IIRC Oracle and MSSQL), which turned out to be a disaster for those databases, especially in terms of speed. I think MSSQL wasn't using the native filesystem at that time.
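A minimal sketch of that flow in Go (the documents table, its columns, the /data root, and the Postgres driver are my own hypothetical choices, not details from the original system):

    // Resolve (repo, token) to a path via the metadata database,
    // then stream the file straight from the native filesystem.
    package main

    import (
        "database/sql"
        "log"
        "net/http"
        "path/filepath"

        _ "github.com/lib/pq" // assumed driver; any SQL database would do
    )

    var db *sql.DB

    func serveDocument(w http.ResponseWriter, r *http.Request) {
        repo := r.URL.Query().Get("repo")
        token := r.URL.Query().Get("token")

        // Metadata lookup: (repo name, token) -> (folder name, file name).
        var folder, file string
        err := db.QueryRow(
            `SELECT folder_name, file_name FROM documents WHERE repo_name = $1 AND token = $2`,
            repo, token,
        ).Scan(&folder, &file)
        if err != nil {
            http.Error(w, "document not found", http.StatusNotFound)
            return
        }

        // Serve the file directly from disk; http.ServeFile handles
        // content types and range requests for us.
        http.ServeFile(w, r, filepath.Join("/data", folder, file))
    }

    func main() {
        var err error
        db, err = sql.Open("postgres", "postgres://user:pass@localhost/docstore?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        http.HandleFunc("/doc", serveDocument)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }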

To add horizontal scalability, you probably only need a mechanism to distribute your load between servers (a.k.a. repositories or shards), as sketched below.
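One way to do that (my own illustration, not something taken from the system described above) is to hash the document token so every token maps to a stable server; the shard URLs here are made up:

    // Pick a shard for a token by hashing it.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    var shards = []string{
        "http://repo-01.internal",
        "http://repo-02.internal",
        "http://repo-03.internal",
    }

    // shardFor returns the server responsible for a given document token.
    func shardFor(token string) string {
        h := fnv.New32a()
        h.Write([]byte(token))
        return shards[h.Sum32()%uint32(len(shards))]
    }

    func main() {
        fmt.Println(shardFor("doc-42")) // always the same shard for this token
    }

Note that plain modulo hashing remaps most tokens when you add a shard; consistent hashing or a shard column in the metadata database avoids that.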

In my experience, the retrieval and loading speed of files in such document stores is closely tied to the kind of storage you use. RAID systems, SANs, in-memory filesystems, or RamSan devices are a must, depending on your requirements.

IMHO, if you want speed, always use the native filesystem and understand what it's doing. This implies that you have to do some of the dirty work (especially sharding) yourself.

OTHER TIPS

I would like to share our success story. We are using MongoDB GridFS to store millions of images. One of our storage clusters has:

  • 2 MongoDB shards
  • about 500 GB of data
  • 14,998,166 files
  • 2.5 GB index size

As a frontend we have nginx and a simple daemon written in Go that is able to serve data from GridFS at more than 1,000 requests per second. A rough sketch of such a daemon is below.
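This is only an illustration of what such a daemon might look like, using the official MongoDB Go driver; the original daemon's code is not shown here, and the connection URI, database name, and URL layout are assumptions:

    // Stream files out of a GridFS bucket over HTTP; nginx would sit in
    // front of this as a reverse proxy / cache.
    package main

    import (
        "context"
        "io"
        "log"
        "net/http"
        "strings"

        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/gridfs"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    var bucket *gridfs.Bucket

    func serveFile(w http.ResponseWriter, r *http.Request) {
        // e.g. GET /files/cat.jpg -> GridFS filename "cat.jpg"
        name := strings.TrimPrefix(r.URL.Path, "/files/")

        stream, err := bucket.OpenDownloadStreamByName(name)
        if err != nil {
            http.Error(w, "not found", http.StatusNotFound)
            return
        }
        defer stream.Close()

        if _, err := io.Copy(w, stream); err != nil {
            log.Printf("copy failed for %s: %v", name, err)
        }
    }

    func main() {
        client, err := mongo.Connect(context.Background(),
            options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        bucket, err = gridfs.NewBucket(client.Database("media"))
        if err != nil {
            log.Fatal(err)
        }
        http.HandleFunc("/files/", serveFile)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }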

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow