Question

Disclaimer: using cloud services like Amazon S3 or Azure Blob Storage isn't an option at all.

Goal: hosting millions (*) of images and video files on Windows Servers. I am aware of the limitations of NTFS in that context, so I gave MongoDB with GridFS (and its 2 GB containers) a try, which worked well but was a bit slow (I have not figured out why yet).

My questions:

  1. Are there any real-world reports regarding the use of MongoDB/GridFS with large numbers of files?
  2. Is there any other known option that is reliable, easily configurable, and horizontally scalable?

I know my scenario is described very vaguely, but I don't have any real data yet, so please don't blame me ;-).

(*) probably only tens of thousands to hundreds of thousands, but hopefully someday millions ...

Thanks!


Solution 2

Since I have no experience with GridFS, I'll just describe something I saw a couple of years ago in a fairly large system (250+ million documents, 10 KB to hundreds of MB in size).

Document retrieval was initiated by a host system (probably your core application) that knew only a repository name and a token for the document.

The document storage itself consisted of a web server, a database, and a (quite sophisticated) filesystem (a SAN with SATA, SCSI, and tape).

The web server received a request for a document in a certain repository, fetched the metadata from a database (repo name, token -> folder name, file name), fetched the file from disk, and sent it out over the wire. No database-integrated file streams or the like were used. This concept was very fast, simple, and sturdy. We once ran a comparison against database storage (IIRC Oracle and MSSQL), which turned out to be a disaster for those databases, especially in terms of speed. I think MSSQL wasn't using the native filesystem at that time.
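A minimal sketch of that flow in Go (the documents table, its columns, the /data root, and the Postgres driver are my own hypothetical choices, not details from the original system):

    // Resolve (repo, token) to a path via the metadata database,
    // then stream the file straight from the native filesystem.
    package main

    import (
        "database/sql"
        "log"
        "net/http"
        "path/filepath"

        _ "github.com/lib/pq" // assumed driver; any SQL database would do
    )

    var db *sql.DB

    func serveDocument(w http.ResponseWriter, r *http.Request) {
        repo := r.URL.Query().Get("repo")
        token := r.URL.Query().Get("token")

        // Metadata lookup: (repo name, token) -> (folder name, file name).
        var folder, file string
        err := db.QueryRow(
            `SELECT folder_name, file_name FROM documents WHERE repo_name = $1 AND token = $2`,
            repo, token,
        ).Scan(&folder, &file)
        if err != nil {
            http.Error(w, "document not found", http.StatusNotFound)
            return
        }

        // Serve the file directly from disk; http.ServeFile handles
        // content types and range requests for us.
        http.ServeFile(w, r, filepath.Join("/data", folder, file))
    }

    func main() {
        var err error
        db, err = sql.Open("postgres", "postgres://user:pass@localhost/docstore?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        http.HandleFunc("/doc", serveDocument)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }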

To add horizontal scalability, you probably only need a mechanism to distribute your load between servers (a.k.a. repositories or shards), as sketched below.
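One way to do that (my own illustration, not something taken from the system described above) is to hash the document token so every token maps to a stable server; the shard URLs here are made up:

    // Pick a shard for a token by hashing it.
    package main

    import (
        "fmt"
        "hash/fnv"
    )

    var shards = []string{
        "http://repo-01.internal",
        "http://repo-02.internal",
        "http://repo-03.internal",
    }

    // shardFor returns the server responsible for a given document token.
    func shardFor(token string) string {
        h := fnv.New32a()
        h.Write([]byte(token))
        return shards[h.Sum32()%uint32(len(shards))]
    }

    func main() {
        fmt.Println(shardFor("doc-42")) // always the same shard for this token
    }

Note that plain modulo hashing remaps most tokens when you add a shard; consistent hashing or a shard column in the metadata database avoids that.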

In my experience, the retrieval and loading speed of files in such document stores is closely tied to the kind of storage you use. RAID systems, SANs, in-memory filesystems, or RamSan devices are a must, depending on your requirements.

IMHO, if you want speed, always use the native filesystem and understand what it's doing. This implies that you have to do some of the dirty work (especially sharding) yourself.

OTHER TIPS

I would like to share our success story. We are using MongoDB GridFS to store millions of images. One of our storage clusters has:

  • 2 MongoDB shards
  • about 500 GB of data
  • 14,998,166 files
  • 2.5 GB index size

As a frontend we have nginx and a simple daemon written in Go that is able to serve data from GridFS at more than 1,000 requests per second. A rough sketch of such a daemon is below.
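This is only an illustration of what such a daemon might look like, using the official MongoDB Go driver; the original daemon's code is not shown here, and the connection URI, database name, and URL layout are assumptions:

    // Stream files out of a GridFS bucket over HTTP; nginx would sit in
    // front of this as a reverse proxy / cache.
    package main

    import (
        "context"
        "io"
        "log"
        "net/http"
        "strings"

        "go.mongodb.org/mongo-driver/mongo"
        "go.mongodb.org/mongo-driver/mongo/gridfs"
        "go.mongodb.org/mongo-driver/mongo/options"
    )

    var bucket *gridfs.Bucket

    func serveFile(w http.ResponseWriter, r *http.Request) {
        // e.g. GET /files/cat.jpg -> GridFS filename "cat.jpg"
        name := strings.TrimPrefix(r.URL.Path, "/files/")

        stream, err := bucket.OpenDownloadStreamByName(name)
        if err != nil {
            http.Error(w, "not found", http.StatusNotFound)
            return
        }
        defer stream.Close()

        if _, err := io.Copy(w, stream); err != nil {
            log.Printf("copy failed for %s: %v", name, err)
        }
    }

    func main() {
        client, err := mongo.Connect(context.Background(),
            options.Client().ApplyURI("mongodb://localhost:27017"))
        if err != nil {
            log.Fatal(err)
        }
        bucket, err = gridfs.NewBucket(client.Database("media"))
        if err != nil {
            log.Fatal(err)
        }
        http.HandleFunc("/files/", serveFile)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }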

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow