Question

I'm in the process of designing a server responsible for serving files that are between 10MB and 50MB in size.
Initially we will run two instances of the server (let's call them fs1 and fs2), with future plans to switch to a microservice architecture in which the number of server instances will grow or shrink depending on load.

These two instances need to interact with a third server running a scheduler and a file management application, as well as a database (on another server) where some metadata will be saved for clients to use.

My initial thought was to use RabbitMQ to allow fs1 and fs2 to communicate with each other and with the management app. The process would work as follows (a rough sketch of the notification in step 2 appears after the list):

  1. The management app uploads the file to fs1 (it could be either fs1 or fs2)
  2. fs1 notifies fs2 and the management app when upload is complete
  3. fs2 contacts fs1 and stores a copy of the file
  4. fs2 notifies the management app when upload is complete
  5. The management app saves metadata to the external database
  6. Both fs1 and fs2 can now serve the files when requested
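
To make step 2 concrete, here is roughly what I had in mind for the notification, sketched with the Python pika client. The host, exchange name and message fields are just placeholders, not a final design:

    # Minimal sketch: fs1 broadcasts "upload complete" to a fanout exchange.
    # Host name, exchange name and message fields are illustrative placeholders.
    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()

    # A fanout exchange delivers the notification to every bound file server
    # and to the management app, without fs1 knowing who the consumers are.
    channel.exchange_declare(exchange="file-events", exchange_type="fanout")

    event = {"event": "upload_complete", "file_id": "abc123", "source": "fs1"}
    channel.basic_publish(exchange="file-events",
                          routing_key="",
                          body=json.dumps(event))
    connection.close()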

This seems OK if there are only two instances, but once you start adding more it doesn't work.
Our ops department are very much against the idea of using the database to store files. They are worried that it will slow down the system too much. I agree it might, which is why I want a separate database for the specific purpose of storing the files and metadata. I want to build something like the following:
[system diagram]

My thinking is that the upload service can manage uploading of files and saving of metadata to the database. When the scheduler schedules a new job, the upload service (badly named, I know, but I'm not making that image again :-) ) can notify the file server instances that they need to cache the required file(s) from the database, which they can access directly.
The file servers won't need to cache more than 5 or 6 files each at a time.
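
For the caching on each file server, I was picturing something as simple as a small LRU keyed by file id. A rough sketch (the fetch and delete calls are placeholders for pulling from the file database and cleaning up local copies):

    # Rough sketch of the per-instance cache: at most MAX_CACHED files are kept
    # on local disk; the least recently used one is evicted when a new file is
    # fetched. fetch_from_db() and remove_local_file() are placeholders.
    from collections import OrderedDict

    MAX_CACHED = 6

    class FileCache:
        def __init__(self):
            self._paths = OrderedDict()  # file_id -> local path

        def get(self, file_id):
            if file_id in self._paths:
                self._paths.move_to_end(file_id)   # mark as recently used
                return self._paths[file_id]
            local_path = fetch_from_db(file_id)    # placeholder: pull from the file DB
            self._paths[file_id] = local_path
            if len(self._paths) > MAX_CACHED:
                _, evicted = self._paths.popitem(last=False)
                remove_local_file(evicted)         # placeholder: delete the local copy
            return local_path
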
Also, in the diagram I missed that the file management service will receive download progress messages from both file servers.

So to my questions:

  1. Is this a reasonable way to store files of this size for serving?
  2. Is this the right way to be thinking when considering the move to microservices in the future?
  3. Are there advantages to storing the files on the file system of each fs instance instead of just caching?
  4. How can I convince our ops team that storing 50MB files in a database is the way to go? What are the pros and cons?
  5. Any other thoughts or comments appreciated.

Solution

NO, don't store files in a relational database

Trust me, I've learned this the hard way. One problem with applications that deal with files is that, as they evolve, users always want to store more than the application was intended to handle.

I once created an application with a document storage component meant to store Word and Excel documents. The storage component was useful enough that eventually people started storing videos in it.

I mention this because the performance implications will be greater than you expect, which leads me to my next point.

Even if a database can handle files fine (e.g. with a FILESTREAM type), scaling a DB is hard; it is always the hardest part of the system to scale. Let the DB concentrate on saving and retrieving data, so that you can put off scaling it as long as possible. If your DB is busy serving a large file, those are resources not being used to serve transaction and lookup requests, which are its bread and butter.

Server-to-server synchronization does not scale well

Your system seems over-complicated to me; I would go with a simpler design. The problem with servers fs1 and fs2 talking to each other is that, as you scale, the number of communication paths grows quadratically.

With two servers, each server only has to make one sync request to the other, for a total of 2 paths. With 3 servers there are 6 paths, and with 5 servers there are 20: syncRequests = n * (n - 1), where n is the number of servers.
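
If you want to see how quickly that blows up, here is a trivial check (plain Python, just for illustration):

    # Number of pairwise sync paths for n servers: n * (n - 1)
    for n in (2, 3, 5, 10, 20):
        print(n, "servers ->", n * (n - 1), "sync paths")
    # 2 -> 2, 3 -> 6, 5 -> 20, 10 -> 90, 20 -> 380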

I would simply have a dedicated DB server and a dedicated file server that the fsN servers talk to. If you need more complex synchronization behavior, add a dedicated Redis server to the mix to serve as the single source of truth for non-persistent details.

The point is: don't have fs1 talking to fs2, or vice versa; this will not scale.

Graph

                             [ fs1 ] [ fs2 ] [ fs3 ] [ etc ]
                                |       |       |       |
                                +-------+---+---+-------+
                                            |
                         +------------------+-------------------+
                         |                  |                   |
                      [ RDB ]           [ Redis ]           [ Files ] 
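
To make the "single source of truth" idea concrete, here is a minimal sketch (using the Python redis client; the key layout, host names and download call are assumptions, not a prescribed design) of how an fsN instance could find out which files it should cache without ever talking to its siblings:

    # Minimal sketch: each fsN instance consults Redis (the shared source of
    # truth) for its cache assignments instead of asking other file servers.
    # Host names, key layout and download_from_file_server() are placeholders.
    import redis

    r = redis.Redis(host="redis", port=6379, decode_responses=True)

    SERVER_ID = "fs1"

    def assigned_files():
        """Return the set of file ids this instance is expected to cache."""
        return r.smembers(f"cache-assignments:{SERVER_ID}")

    def mark_cached(file_id):
        """Record that this instance now holds a local copy of file_id."""
        r.sadd(f"cached-on:{file_id}", SERVER_ID)

    for file_id in assigned_files():
        download_from_file_server(file_id)  # placeholder: pull from the dedicated file server
        mark_cached(file_id)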

The best of both worlds?

You can head off most of the disadvantages of storing your files in an RDB, and still get most of the advantages, by segregating a completely separate DB instance and storing only your files there. This is a viable option if you don't want to set up and maintain a file server.

A quick word about microservices

I am not sure why you would want to go the microservices route. The original intent of microservices was to get around political problems, not technical ones; for example, a server admin who refuses to open any ports other than 80.

OTHER TIPS

Obviously, you have some options but there are also trade-offs. To answer your questions:

Is this a reasonable way to store files of this size for serving?

The answer to that really depends on your database and its vendor's recommendations. For example, SQL Server has good support for arbitrarily sized files if you use the FILESTREAM option (or something that in turn makes use of it, like FileTable). But you'll get a different answer with something like SQLite, which can run into problems if the overall database file gets too large.

If you are hosting in the cloud, then the best bet is to make use of the blob store offered by your provider. You can even use something like the Apache jclouds library to abstract the actual cloud provider away from the process of storing and retrieving files. AWS calls its offering S3, but all major providers have some sort of blob store that is geared for serving files directly.
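
With AWS, for instance, the interaction is little more than this (a boto3 sketch; the bucket name, key and paths are placeholders):

    # Minimal sketch of using a cloud blob store (AWS S3 via boto3) instead of
    # the database for the file bodies. Bucket, key and paths are placeholders.
    import boto3

    s3 = boto3.client("s3")

    # Upload service: push the file and keep only its key + metadata in the DB.
    s3.upload_file("/tmp/report.bin", "my-file-bucket", "files/report.bin")

    # File server (or the client directly): hand out a short-lived download URL
    # rather than proxying 50MB through your own process.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-file-bucket", "Key": "files/report.bin"},
        ExpiresIn=3600,
    )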

Is this the right way to be thinking when considering the move to microservices in the future?

Probably not. Think of microservices as complete and autonomous entities, whether they are backed by S3, a simple file store, or a dedicated database.

By creating something like a "Blob Store" microservice, you can use hashes to prevent duplicate files. If two people upload the same file under different names or paths, you can be reasonably sure there is only one copy of it in your blob store. You would store the file using its hash as the filename, and if necessary you could put metadata in a JSON file stored under the same name with a .json extension, or look the metadata up by hash.
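
A rough sketch of that hashing scheme (plain Python; the directory layout and metadata fields are assumptions, not a spec):

    # Rough sketch of a content-addressed blob store: the file is stored under
    # its SHA-256 hash, so identical uploads collapse into one copy, and the
    # metadata lives next to it as <hash>.json. Paths are illustrative.
    import hashlib
    import json
    import shutil
    from pathlib import Path

    STORE = Path("/var/blobstore")

    def store_blob(src_path: str, original_name: str) -> str:
        digest = hashlib.sha256(Path(src_path).read_bytes()).hexdigest()
        blob_path = STORE / digest
        if not blob_path.exists():             # duplicate uploads become no-ops
            shutil.copyfile(src_path, blob_path)
        meta = {"original_name": original_name, "sha256": digest,
                "size": blob_path.stat().st_size}
        (STORE / f"{digest}.json").write_text(json.dumps(meta))
        return digest                          # store this key in your DB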

Are there advantages to storing the files on the file system of each fs instance instead of just caching?

Not as much as using S3, which hides away the whole caching/storage-distribution problem so you don't have to solve it.

However, the file system allows you to have greater control over how your files are stored, allowing for encryption and/or compression at the service level if you need it.
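
For instance, a file server that owns its own storage directory can transparently compress what it stores (a sketch; the paths are placeholders, and whether compression actually pays off for your file types is a separate question):

    # Sketch of service-level compression when the file server owns its own
    # storage directory. Paths are illustrative; encryption could be layered
    # in at the same point, before the write.
    import gzip
    import shutil
    from pathlib import Path

    STORAGE = Path("/var/fs-storage")

    def save_compressed(src_path: str, file_id: str) -> Path:
        dest = STORAGE / f"{file_id}.gz"
        with open(src_path, "rb") as src, gzip.open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return dest

    def open_for_serving(file_id: str):
        # Returns a file-like object that decompresses on the fly.
        return gzip.open(STORAGE / f"{file_id}.gz", "rb")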

How can I convince our ops team that storing 50MB files in a database is the way to go? What are the pros and cons?

Depends on:

  1. How many files we are expecting
  2. How fast it grows
  3. Cost of indexing

The more records you are dealing with, the higher your risk of bad locking events slowing down your system.

The main killer feature of a database is that backup and restore includes the file content as well as the metadata. It's great for disaster recovery, but beyond that the advantages start diminishing fast.

Using cloud storage basically mitigates most disaster recovery problems with a better support story than traditional databases. You'll still need the offsite archive for full disaster recovery, but that's the same requirement if you had a database.

A plain file system is most disadvantaged here, but you can stand up the Hadoop Distributed File System (HDFS) and get a cloud-style blob store within your own network. So there are some options I don't think you've explored yet.

Any other thoughts or comments appreciated.

It sounds like you are moving to a more distributed, cloud-based solution, but still thinking in traditional development terms. I get it; I am still going through the learning curve at the job I'm in now. We opted for a Blob Store microservice that uses cloud-provided storage, which allows us to prevent storing duplicate data, among other things.

I'm also supporting a legacy app that had to change how it stores files in the database so that we could handle larger files. Changing how files are stored in a database, while complying with the DB vendor's recommendations, has far more implications than you might think.

I think you will outgrow the file database sooner rather than later.

Storing files in databases can be done, but all relational DBMSs I am aware of are designed to store relatively small pieces of data per entry. You can of course set up an additional DBMS instance from which only the files are served (some performance tuning is strongly recommended in that case), but in the end, filesystems are already designed to store files, regardless of their size.

So I'm wondering if you have thought about using a distributed filesystem, as I believe this would be much easier and more performant. There are many such systems out there, such as Ceph or LizardFS, to name just a few.

You could then simply expose the distributed filesystem to the app server, where it can be treated as a local volume while providing as much storage capacity as needed (those distributed filesystems scale very well).

Hope this helps

Sure, try it.

Most databases support text and blob types precisely for the purpose of storing large blocks of "unstructured" data, so most DBMSs have already made allowances for your use case. But leave this aspect of your application open to configuration.
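
One way to keep it open to configuration is to hide the storage decision behind a single switch, something like this sketch (the backend helpers are placeholders for whatever you actually wire in):

    # Sketch of keeping the storage decision configurable: the rest of the
    # application calls save_file() and never cares whether the bytes live in
    # the DB, on disk, or in a blob store. Backend helpers are placeholders.
    import os

    STORAGE_BACKEND = os.environ.get("STORAGE_BACKEND", "filesystem")

    def save_file(file_id: str, data: bytes) -> None:
        if STORAGE_BACKEND == "database":
            db_save_blob(file_id, data)    # placeholder: INSERT into a blob column
        elif STORAGE_BACKEND == "s3":
            s3_upload(file_id, data)       # placeholder: push to a blob store
        else:
            with open(f"/var/files/{file_id}", "wb") as f:
                f.write(data)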

Ideally, ensure that metrics are in place to monitor anything that might change negatively — both as a result of this and any other change in usage patterns. Things like CPU utilization, disk IO, network IO, latency, etc.
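
Something as small as this, logged on a schedule, is enough to see whether a change hurt (a sketch using the psutil library; how you ship the numbers and what thresholds you set are up to you):

    # Sketch: capture the system-level metrics mentioned above (CPU, disk IO,
    # network IO) so before/after comparisons are based on data, not opinions.
    import time
    import psutil

    def snapshot():
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        return {
            "ts": time.time(),
            "cpu_percent": psutil.cpu_percent(interval=1),
            "disk_read_bytes": disk.read_bytes,
            "disk_write_bytes": disk.write_bytes,
            "net_sent_bytes": net.bytes_sent,
            "net_recv_bytes": net.bytes_recv,
        }

    print(snapshot())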

If you have data to show the impact of your changes and your application is open to configuration, you can have factual conversations about the impact of your decisions — and you can pivot quickly when you make a bad decision.

With data in hand, the opinions of a bunch of strangers on the internet are largely meaningless, and you and your colleagues can have rational, productive conversations.

Licensed under: CC-BY-SA with attribution