Question

My company is currently working on adopting a microservice architecture but we are encountering some growing pains (shock!) along the way. One of the key contention points we are facing is how to communicate large quantities of data between our different services.

As a bit of background, we have a document store that serves as a repository for any document we might need to handle across the company. Interacting with that store goes through a service which gives a client a unique ID and a location to stream the document to. The document's location can later be retrieved by a lookup with that ID.
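
To make that concrete, the interaction looks roughly like this (a Python sketch; the base URL, endpoints, and field names are placeholders, not our real API):

```python
# Illustrative sketch of how a client talks to our document-store service.
import requests

STORE_URL = "http://document-store-service"  # placeholder address

# 1. Ask the store for a new document slot: a unique ID plus a location
#    to stream the document's bytes to.
slot = requests.post(f"{STORE_URL}/documents").json()
doc_id, upload_location = slot["id"], slot["location"]

# 2. Stream the document to that location.
with open("contract.pdf", "rb") as f:
    requests.put(upload_location, data=f)

# 3. Later, any holder of the ID can look up where the document lives.
location = requests.get(f"{STORE_URL}/documents/{doc_id}").json()["location"]
```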

The problem is this: does it make sense for all our microservices to accept this unique ID as part of their API when they interact with documents? To me this feels inherently wrong: the services are no longer independent and rely on the document store's service. While I acknowledge this might simplify API design and perhaps even bring some performance gains, the resulting coupling more than counterbalances the benefits.

Does anyone know how the rainbow unicorns (Netflix, Amazon, Google, etc.) handle large files / data exchange between their services?


Solution

> Does anyone know how the rainbow unicorns (Netflix, Amazon, Google, etc.) handle large files / data exchange between their services?

Unfortunately I do not know how they deal with such problems.

> The problem is this: does it make sense for all our microservices to accept this unique ID as part of their API when they interact with documents?

It violates the Single Responsibility Principle, which should be inherent in your microservice architecture: one microservice (logically one, even if physically many instances represent it) should deal with one topic.

In the case of your document store, you have a single point where all queries for documents go (of course, you could split this logical unit into multiple document stores for different kinds of documents).

  • If your "application" needs to work on a document, it asks the respective microservice and processes its result(s).

  • If another service needs an actual document or parts of it, it has to ask the document service (a sketch follows the list).
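
A minimal sketch of that interaction, assuming a hypothetical HTTP API for the document service (the base URL, endpoint paths, and field names are all illustrative, not a known API):

```python
# Hypothetical client code: a consuming service asks the document service
# for a document by ID instead of reaching into the store directly.
import requests

DOCUMENT_SERVICE_URL = "http://document-service"  # assumed internal address

def fetch_document(document_id: str) -> bytes:
    """Resolve a document ID via the document service and stream its content."""
    # Look up the document's location by its ID (endpoint name is illustrative).
    meta = requests.get(f"{DOCUMENT_SERVICE_URL}/documents/{document_id}")
    meta.raise_for_status()
    location = meta.json()["location"]

    # Stream the actual bytes from the returned location.
    resp = requests.get(location, stream=True)
    resp.raise_for_status()
    return b"".join(resp.iter_content(chunk_size=64 * 1024))
```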

> One of the key contention points we are facing is how to communicate large quantities of data between our different services.

This is an architectural problem:

  1. Decrease the need to transfer large amounts of data

    Ideally, each service has all of its data and needs no transfer just to serve requests. As an extension of this idea: if you do need to transfer data, think about redundancy (in a positive sense): does it make sense to keep the data redundantly in the places where it is needed? Think about how possible inconsistencies might harm your processes. No transfer is faster than no transfer at all.

  2. Decrease the size of the data itself

    Think about how you could compress your data, starting with actual compression algorithms and moving up to smarter data structures. The less that goes over the wire, the faster you are.
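
A minimal sketch of the compression step, using only the Python standard library; the payload and sizes are illustrative:

```python
# Compress a payload before sending it over the wire; the receiver
# reverses the step. gzip is in the standard library.
import gzip
import json

payload = json.dumps({"document_id": "doc-123", "body": "lorem ipsum " * 1000})

compressed = gzip.compress(payload.encode("utf-8"))
print(f"raw: {len(payload)} bytes, compressed: {len(compressed)} bytes")

# Receiving side: decompress and parse.
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
assert restored["document_id"] == "doc-123"
```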

OTHER TIPS

If the ID returned by your document store is the way documents are referenced throughout the system, then it makes sense for all services to accept that 'Document ID' in their API whenever they need to know which document to work with.

This does not necessarily couple the services more tightly than needed. Services that need to access documents have to talk to the document-store service anyway, and they need that ID to tell the store which document to access.
Services that don't access documents directly might need to pass the Document ID along, but to those services it is just an arbitrary string that creates no dependency.
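
A small sketch of that distinction: the hypothetical service below only carries the document ID and never dereferences it, so nothing in it depends on the document store (all names are made up for illustration):

```python
# A service that never reads documents treats the ID as an opaque string.
from dataclasses import dataclass

@dataclass
class InvoiceCreated:
    invoice_id: str
    document_id: str  # opaque reference; never dereferenced in this service

def to_notification(event: InvoiceCreated) -> dict:
    # The document ID is passed along unchanged, like any other string
    # field; no call to the document-store service is needed here.
    return {
        "subject": f"Invoice {event.invoice_id} created",
        "document_id": event.document_id,
    }
```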

Personally, I'd rather not use a separate document store service and document ID, but a URL to access the documents (with proper header authentication). With this approach, other services don't need to rely on the document service; they can just use the full URL to access the document. It also makes sense for scaling: as storage grows, you can use multiple document stores and simply hand out URLs pointing to the right one.

However, you might still need a service to upload a document and obtain its URL.
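
One common way to realize the URL idea (my assumption, not necessarily what the author had in mind) is a pre-signed URL; the sketch below uses S3 via boto3, with illustrative bucket and key names:

```python
# Upload a document, then hand out a time-limited URL that any service
# can use directly, without going through a document service.
import boto3

s3 = boto3.client("s3")

def upload_and_get_url(key: str, data: bytes, expires_seconds: int = 3600) -> str:
    """Store a document and return a pre-signed URL for reading it."""
    s3.put_object(Bucket="company-documents", Key=key, Body=data)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "company-documents", "Key": key},
        ExpiresIn=expires_seconds,
    )
```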

> Does anyone know how the rainbow unicorns (Netflix, Amazon, Google, etc.) handle large files / data exchange between their services?

Check out the Amazon S3 REST API spec: it appears they simply return the full object as bytes in the response body (see the Amazon S3 GetObject response documentation). There don't seem to be many other options when you are designing a microservice.
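
From the client side, "returning the full object in bytes" looks roughly like this with boto3 (bucket and key names are illustrative); streaming in chunks avoids loading a large document into memory at once:

```python
# Read a large S3 object in chunks rather than as one in-memory blob.
import boto3

s3 = boto3.client("s3")

def stream_document(key: str):
    """Yield an S3 object's bytes in 64 KiB chunks."""
    obj = s3.get_object(Bucket="company-documents", Key=key)
    for chunk in obj["Body"].iter_chunks(chunk_size=64 * 1024):
        yield chunk
```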

Licensed under: CC-BY-SA with attribution