How should data transfer between a client and a web API for normalized data be designed?

https://softwareengineering.stackexchange.com/questions/251519

04-10-2020

Question

I want to design an API backed by some database (doesn't really matter which, but to make the discussion more interesting, let's say it's Mongo - explanation below) which sends data to a client.

The database contains several types of records. Some of them reference other types of records.
It's not uncommon for a record to be referenced by several other records from different types. Thus the data on the DB is normalized.

What are the considerations in designing an API server which sends records to the client?

Two options come to mind (you're invited to suggest more or correct me on those):

1. The API is granular. Send normalized data. Let the client ask for more records based on what it receives. The client may have a cache, it may decide it doesn't need to ask the server for everything.
2. The API sends all the records the client might possibly need based on the requested data. Thus effectively denormalizing the data.

With option 1, the client may make more HTTP requests so it can have the complete data it needs. It means more network communication, which may make the total data transfer slower. The server is simpler though, and the client can selectively ask only for the records it doesn't already have.

With option 2, there are fewer HTTP requests, but we may send the client data it already has (maybe it already received, and cached, some of the records in a previous request). The server is more complicated, especially if the database is not an RDBMS: there are no joins in Mongo, so we have to query the DB more than once to get all the data.

Further assumptions:

The data changes every few days (2-3 times a week). So the client can potentially have a persistent cache.
The Mongo queries are a bit slow (millions of documents in each collection).
In each such session about 2MB of data will be sent to the client.

Solution

Option 1

While it is tempting to have a set of lightweight APIs that send the normalized data, there are some potential pitfalls with this approach.

First, the lightness of these APIs might end up being paid for with tight coupling between your normalized data model and your APIs. If you want to change either your data model or your APIs, you will have to change the other as well, or engineer a way to preserve the old behavior on the other side of the change. This complicates maintenance.

Second, you are correct to note that there will be lots of HTTP calls from the client to your APIs. Put yourself in the role of the implementor of the client and ask whether you really want to make all those API calls. With proper error checking and exception handling, it gets to be a lot of work.
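To make the client-side cost concrete, here is a minimal sketch of what an option 1 client has to do: follow references record by record and skip the ones it already has cached. The record shape (an "id" plus a "refs" list), the in-memory FAKE_SERVER, and the function names are all assumptions made for illustration; a real client would replace fake_fetch with an HTTP call and add the error handling mentioned above.

```python
from typing import Callable, Dict, List

Record = dict


def fetch_with_references(
    root_id: str,
    fetch: Callable[[str], Record],
    cache: Dict[str, Record],
) -> List[Record]:
    """Walk references starting at root_id, fetching only uncached records."""
    resolved: Dict[str, Record] = {}
    pending = [root_id]
    while pending:
        record_id = pending.pop()
        if record_id in resolved:
            continue  # already handled, also protects against reference cycles
        if record_id not in cache:
            # In a real client this is one HTTP round trip per cache miss.
            cache[record_id] = fetch(record_id)
        record = cache[record_id]
        resolved[record_id] = record
        pending.extend(record.get("refs", []))
    return list(resolved.values())


# Hypothetical in-memory stand-in for the API server.
FAKE_SERVER = {
    "order:1": {"id": "order:1", "refs": ["customer:7", "product:3"]},
    "customer:7": {"id": "customer:7", "refs": []},
    "product:3": {"id": "product:3", "refs": ["vendor:9"]},
    "vendor:9": {"id": "vendor:9", "refs": []},
}

calls: List[str] = []


def fake_fetch(record_id: str) -> Record:
    calls.append(record_id)  # record every simulated request
    return FAKE_SERVER[record_id]


# The client already holds customer:7 from an earlier session.
cache = {"customer:7": FAKE_SERVER["customer:7"]}
records = fetch_with_references("order:1", fake_fetch, cache)
```

In this sketch the cached customer record is never requested again, but every other record still costs its own request, and the requests are sequential because each one reveals the next set of references.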

Option 2

On the other hand, sometimes a "one call does it all" approach is the best one. The "send everything" approach of option 2 can be a lot easier for the client, and due to its denormalized nature, provides a de facto interface between client and server that should be straightforward to maintain as the two develop and evolve separately. The price of this is speed and size. As you note, it could take a while to assemble all that data and then transmit it, especially if only a small portion of it is needed. But don't forget, it will be a lot faster for the server to fetch all the data over the LAN in your data center than for the client to do so over the open internet.
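One way the server might assemble such a response without joins is to issue one batched lookup per collection and return each referenced record only once. The sketch below uses plain dictionaries as stand-ins for collections; the collection names, field names, and the find_many helper are assumptions for illustration, with find_many playing the role of a single batched query (for example a filter using Mongo's `$in` operator).

```python
from typing import Dict, Iterable, List

# Hypothetical collections standing in for the database.
ORDERS = {
    1: {"id": 1, "customer_id": 7, "product_ids": [3, 4]},
    2: {"id": 2, "customer_id": 7, "product_ids": [3]},
}
CUSTOMERS = {7: {"id": 7, "name": "Ada"}}
PRODUCTS = {3: {"id": 3, "name": "Widget"}, 4: {"id": 4, "name": "Gadget"}}

queries: List[str] = []  # log of simulated database round trips


def find_many(name: str, collection: Dict[int, dict], ids: Iterable[int]) -> List[dict]:
    """Stand-in for one batched query that fetches many ids at once."""
    queries.append(name)
    return [collection[i] for i in sorted(set(ids)) if i in collection]


def build_response(order_ids: List[int]) -> dict:
    """Assemble orders plus every record they reference, each record once."""
    orders = find_many("orders", ORDERS, order_ids)
    customer_ids = {order["customer_id"] for order in orders}
    product_ids = {pid for order in orders for pid in order["product_ids"]}
    return {
        "orders": orders,
        "customers": find_many("customers", CUSTOMERS, customer_ids),
        "products": find_many("products", PRODUCTS, product_ids),
    }


response = build_response([1, 2])
```

Here the two orders share a customer and a product, yet the response carries each of them once and the server makes three round trips in total, one per collection, regardless of how many orders were requested.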

Recommendation

I lean toward option 2, though I propose adding a dose of option 1 to strike a balance between simplicity and performance. If some data takes much longer than average to assemble, then see if you can leave it out of the main API and put it in a separate one. Remember, the goal is to make it easy on both the server and the client.
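A rough sketch of that balance follows: the main call returns the cheap sections inline and only points at the expensive one, unless the client explicitly asks for it. The `include` parameter, the "history" section, and the URL layout are hypothetical names chosen for this example, not part of any particular framework.

```python
from typing import Callable, Dict, FrozenSet


def build_summary(item_id: int) -> dict:
    """Cheap section: always returned inline."""
    return {"id": item_id, "title": f"Item {item_id}"}


def build_history(item_id: int) -> dict:
    """Hypothetical slow section, standing in for data that is costly to assemble."""
    return {"id": item_id, "events": ["created", "updated"]}


# Sections that are left out of the main response by default.
EXPENSIVE_SECTIONS: Dict[str, Callable[[int], dict]] = {"history": build_history}


def get_item(item_id: int, include: FrozenSet[str] = frozenset()) -> dict:
    body: dict = {"summary": build_summary(item_id), "links": {}}
    for name, builder in EXPENSIVE_SECTIONS.items():
        if name in include:
            body[name] = builder(item_id)  # client asked for it explicitly
        else:
            # Otherwise only tell the client where the separate API lives.
            body["links"][name] = f"/items/{item_id}/{name}"
    return body


default_response = get_item(42)
full_response = get_item(42, include=frozenset({"history"}))
```

The default call stays fast and small, while a client that needs the slow data can either follow the link or opt in up front and still get everything in one request.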

Caching Caveats

Since the data changes every few days, be careful about caching. Since the data is so big and expensive to assemble, caching is tempting and probably a good idea. However, since it is changing regularly, be sure to refresh the cache and take steps to force the client to fetch new data when it is available. Techniques like cache-busting API parameters, observation times, expiration times, and the like could be of service here.
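One further technique in the same family, not named above, is an HTTP conditional request: the server tags each response with an ETag, the client sends it back in If-None-Match, and the server answers 304 Not Modified when nothing has changed. The sketch below models only the decision logic in plain Python; handle_get and the payload shape are assumptions for illustration, and hashing the full payload is just one way to derive a tag (a data version number would avoid assembling the payload at all).

```python
import hashlib
import json
from typing import Optional, Tuple


def make_etag(payload: dict) -> str:
    """Derive a tag that changes whenever the payload content changes."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return '"' + hashlib.sha256(canonical).hexdigest() + '"'


def handle_get(payload: dict, if_none_match: Optional[str]) -> Tuple[int, Optional[dict], str]:
    """Return (status, body, etag); the body is omitted when the client is current."""
    etag = make_etag(payload)
    if if_none_match == etag:
        return 304, None, etag  # client copy is still valid, send nothing
    return 200, payload, etag


data_v1 = {"records": [1, 2, 3]}
status_first, body_first, etag_first = handle_get(data_v1, None)
status_repeat, body_repeat, _ = handle_get(data_v1, etag_first)

# A few days later the data changes, so the old tag no longer matches.
data_v2 = {"records": [1, 2, 3, 4]}
status_changed, body_changed, etag_changed = handle_get(data_v2, etag_first)
```

With this approach the client can keep a persistent cache and still pick up new data promptly, because a stale tag always results in a full response.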

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange