Question

In a few months I will start to write my bachelor-thesis. Although we only discussed the topic of my thesis very roughly, the main problem will be something like this:

A program written in C++ (more or less a HTTP-Server, but I guess it doesn't matter here) has to be executed to fulfill its task. There are several instances of this program running at the same time, and a loadbalancer takes care of equal distribution of http-requests between all instances. Every time the program's code is changed to enhance it, or to get rid of bugs, all instances have to be restarted. This can take up to 40 minutes, for one instance. As there are more than ten instances running, the restart process can take up to one work day. This is way to slow.

The presumed bottleneck is the access to the database during startup to load all necessary data (guess it will be a mysql-database). The idea of the teamleader to decrease the amount of time needed for the startup-process is to serialize the content of the database to a file, and read from this file instead of reading from the database. That would be my task. Of course the problem is to check if there is new data in the database, that is not in the file. I guess write processes are still applied to the database, not to the serialized file. My first idea is to use apache thrift for serialization and deserialization, as I already worked with it and it is fast, as far as I know (maybe i write some small python programm, to take care of this). However, I have some basic questions regarding this problem:

  • Is it a good solution to read from file instead of reading from database. Is there any chance this will save time?
  • Would thrift work well in this scenario, or is there some faster way for serialization/deserialization
  • As I am only reading, not writing, I don't have to take care of consistency, right?
  • Can you recommend some books or online literature that is worth to read regarding this topic.

If I'm missing Information, just ask. Thanks in advance. I just want to be well informed and prepared before I start with the thesis, this is why I ask.

Kind regards

Michael

Was it helpful?

Solution

Cache is king

As a general recommendation: Cache is king, but don't use files.

Cache? What cache?

The cache I'm talking about is of course an external cache. There are plenty of systems available, a lot of them are able to form a cache cluster with cached items spread across multiple machine's RAM. If you are doing it cleverly, the cost of serializing/deserializing into memory will make your algorithms shine, compared to the cost of grinding the database. And on top of that, you get nice features like TTL for cached data, a cache that persists even if your business logic crashes, and much more.

What about consistency?

As I am only reading, not writing, I don't have to take care of consistency, right?

Wrong. The issue is not, who writes to the database. It is about whether or not someone writes to the database, how often this happens, and how up-to-date your data need to be.

Even if you cache your data into a file as planned in your question, you have to be aware that this produces a redundant data duplicate, disconnected from the original data source. So the real question you have to answer (I can't do this for you) is, what the optimum update frequency should be. Do you need immediate updates in near-time? Is a certain time lag be acceptable?

This is exactly the purpose of the TTL (time to live) value that you can put onto your cached data. If you need more frequent updates, set a short TTL. If you are ok with updates in a slower frequency, set the TTL accordingly or have a scheduled task/thread/process running that does the update.

Ok, understood. Now what?

Check out Redis, or the "oldtimer" Memcached. You didn't say much about your platform, but there are Linux and Windows versions available for both (and especially on Windows you will have a lot more fun with Redis).

PS: Oh yes, Thrift serialization can be used for the serialization part.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top