Question

The application will continuously (approximately every second) collect the location of users and store them.

This data is structured. In a relational database, it would be stored as: | user | timestamp | latitude | longitude |

However, there is too much data. There will be 60 × 60 × 24 = 86,400 records per user, daily. Even with 1000 users, this means 86,400,000 records daily.

And it is not only 86,400,000 records daily. These records will also be processed, and the processed versions will be stored as well, so multiply that number by roughly 2.
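A back-of-envelope estimate based on the figures above (the 32-byte record size is an assumption, not from the question):

```python
# Storage estimate from the figures above.
# Assumed record layout: user id (8 bytes), timestamp (8), lat (8), lon (8).
RECORDS_PER_USER_PER_DAY = 60 * 60 * 24   # one sample per second
USERS = 1000
BYTES_PER_RECORD = 32                     # assumption, uncompressed

raw_records = RECORDS_PER_USER_PER_DAY * USERS
total_records = raw_records * 2           # raw + processed copies
daily_bytes = total_records * BYTES_PER_RECORD

print(raw_records)             # 86400000 raw records per day
print(daily_bytes / 1024**3)   # roughly 5 GiB per day at 32 bytes/record
```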

How I plan to use the data

Essentially, I plan to make coarser grained versions of location data for easier consumption. That is:

  1. Sort the received data with respect to timestamps.
  2. Iterating over this list in order, determine whether the location has changed significantly (by checking how much the latitude and longitude changed).
  3. Represent the insignificant location changes as a single entry in the output (hence the output is a coarser-grained version of the location data).
  4. Repeat this process on the output, requiring an even larger latitude and longitude change to count as significant. The output produced from the previous output will thus be even coarser grained.
  5. Repeat the whole process as many times as needed.
  6. Aggregate a range of resolutions and send them to users. Also, store all resolutions of the data for later consumption.
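Steps 1–4 above could be sketched as follows (the thresholds and record layout are hypothetical, chosen only for illustration):

```python
def coarsen(points, threshold):
    """Collapse consecutive points whose latitude and longitude both stay
    within `threshold` degrees of the last emitted point (steps 1-3)."""
    points = sorted(points, key=lambda p: p["ts"])  # step 1: sort by timestamp
    output = []
    for p in points:                                # step 2: iterate in order
        if output:
            last = output[-1]
            if (abs(p["lat"] - last["lat"]) < threshold and
                    abs(p["lon"] - last["lon"]) < threshold):
                continue                            # step 3: insignificant move
        output.append(p)
    return output

# Step 4: re-run on the previous output with a larger threshold
# to get progressively coarser resolutions.
points = [
    {"ts": 0, "lat": 10.0000, "lon": 20.0000},
    {"ts": 1, "lat": 10.0001, "lon": 20.0001},   # insignificant jitter
    {"ts": 2, "lat": 10.5000, "lon": 20.5000},   # significant move
]
level1 = coarsen(points, threshold=0.001)
level2 = coarsen(level1, threshold=1.0)
print(len(points), len(level1), len(level2))     # 3 2 1
```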

What should I use to store this data? Should I use a relational database or a NoSQL solution? What other things should I consider when designing this application?


Solution

Some alternatives for storing this data:

  1. Message queue (possibly distributed), like Apache Kafka

This will be optimized for writing and reading a stream of data. It is ideal for collecting data streams in an easy-to-process format, but it typically cannot be queried except by reading out the stream in its entirety. So it would serve either archival purposes, or as an intermediate step on the way to a processing layer.
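A minimal sketch of how a location sample might be encoded for such a stream (the topic name and the commented-out `kafka-python` client usage are assumptions, not part of the question):

```python
import json

def encode_location(user, ts, lat, lon):
    """Serialize one location sample for a stream like Kafka.
    Keying by user keeps each user's samples ordered within one partition."""
    key = user.encode("utf-8")
    value = json.dumps({"ts": ts, "lat": lat, "lon": lon}).encode("utf-8")
    return key, value

key, value = encode_location("alice", 1700000000, 10.0, 20.0)
# With the kafka-python client this pair could then be sent, e.g.:
#   KafkaProducer(bootstrap_servers="localhost:9092").send(
#       "locations", key=key, value=value)
print(key, json.loads(value))
```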

  2. Relational database(s)

You can just write to the database and, when the volume exceeds what the DB can handle, shard it (= have different subsets of the data sit on different database servers). Benefit: you can use a relational DB and don't have to learn anything new. Downside: all code dealing with the DB must be aware of which shard each piece of data lives on, and aggregated queries must be done in application software.
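The shard-routing step could be as simple as the following sketch (host names are hypothetical; a stable hash is used so routing does not change between runs):

```python
import hashlib

SHARDS = ["db0.example.internal", "db1.example.internal",
          "db2.example.internal", "db3.example.internal"]  # hypothetical hosts

def shard_for(user: str) -> str:
    """Route all of one user's rows to the same shard so per-user queries
    never cross servers. Uses SHA-256, not Python's per-process hash()."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

# Every DB access goes through this routing step; a query across all
# users has to fan out to every shard and merge in application code.
print(shard_for("alice"))
```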

  3. Distributed NoSQL database, like Cassandra.

You write your data to a distributed NoSQL database, and it will automatically shard the data for you. Cassandra lets you run queries across the cluster, requiring less application code to get the data back. Benefit: more naturally suited to large amounts of data. Downside: it requires specific expertise and a deep understanding of how these systems work to achieve good performance and make the data queryable according to your needs. NoSQL is not a magic performance fix; it is a set of trade-offs that must be understood to be navigated.
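In Cassandra, "making the data queryable" largely comes down to partition-key design. One common pattern for time series (an assumption here, not something the answer prescribes) is bucketing by user and day so no partition grows without bound; a sketch of computing such a bucket:

```python
from datetime import datetime, timezone

def partition_key(user: str, ts: int):
    """Bucket a user's samples by calendar day (UTC) so each partition is
    bounded: at 1 sample/s that is at most 86,400 rows per partition.
    In CQL this would correspond to PRIMARY KEY ((user, day), ts)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return (user, day)

print(partition_key("alice", 1700000000))   # ('alice', '2023-11-14')
```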

  4. Hadoop / file

The data is appended to files which are distributed automatically across servers by the Hadoop platform, processed on those servers using tools like MapReduce or Apache Spark, and finally queried (as files) using a Hadoop SQL engine like Hive or Impala.
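The file layout matters for the "query as files" part: Hive-style engines expect partitioned directories. A local sketch of appending samples into such a layout (directory scheme and CSV format are illustrative assumptions; on a real cluster the files would live in HDFS):

```python
import os
import tempfile
from datetime import datetime, timezone

def append_sample(base, user, ts, lat, lon):
    """Append one sample to a date-partitioned directory layout of the
    shape Hive/Impala can query, e.g. <base>/dt=2023-11-14/alice.csv."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    part = os.path.join(base, f"dt={day}")
    os.makedirs(part, exist_ok=True)
    with open(os.path.join(part, f"{user}.csv"), "a") as f:
        f.write(f"{user},{ts},{lat},{lon}\n")

base = tempfile.mkdtemp()
append_sample(base, "alice", 1700000000, 10.0, 20.0)
print(sorted(os.listdir(base)))   # ['dt=2023-11-14']
```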

Which to choose?

The trade-offs between these alternatives are complex, and they very much depend on both your write and your read patterns, so the only person who can decide on these trade-offs is you. If you lack the time to build up a deep understanding of these alternatives, then just use a relational DB and figure out a sharding solution as you go along. In all likelihood, YAGNI.

Other tips

Look into your requirements a little deeper. There is a way to create the illusion of tracking position every second.

If you have an app that knows your current GPS location and writes it to a database, why would you keep writing the location if it doesn't change? Even if you need the data, if the user has been asleep for 7 hours you can programmatically fill in the missing time slots with a duplicate location for your calculations, mapping, or whatever else you need to do.
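Filling in the missing slots could be sketched like this (a minimal forward-fill, assuming whole-second timestamps and records sorted by time):

```python
def fill_gaps(samples, upto_ts):
    """Reconstruct a per-second series from change-only records by
    carrying the last known position forward."""
    filled = []
    it = iter(samples)
    current = next(it)
    pending = next(it, None)
    for ts in range(current["ts"], upto_ts + 1):
        # Advance to the newest record at or before this second.
        while pending is not None and pending["ts"] <= ts:
            current, pending = pending, next(it, None)
        filled.append({"ts": ts, "lat": current["lat"], "lon": current["lon"]})
    return filled

# A user asleep from ts=0 to ts=4: store two records, reconstruct six slots.
stored = [{"ts": 0, "lat": 10.0, "lon": 20.0},
          {"ts": 5, "lat": 10.5, "lon": 20.5}]
series = fill_gaps(stored, upto_ts=5)
print(len(series))        # 6
print(series[3]["lat"])   # 10.0  (carried forward)
```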

If you do track the location every second, do you have to store those data forever? You can archive the records to another database to keep the current table from getting too large. Or you could even keep only the records where there is a position change. This is common in data warehouses.
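Keeping only the rows where the position actually changed could look like this sketch (record layout assumed as before):

```python
def changed_only(samples):
    """Keep a sample only when its position differs from the previously
    stored one; identical consecutive positions collapse to one row,
    in the spirit of a change log in a data warehouse."""
    stored = []
    for s in samples:
        if not stored or (s["lat"], s["lon"]) != (stored[-1]["lat"],
                                                  stored[-1]["lon"]):
            stored.append(s)
    return stored

# An hour parked in one spot, then one move: 3601 samples become 2 rows.
samples = [{"ts": t, "lat": 10.0, "lon": 20.0} for t in range(3600)]
samples.append({"ts": 3600, "lat": 10.1, "lon": 20.0})
print(len(changed_only(samples)))   # 2
```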

Your data is a set of time series: sets of numbers (two per user) that evolve over time. Typically you are NOT looking for any kind of relational storage, but rather a round-robin database (RRD). These storage engines focus heavily on reducing the I/O cost of numerous small writes by buffering them.

Relational storage is a heresy for this volume of time-series data. Be warned, however, that RRD tooling is not nearly as well supported for programmatic use as SQL is. You are probably looking at serious integration work, but that is hard to avoid given your requirements.
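The core idea an RRD applies, consolidating raw samples into coarser archives with a consolidation function, can be sketched in a few lines (a simplification: real RRDs also preallocate fixed-size archives and overwrite them in a ring):

```python
def consolidate(values, bucket, cf=lambda xs: sum(xs) / len(xs)):
    """RRD-style consolidation: reduce each `bucket` of raw samples to a
    single value with a consolidation function (here: AVERAGE)."""
    return [cf(values[i:i + bucket]) for i in range(0, len(values), bucket)]

lats = [10.0, 10.2, 10.4, 10.6, 10.8, 11.0]   # six 1-second samples
coarse = consolidate(lats, bucket=3)           # two 3-second averages
print(coarse)
```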

Licensed under: CC-BY-SA with attribution