Question

I ran into a problem on the project I'm working on right now. It is basically an app that stores a path based on GPS coordinates. The app runs on Android, saves the GPS location every second and then sends it to an API. I think that if I insert a record for every saved location, the table will grow very fast. For example, if I go on a run three times a week for 1 hour, there will be 10,800 new records per week; now imagine this with 1000 active users for a year...

Anyway, I've come up with an idea that I've never seen before, and I'm therefore not sure whether it is good:

I use a relational database (MySQL) to store users (and all other data except the recorded paths), and I have a table users_paths which links users to their recorded paths. The path itself is stored in a NoSQL (MongoDB) database in a document like this:

{
  _id: 3474348347389,
  waypoints: [
    {lat, long},
    {lat, long},
    ...
  ]
}

I haven't implemented it yet because it feels wrong and a bit like overkill to me. I have also thought about saving the recorded paths as JSON files, but I'm not happy with that solution either.

What do you think? Is this "the way to go" or am I completely wrong?


Solution

For example, if I go on a run three times a week for 1 hour, there will be 10,800 new records per week; now imagine this with 1000 active users for a year.

Well, let's not imagine, but actually estimate the data growth. Imagine each GPS coordinate is stored in two 32-bit variables (largely enough; you probably don't need more precision), i.e. 8 bytes of raw data per point. Three hours per week means 10,800 records, or about 84 KB of raw coordinates; allowing roughly 64 bytes per stored record for the timestamp, row header and index overhead, that is about 675 KB per user per week. For one thousand users, we obtain about 659 MB of data growth per week, or 2.6 GB per month, or 33.6 GB per year.

Therefore, it will take you about sixty years to fill a 2 TB hard disk.
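For the record, here is the same back-of-the-envelope estimate as a small script; the 64 bytes per record is an assumed figure for coordinates plus timestamp and storage overhead, so adjust it to your actual schema:

# Rough data-growth estimate; all figures are assumptions, tweak as needed.
SECONDS_PER_HOUR = 3600
HOURS_PER_WEEK = 3                   # three one-hour runs per week
BYTES_PER_RECORD = 64                # assumed: lat/long + timestamp + row/index overhead
USERS = 1000

records_per_user_week = HOURS_PER_WEEK * SECONDS_PER_HOUR        # 10,800
bytes_per_user_week = records_per_user_week * BYTES_PER_RECORD   # ~675 KB
bytes_per_year = bytes_per_user_week * USERS * 52                # ~33.5 GB

print(f"per user per week : {bytes_per_user_week / 1024:.0f} KB")
print(f"all users per year: {bytes_per_year / 1024**3:.1f} GB")
print(f"years to fill 2 TB: {2 * 1024**4 / bytes_per_year:.0f}")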

Back to your original question: with such a small set of data, the choice between an RDBMS and a non-relational database really doesn't matter. Pick the one you are familiar with.

Other tips

For your specific use case I would not use two different databases. Just save your users' paths as a geometry in your RDBMS (be it MySQL or Postgres). Modern relational databases support geospatial datatypes and allow comfortable access. This way you can do your geospatial analysis (like length of run, speed, intersections with other users, ...) in your database.
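As an illustration of what this could look like with PostGIS (table and column names here are invented, and the psycopg2 driver is just one possible choice), a recorded run becomes a single LineString whose length the database can compute itself:

# Sketch only: assumes PostgreSQL with the PostGIS extension and the psycopg2 driver;
# table and column names are invented for illustration.
import psycopg2

conn = psycopg2.connect("dbname=tracker")  # assumed connection settings
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS postgis")
cur.execute("""
    CREATE TABLE IF NOT EXISTS runs (
        id      bigserial PRIMARY KEY,
        user_id bigint NOT NULL,
        path    geometry(LineString, 4326)   -- WGS 84 lat/long
    )
""")

# One recorded run becomes a single LineString (WKT expects "long lat" order).
waypoints = [(8.5402, 47.3782), (8.5410, 47.3790), (8.5421, 47.3801)]
wkt = "LINESTRING(" + ", ".join(f"{lng} {lat}" for lng, lat in waypoints) + ")"
cur.execute(
    "INSERT INTO runs (user_id, path) VALUES (%s, ST_GeomFromText(%s, 4326))",
    (42, wkt),
)

# Length of the run in metres, computed by the database itself.
cur.execute("SELECT id, ST_Length(path::geography) FROM runs WHERE user_id = %s", (42,))
print(cur.fetchall())
conn.commit()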

Trying to create your own custom geometry datatype (i.e. defining some kind of document schema in MongoDB) would be a classic example of "reinventing the wheel". All the major database engines are pretty good at storing, querying and manipulating geometries and geodata.

Have a look at PostGIS or Spatial Extensions for MySQL if you want to use Open Source databases. Both Oracle and MSSQL also support spatial data.

Doing it this way will also allow you to use your data with standard tooling, e.g. exposing it via WMS or WFS or feeding any other kind of spatial rendering.

The only way to know for sure is to implement both and measure during a load test.

But intuitively, I think that tinkering with two different databases can't be a good idea, because neither DBMS can do any global optimization across related data accesses. It is overkill and will not improve performance that much.

Either put all your data in MongoDB or all your data in the RDBMS. Your MongoDB model is just fine. For the RDBMS, you could use a waypoints table. The numbers you quote are not a problem for retrieval; RDBMSs are designed to mass-process such data.
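A minimal sketch of such a waypoints table; SQLite is used here only so the example is self-contained, but the schema carries over to MySQL directly:

# Sketch of a plain waypoints table; names are invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE waypoints (
        user_id     INTEGER NOT NULL,
        recorded_at INTEGER NOT NULL,   -- unix timestamp, one row per second
        lat         REAL NOT NULL,
        lng         REAL NOT NULL,
        PRIMARY KEY (user_id, recorded_at)
    )
""")

# One insert per received location fix.
db.execute("INSERT INTO waypoints VALUES (?, ?, ?, ?)", (42, 1700000000, 47.3782, 8.5402))

# Retrieving one user's run is a simple range query.
rows = db.execute(
    "SELECT recorded_at, lat, lng FROM waypoints "
    "WHERE user_id = ? AND recorded_at BETWEEN ? AND ? ORDER BY recorded_at",
    (42, 1700000000, 1700003600),
).fetchall()
print(len(rows))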

If, in the RDBMS scenario, you don't need database access to individual points of a path, you could choose to store the full path as a single blob (a binary encoding of your GPS coordinate stream), which spares the database from interpreting that mass of data on every row fetch. The blob would then be a black box for the RDBMS; your app would handle it to render the path graphically or to compute attributes such as distance, speed, or segment speed.
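As a sketch of what such a blob could look like (the 2 × 32-bit float layout is an arbitrary assumption; a real app might add timestamps or use fixed-point integers for more precision):

# Sketch: pack a coordinate stream into a compact blob and read it back.
# The 2 x 32-bit float layout is an assumption, not a recommendation.
import struct

POINT = struct.Struct("<ff")  # little-endian (lat, lng), 8 bytes per point

def pack_path(waypoints):
    """waypoints: iterable of (lat, lng) tuples -> bytes blob for a BLOB column."""
    return b"".join(POINT.pack(lat, lng) for lat, lng in waypoints)

def unpack_path(blob):
    """bytes blob -> list of (lat, lng) tuples."""
    return [POINT.unpack_from(blob, i) for i in range(0, len(blob), POINT.size)]

path = [(47.3782, 8.5402), (47.3790, 8.5410), (47.3801, 8.5421)]
blob = pack_path(path)
assert len(blob) == 8 * len(path)
print(unpack_path(blob))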

Note that if you do intend to query individual points of the path in the database (for example to see whether two runners use a similar path or could cross each other), then, depending on GPS resolution and precision, a single point might not be sufficient anyway. In that case you would be better off with a database engine that supports geospatial queries and indexes (e.g. MongoDB or Aerospike).
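For example, here is a rough sketch of a geospatial index and proximity query with MongoDB's pymongo driver; collection and field names are invented, and it assumes each point is stored as GeoJSON:

# Sketch only: assumes a running MongoDB instance and points stored as GeoJSON.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
points = client.tracker.points

points.create_index([("loc", GEOSPHERE)])  # 2dsphere index for geospatial queries

points.insert_one({
    "user_id": 42,
    "recorded_at": 1700000000,
    "loc": {"type": "Point", "coordinates": [8.5402, 47.3782]},  # GeoJSON: [lng, lat]
})

# Find other users' points within 25 metres of a given location.
nearby = points.find({
    "user_id": {"$ne": 42},
    "loc": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [8.5402, 47.3782]},
            "$maxDistance": 25,
        }
    },
})
print(list(nearby))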

While the other answers showed that this amount of data will not fill a 2 TB disk any time soon, you should also consider the speed of access to that data. If you use traditional spinning hard disks rather than SSDs, they can do only about 100 random accesses per second (perhaps a little less for 7200 RPM desktop disks and a little more for 10,000-15,000 RPM enterprise disks).

Typical relational databases store the information in a flat file, meaning that with 1000 active users you get the following on-disk layout:

user0data0 user1data0 user2data0 ... user999data0 user0data1 user1data1 user2data1 ... user999data1 ...

so fetching many data points belonging to one user means one random access per data point.

Now, if your entire dataset doesn't fit into memory (typical servers have 32-64 GB of memory, and you will fill that amount in 1-2 years), then fetching, say, the last day's data points for a random user takes 86,400 random accesses, which is 864 seconds, or over 14 minutes. Can you wait 14 minutes for the last day's data? Probably not.

What if you store a given user's data points inside a single document, such as a file? The contents of a file are typically stored consecutively on disk (some newer file systems such as ZFS and btrfs break this assumption, but let's assume a traditional file system such as ext2, ext3 or ext4).

Now, fetching 86,400 data points requires one random disk seek plus a sequential scan of 86,400 data points. At 8 bytes per data point that is 0.66 megabytes, which takes about 7 milliseconds to read at roughly 100 MB/s (assuming you read only the data points of interest and not the whole document). Added to the 10-millisecond seek, that is 17 milliseconds. If you read the whole document for a year, it's about a 2.6-second sequential scan plus a 10-millisecond seek, which may be a problem.
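The comparison is easy to redo with a few assumed figures (10 ms per random seek, 100 MB/s sequential throughput, 8 bytes per point):

# Back-of-the-envelope comparison of the two access patterns.
# All hardware figures are assumptions typical of a spinning disk.
SEEK_TIME = 0.010          # seconds per random access
SEQ_THROUGHPUT = 100e6     # bytes per second, sequential read
BYTES_PER_POINT = 8
POINTS_PER_DAY = 86_400    # one point per second

# One row per point, scattered across the table: one seek per point.
row_per_point = POINTS_PER_DAY * SEEK_TIME
# One contiguous document per user: one seek plus a sequential scan.
one_document = SEEK_TIME + POINTS_PER_DAY * BYTES_PER_POINT / SEQ_THROUGHPUT

print(f"one row per point : {row_per_point:8.1f} s")   # ~864 s (~14 minutes)
print(f"single document   : {one_document:8.3f} s")    # ~0.017 s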

So, I would consider breaking the documents into smaller pieces: one document per day per user.

So, in summary, SQL databases are not the right technology for storing the location data itself. Your idea is good, but it might require refinement (breaking the large per-user document into smaller per-day pieces).

Whatever you do, please implement tests that populate the database with data in the same order in which it would arrive in a real system. Then run random queries against that data, e.g. fetching the path of a random user on a random day, and measure the performance of those queries.
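A rough sketch of such a test harness; SQLite and the tiny volumes are placeholders, the point is only to load in arrival order and then time random per-user, per-day queries:

# Sketch of a load-then-measure test; SQLite keeps it self-contained,
# but the same approach applies to MySQL or MongoDB. Volumes are scaled down.
import random
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE waypoints (user_id INT, recorded_at INT, lat REAL, lng REAL)")

USERS, DAYS, POINTS_PER_DAY = 20, 7, 360   # tiny placeholder volumes, scale up for a real test

# Populate in arrival order: at each tick, one point per active user.
for day in range(DAYS):
    for tick in range(POINTS_PER_DAY):
        ts = day * 86_400 + tick * 240   # scaled down: one point every 4 minutes
        db.executemany(
            "INSERT INTO waypoints VALUES (?, ?, ?, ?)",
            [(u, ts, 47.0 + random.random(), 8.0 + random.random()) for u in range(USERS)],
        )

# Measure random "one user, one day" queries.
start = time.perf_counter()
for _ in range(100):
    u, day = random.randrange(USERS), random.randrange(DAYS)
    db.execute(
        "SELECT lat, lng FROM waypoints WHERE user_id = ? AND recorded_at BETWEEN ? AND ?",
        (u, day * 86_400, day * 86_400 + 86_399),
    ).fetchall()
print(f"{(time.perf_counter() - start) / 100 * 1000:.1f} ms per query")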

Edit: There may be some database-dependent technologies that can reduce the overhead of random accesses. For example, recent versions of PostgreSQL support index-only scans: if you create an index that contains all the columns a query touches, the query can be answered from the index alone. MySQL's InnoDB engine uses clustered indexes, where the data is stored in the primary-key index instead of a flat file. However, by relying on these technologies you make your program's performance depend on internal implementation details of the database. Do you want that? If you are certain that you will never switch to a database that lacks these features, you can obtain acceptable performance with them. But if you want to keep your program database-independent, storing the location data elsewhere is a good idea.
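As a sketch of the DDL each of those approaches implies (the table mirrors the waypoints example sketched earlier; everything else is illustrative):

# Illustrative DDL only; names are invented and nothing is executed here.

# PostgreSQL (11+): a covering index so "user_id + time range -> lat/lng"
# queries can be answered by an index-only scan.
POSTGRES_COVERING_INDEX = """
    CREATE INDEX waypoints_user_time
        ON waypoints (user_id, recorded_at)
        INCLUDE (lat, lng)
"""

# MySQL/InnoDB: the table is clustered on its primary key, so choosing
# (user_id, recorded_at) as the PK keeps one user's points physically together.
MYSQL_CLUSTERED_PK = """
    CREATE TABLE waypoints (
        user_id     BIGINT NOT NULL,
        recorded_at BIGINT NOT NULL,
        lat         DOUBLE NOT NULL,
        lng         DOUBLE NOT NULL,
        PRIMARY KEY (user_id, recorded_at)
    ) ENGINE=InnoDB
"""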

Licensed under: CC-BY-SA with attribution