Question

I am thinking of redeveloping an application that stores and visualises sensor data. The application is accessible by multiple users, and they can add an unlimited number of sensors. We are 10 users right now and we have about 1000 sensors. The number of users will probably not change. The sensors send data every 5 seconds if there is enough power (they run on solar panels).

Right now data is stored in 4 tables.

  • User [id, email, password, etc..]
  • Device [id, name, user_id] user_id: foreign key
  • Sensor [id, device_id, type] device_id: foreign key
  • Data [id, sensor_id, time, data] sensor_id: foreign key

The problem is that the Data table gets extremely big. The goal is to retain data for almost a year. I use MySQL and I am very disappointed with its performance. Right now I am using Flask with Gunicorn, and I use RabbitMQ to queue the storing procedure. Is there anything I can change to increase the performance of the existing system? If you started this from scratch, what changes would you make? Will NoSQL make a big difference in this situation? I am asking a lot, but it's the first time I face this kind of problem.


Solution

  1. As you have 1k sensors and each generates data every 5 seconds, this seems to me like a nice case for a framework like Akka, so you can process many incoming requests concurrently without the overhead of one thread per connection.

  2. Once the processing stage looks optimized, you are right to consider NoSQL. People in the comments mentioned a missing index, but because all readings sit in a single table, every insert has to update an index that covers the entire data set. This can kill your app's throughput.

    You have a lot of options for solving this. Split the data so that the newest readings live in their own table, or use two tables: one optimized for reads and queries, and a second one for writes, with periodic bulk inserts from the write table into the read table. This is fast because the write table can get by with few or no indexes. It is a well-known trade-off: you can optimize your storage for heavy reading or for heavy writing, but not for both.

    Or you can look at NoSQL; Redis in particular comes to mind. Have a look at its data types: http://redis.io/topics/data-types-intro

    Redis handles very long lists natively. Since it does not support querying in the sense of SELECT ... FROM ... WHERE ..., you would have to maintain your own indexes and caches to serve the queries you need. If you are interested in how to use a key-value store this way, look at the Redis Twitter demo; Twitter has to solve the same kind of problem you do.
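To make the Redis suggestion above more concrete, here is a minimal sketch of one possible layout, assuming the redis-py client and one sorted set per sensor scored by timestamp. The key names and the one-year cut-off are illustrative, not a prescription:

    import time
    import redis

    # Placeholder connection details; adjust for your environment.
    r = redis.Redis(host="localhost", port=6379, db=0)

    def store_reading(sensor_id, value, ts=None):
        """Append one reading to a per-sensor sorted set, scored by timestamp."""
        ts = time.time() if ts is None else ts
        key = f"sensor:{sensor_id}:data"
        # Members must be unique, so the timestamp is baked into the payload.
        r.zadd(key, {f"{ts}:{value}": ts})
        # Rough retention: drop everything older than about a year.
        r.zremrangebyscore(key, "-inf", ts - 365 * 24 * 3600)

    def readings_between(sensor_id, start, end):
        """Time-range query: the sorted set plays the role of WHERE time BETWEEN."""
        return r.zrangebyscore(f"sensor:{sensor_id}:data", start, end)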

This brings me to my last point. If you want better scalability and you do not know how to get it, look at how Facebook, Twitter, or Netflix architect their systems.

OTHER TIPS

As Martin Podval said, you should look at NoSQL, but in the meantime there are a few tricks you can try. First, start partitioning your data into multiple tables. Depending on the time range most often queried, you can use one table per week or one table per month. For longer time ranges you will then have to query several tables and combine the results (a small map & reduce job), but multiple queries on smaller tables will prove to be faster than a single query on one large table.
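As an illustration of that partition-and-combine idea, here is a rough sketch that routes a time-range query across hypothetical per-month tables (data_2023_01, data_2023_02, ...); the table naming and columns are assumptions based on the question's Data table:

    from datetime import date

    def month_tables(start, end):
        """Yield the per-month table names covering [start, end]."""
        y, m = start.year, start.month
        while (y, m) <= (end.year, end.month):
            yield f"data_{y}_{m:02d}"
            m += 1
            if m > 12:
                y, m = y + 1, 1

    def range_query_sql(start, end):
        """Build one UNION ALL over the monthly tables - the 'small map & reduce'.

        The caller supplies (sensor_id, start, end) once per table when executing.
        """
        parts = [
            f"SELECT time, data FROM {t} "
            f"WHERE sensor_id = %s AND time BETWEEN %s AND %s"
            for t in month_tables(start, end)
        ]
        return " UNION ALL ".join(parts)

If you stay on MySQL, its native PARTITION BY RANGE on the time column achieves much the same effect without managing the tables by hand.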

The second trick is to optimize your indexing for the tables, and avoid JOIN operations at all costs.
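For example (assuming the Data table from the question), a composite index that matches the typical access pattern lets MySQL seek straight to one sensor's time slice instead of scanning, and copying device_id onto Data is one possible way to skip the JOIN on hot queries. Shown here as plain DDL strings you would run once per data table; the denormalized column is an illustration, not part of the original schema:

    # Matches queries of the form:
    #   WHERE sensor_id = ? AND time BETWEEN ? AND ? ORDER BY time
    CREATE_INDEX = "CREATE INDEX idx_data_sensor_time ON Data (sensor_id, time)"

    # Optional denormalization to avoid joining Sensor/Device at read time.
    ADD_DEVICE_ID = "ALTER TABLE Data ADD COLUMN device_id INT NULL"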

Lastly, you can add caching. It is a really old trick and it is debated a lot, but with 10 users looking at 1000 sensors over a year, there is a good chance they will request the same data more than once.
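A sketch of the caching idea using simple in-process memoization; fetch_readings is a hypothetical stand-in for your existing query, and with several Gunicorn workers you would want a shared cache (Redis, memcached, Flask-Caching) rather than a per-process one:

    from functools import lru_cache

    def fetch_readings(sensor_id, start_ts, end_ts):
        """Placeholder for the real query against MySQL/Redis/etc."""
        return []

    @lru_cache(maxsize=1024)
    def cached_readings(sensor_id, start_ts, end_ts):
        # Only cache closed time ranges; an open-ended "last hour" query
        # would otherwise keep serving stale data.
        return tuple(fetch_readings(sensor_id, start_ts, end_ts))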

I think the best solution is not just to switch to NoSQL, but to go distributed; even with cheap servers you will get better performance. Doing the math, you should have about 6.3 billion records for one year. No matter how fast a single machine is, or what storage system it uses, it takes a long time to read that much data even from memory.
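The 6.3 billion figure checks out; the arithmetic looks like this:

    sensors = 1000
    interval_s = 5
    per_day = sensors * (24 * 3600 // interval_s)   # 17,280,000 readings/day
    per_year = per_day * 365                        # 6,307,200,000 readings/year
    print(per_year)                                 # roughly 6.3 billion rows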

No discussion about telemetry data would be complete without discussing solutions that have already been proven in industry.

HDF5 is one such solution. HDF5 is a data model, library, and file format for storing and managing telemetry data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
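For a feel of what this looks like in practice, here is a minimal sketch using the h5py package, assuming one resizable dataset per sensor holding (timestamp, value) pairs; the file name and layout are illustrative:

    import h5py
    import numpy as np

    with h5py.File("telemetry.h5", "a") as f:
        name = "sensor_0001"
        if name not in f:
            # Chunked + compressed, growable along the time axis.
            dset = f.create_dataset(
                name, shape=(0, 2), maxshape=(None, 2),
                dtype="f8", chunks=(4096, 2), compression="gzip",
            )
        else:
            dset = f[name]

        # Append a batch of (timestamp, value) samples, e.g. one RabbitMQ drain.
        batch = np.array([[1700000000.0, 21.5], [1700000005.0, 21.6]])
        dset.resize(dset.shape[0] + len(batch), axis=0)
        dset[-len(batch):] = batch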

SQL Server has a FILESTREAM storage feature that is well suited to handling large telemetry data sets. McLaren Electronic Systems uses it to collect telemetry data from Formula One race cars.

Further Reading

  • Programming with FILESTREAM
  • McLaren Case Study

The database landscape has changed a lot since you asked, but the question remains valid, perhaps even more so today. To summarize, your needs seem to be the following:

  1. Store large amounts of sensor / time series data.
  2. Query that data fast.
  3. Easily/automatically drop data after a given period: a data retention policy.

It looks like you need a database optimized for sensor/IoT/time series data. According to DB-Engines, time series databases have gained the most traction of any category over the last 2 years. It is worth trying them, as they are built for exactly this kind of data. Some notable ones to look at:

  1. InfluxDB: open source time series database for sensors and systems.
  2. GridDB: open source NoSQL time series database for IoT and big data.
  3. CrateDB: distributed SQL for industrial IoT.

These databases are all designed to store time series/IoT data with fast ingest and querying, and they have built-in data retention functionality.
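To give a flavour, this is roughly what writing a single reading to InfluxDB looks like with its 2.x Python client (influxdb-client); the URL, token, org, and bucket names are placeholders, and retention is configured on the bucket itself so old points age out without any DELETE job on your side:

    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
    write_api = client.write_api(write_options=SYNCHRONOUS)

    point = (
        Point("reading")
        .tag("sensor_id", "42")   # tags are indexed: cheap per-sensor filtering
        .field("value", 21.5)     # the measured value itself
    )
    write_api.write(bucket="telemetry", record=point)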

For example, with GridDB your data model would be something like this:

  • A users container with fields user_id, email, password, etc.
  • A devices container per user with key "DEVICES_$userID" and fields deviceId and a sensorIDs array.
  • A sensor container per sensor with key "SENSOR_$sensorID" and fields time and data.

Containers work much like tables, but they partition the data naturally, without any tricks, so you can quickly query an individual sensor or a set of sensors. Since each sensor's data is stored in its own container, you never end up with one bloated table.

Licensed under: CC-BY-SA with attribution