Domanda

I have a general question about design pattern for an enterprise application. I read a lot about it but its actually hard to find an answer because most you find it rater about how to design a data warehouse (DW) or about how to design pipeline, ETL and so on. But I'm stucking on a more top-layer question.

Setup

I have a IoT base business model basically like this:

  • Project
    • Location
    • Customer
    • [...]
    • DataSource
      • Device -ValueType
        • Data (not include in the relational database)

Its currently persisted by a relational database, where we use a lot of features like spatial types, hierarchyids, json types and more that are not available in a DW.

Effort taken

We have a very well-designed data pipeline to solve the ETL process. Works great and a lot of data is coming in.

For storing mass data I now started to design a data warehouse model (snowflake) to allow effective storing into a DW.

Looks currently like this:

Snowflake schema

Question

My actual question is more in general about where to put what data. I currently have the idea of keeping the relational database as it is and creating a separate DW with the schema I shared. From our business logic (API service) we need then to receive data from both storages to give valid results to a (for example) web application. From a data science perspective you can use the DW for doing ML, BI and other analytics stuff.

Question:

  1. Is this really common practice to have storages side-by-side like this?
  2. Do I need to stored device data also somehow inside the relational database?
  3. What is happening (on separated storage) to the loose related DIM tables inside the warehouse when an entity like location changed inside the relational database?
  4. Am I still a too big dummy and should I read more? (then ignore 1 -3) ;)

Thanks for reading!

È stato utile?

Soluzione

Is this really common practice to have storages side-by-side like this?

Absolutely.

For many businesses, the relational DB is used for "online transaction processing" (OLTP), which holds the "current live data". The data warehouse infrastructure however is founded on things like "OLAP cubes", which may utilize different DBMS products and/or technologies optimized for analytics. The content of the data warehouse is usually based on a frozen snapshot of the OLTP database or other sources (daily, weekly, monthly, depending ont the specific requirements and the available resources.)

From what you wrote, it appears to me your system is designed exactly that way.

Do I need to stored device data also somehow inside the relational database?

Not necessarily, this depends on if your business requires that data directly inside any OLTP process, or not. The snapshot for the OLAP cubes may be filled from more than just the relational DB, and if you have device data stored in some kind of protocol files outside the relational DB, the ETL may be designed in a way it uses both sources.

What is happening (on separated storage) to the loose related DIM tables inside the warehouse when an entity like location changed inside the relational database?

Then there is a temporary inconsistency between your data warehouse and your "life" database (until the next snapshot is created). Fullstop.

Such scenarios will happen - and that is ok. The data inside the OLTP and the OLAP system represent slightly different points in time, so this is quite normal.

If you have applications which query both systems at the same time, those applications have to expect such temporary inconsistencies and deal with them. How that has to be done precisely depends heavily on the specific case, there are several ways to solve this, for example

  • do nothing and let the user deal with the differences

  • forbidding "changes" of existing entities in dimensional data in the OLTP DB, only inserts, so they form at least a superset of the dimensions in the OLAP system

  • initiating a new snapshot earlier to recreate a new cube faster after a change in dimensional data (and maybe block access to the OLTP for a short period)

  • for cases of correcting some data more quickly, having an process which updates both systems without initiating a full snapshot

There is no silver bullet to this, you have to analyse the gory details of the business requirements here and design the system accordingly.

Altri suggerimenti

  1. Yes, it is very common to have a separate database for a data warehouse. There are two main drivers behind that:

    1. Protecting the performance and throughput of the transactional system since heavy user queries could affect its performance;
    2. The need for a different data schema for that is not necessarily aligned with the best fit for your transactional data.
  2. No, device data does not necessarily need to be stored in your database. But in the end it depends on your business and analytical needs:

    1. Your pipeline may very well aggregate the relevant data from the device, so that the details are no longer needed. That's by the way one of the ideas behind edge computing.
    2. For your datawarehouse, the aggregates could also be sufficient. But depending on the domain, detailed data could be useful. Especially if you want to make correlations in order to elaborate a predictive model (e.g. short term variations in the measurements that could be related to a subsequent defect, if your IoT is a sensor on some industrial equipment). You may also want to do other aggregations than what you need for your daily operational business.
    3. But the accumulated raw data of hundreds of thousands IoT devices sending data several times per seconds might simply be too much. Maybe the detailed data makes sense only over a shorter period.
  3. WHat happens to your dIM tables in case of master data changes depends on the needs. A popular approach is to keep datawarehouse data historical because new locations may be added, old locations may be dropped, but the DW needs the accurate historial truth.

  4. I don't think so. Your initial thoughts are already very interesting; BigData is an area of research. The information you are looking for might not be in a book yet.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
scroll top