Domanda

I'm trying to figure out the difference (between tools/services/programs) between Data Warehouse, Clustered Data Processing and the tools/infrastructure for querying a Data Warehouse

So Let's say I have the following setup to perform some data processing for a certain use case

Hadoop Cluster for Distributed Data processing
Hive for providing infrastructure and Functions for querying data from a data warehouse
My data sitting in an RDBMS or a NoSQL database

In the above example, what exactly is the Data Warehouse? My naive brain thinks that it is the RDBMS or the NoSQL database in the above context is the Data warehouse. But by definition, isn't a Data warehouse a database used for reporting and data analysis? (Definition shamelessly stolen from Wikipedia). So can I call a traditional RDBMS/NoSQL database a Data Warehouse? Thanks.

È stato utile?

Soluzione

You cannot call every relational database system a data warehouse, since one of data warehouses main feature is to aggregate data from multiple databases (with different schemas). It is usually done with a "star schema" allowing to combine multiple dimensions and multiple granularities.

Because NoSQL database systems (graph-based or map-reduce-based) are schema-less they can indeed store data from different schemas. Moreover Map-Reduce can be used to aggregate data with different granularities (e.g. aggregate daily data to compare them with monthly data).

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top