Question

I have a dozen databases that store different data, and each of them is about 100 TB in size. All of the data is stored in AWS services such as RDS, Aurora, and DynamoDB.

I often need to perform "joins" across databases, for example when a student ID appears in multiple databases and I want to gather all the data associated with it. Because the data is not located in the same database, the joins are usually done after the data has been streamed out of each database, and this sometimes takes hours for just a few thousand records.
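To make the current approach concrete, the client-side join is roughly equivalent to the sketch below. The function name, field names, and sample rows are made up for illustration; in practice the rows are streamed from the different databases rather than held in small in-memory lists.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Join two iterables of dicts on `key` using an in-memory hash index."""
    # Build an index over one side so each lookup is O(1) on average.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    # Probe the index with the other side; unmatched rows are dropped (inner join).
    for row in left_rows:
        for match in index.get(row[key], []):
            yield {**match, **row}


# Hypothetical rows, standing in for data streamed out of two separate databases.
enrollment = [
    {"student_id": 1, "course": "math"},
    {"student_id": 2, "course": "art"},
]
grades = [
    {"student_id": 1, "grade": "A"},
    {"student_id": 3, "grade": "B"},
]

joined = list(hash_join(enrollment, grades, "student_id"))
print(joined)
```

The join itself is cheap here; the cost in my case comes from pulling the rows out of each database over the network before any matching can happen.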

Can services such as Amazon Redshift or Google BigQuery "import" data from many data sources so that I can then run SQL queries that join them?

What about Hadoop and Hive, where we stream data out of the databases, store it as files in Hadoop, and let Hive query the data?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange