Question

I'm currently working on a project that uses Hadoop; we are at the very beginning of it.

So first, I have ~50 tables from a relational database. We extracted them and exported them to HDFS. Now we want to de-normalize the reference data into "big tables" (so that we end up with only 3-4 files). I think I will use MapReduce to do the job. I know how I could do it with small tables, but with the big ones...

For example, I have a table "Ticket" with millions of entries, and it joins to a table "Lign" with 15 billion entries. I need to denormalize them.

My question is: is there a recommended method or best practice for this?

Thanks in advance, Angelik


Solution

Writing the joins to perform the denormalization in raw MapReduce is going to be a time-consuming process that is probably not worth the effort, given the other tools that are almost certainly available on your Hadoop cluster.

Since you already have the DDL for the tables and the data is structured, the best approach I can recommend is to use Hive instead of raw MapReduce. You express the joins in HiveQL and Hive generates and runs the MapReduce jobs for you, which will save you a lot of time and trouble.
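
For illustration, a rough sketch of what that could look like in Hive, assuming the exported files are delimited text on HDFS and that Ticket and Lign share a join key such as ticket_id (the paths, column names, and delimiter below are hypothetical placeholders, not your actual schema):

    -- Expose the extracted HDFS files as external Hive tables
    -- (paths, columns, and delimiter are placeholders; adjust to your export)
    CREATE EXTERNAL TABLE ticket (
      ticket_id BIGINT,
      customer  STRING,
      created   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/export/ticket';

    CREATE EXTERNAL TABLE lign (
      lign_id   BIGINT,
      ticket_id BIGINT,
      amount    DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/export/lign';

    -- Build the denormalized "big table" with a single join;
    -- Hive compiles this statement into MapReduce jobs for you
    CREATE TABLE ticket_denorm AS
    SELECT l.lign_id, l.amount, t.ticket_id, t.customer, t.created
    FROM lign l
    JOIN ticket t ON t.ticket_id = l.ticket_id;

Since Ticket is much smaller than Lign, it may also be worth letting Hive convert this to a map-side join (hive.auto.convert.join) if the smaller side fits in memory, or bucketing both tables on the join key, but measure on your own data first.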

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow