Question

Background:

I have a Hive table "log" which contains log information. This table is loaded with new log data every hour. I want to do some quick analytics on the logs for the past 2 days, so I want to extract the last 48 hours of data into my relational database.

To solve the above problem I have created a staging Hive table which is loaded by a Hive SQL query. After loading the new data into the staging table, I load the new logs into the relational database using a Sqoop query.

The problem is that Sqoop loads data into the relational database in batches, so at any particular time I have only partial logs for a particular hour.

This is leading to erroneous analytics output.

Questions:

1) How can I make this Sqoop data load transactional, i.e. either all records are exported or none are?

2) What is the best way to build this data pipeline, covering the whole Hive table -> staging table -> relational table process?

Technical Details:

Hadoop - 1.0.4
Hive - 0.9.0
Sqoop - 1.4.2


Solution

You should be able to do this with Sqoop by using the option called --staging-table. The table you name there acts as an auxiliary table used to stage the exported data; the staged rows are then moved to the destination table in a single transaction. By doing this, you shouldn't have consistency issues from partial data.

(source: Sqoop documentation)
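A minimal sketch of such an export (the JDBC URL, credentials, table names and HDFS path below are placeholders; the staging table must already exist in the target database with the same schema as the destination table):

# hypothetical example: adjust the connection string, credentials, table names and export directory
# log_recent_stage must already exist in the target DB with the same schema as log_recent
sqoop export \
  --connect jdbc:mysql://dbhost/analytics \
  --username sqoop_user -P \
  --table log_recent \
  --staging-table log_recent_stage \
  --clear-staging-table \
  --export-dir /user/hive/warehouse/log_staging \
  --input-fields-terminated-by '\001'

--clear-staging-table empties the staging table before each run, and '\001' (Ctrl-A) is Hive's default field delimiter, so adjust it if your staging table's files use a different one.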

OTHER TIPS

Hive and Hadoop are great technologies because they run your analytics inside MapReduce tasks, spreading the work across multiple nodes and making it very fast.

Use that to your benefit. First of all, partition your Hive table. I assume you currently store all logs in a single Hive table, so when you run your queries with something like

SQL .... WHERE LOG_DATA > '17/10/2013 00:00:00'

you effectively scan all the data you have collected so far. If instead you use partitions - say one per day - you can write in your query

WHERE p_date=20131017 OR p_date=20131016

Hive knows the table is partitioned and will read only those two partitions. So even with, say, 10 GB of logs per day, a Hive query should finish in a few seconds on a decent Hadoop cluster.
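As a rough sketch (the table and column names here are made up for illustration; Hive 0.9 syntax), a day-partitioned log table and the corresponding queries could look like this:

-- hypothetical day-partitioned table; the columns are just examples
CREATE TABLE log_partitioned (
  ts STRING,
  level STRING,
  message STRING
)
PARTITIONED BY (p_date INT);

-- load one day's logs into its own partition
INSERT OVERWRITE TABLE log_partitioned PARTITION (p_date = 20131017)
SELECT ts, level, message
FROM log
WHERE to_date(ts) = '2013-10-17';

-- the analytics query now reads only two partitions instead of the whole table
SELECT COUNT(*)
FROM log_partitioned
WHERE p_date = 20131017 OR p_date = 20131016;

Each partition is stored as its own directory under the table's location, which is why Hive can skip everything outside the two days you ask for.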

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow