Question

I have a scenario where I need to load data and store it into HDFS using Pig, and this result (the Pig output/part file data) should then be loaded into two tables in MySQL using Sqoop.

Can this be done with Sqoop? Are there any other solutions?

For example if i have a file like this

    col1 col2 col3 col4
    .... .... .... ....
    .... .... .... ....
    .... .... .... ....

I want to export col1 and col2 to table1, and col3 and col4 to table2 of some database.

Thanks in advance.

Solution

I'm using MySQL in the solution below, but the same approach should apply to other databases.

Create the following flat file on HDFS:

$ hadoop fs -cat sqoop_export
W1, X1, Y1, Z1
W2, X2, Y2, Z2
W3, X3, Y3, Z3
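
For completeness, one way to create that file is to write it locally and copy it into HDFS (a sketch; the local file name and HDFS path are illustrative):

$ printf 'W1, X1, Y1, Z1\nW2, X2, Y2, Z2\nW3, X3, Y3, Z3\n' > sqoop_export
$ hadoop fs -put sqoop_export sqoop_export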

Create two tables in MySQL:

mysql> create table A (col1 VARCHAR(20), col2 VARCHAR(20));
Query OK, 0 rows affected (0.08 sec)

mysql> create table B (col3 VARCHAR(20), col4 VARCHAR(20));
Query OK, 0 rows affected (0.01 sec)

Then create a stored procedure that takes four input values and inserts the first two into the first table and the last two into the second table:

mysql> delimiter //
mysql> CREATE PROCEDURE insert_two_tables (
    ->     IN c1 VARCHAR(20), IN c2 VARCHAR(20),
    ->     IN c3 VARCHAR(20), IN c4 VARCHAR(20))
    -> BEGIN
    ->     INSERT INTO A(col1, col2) VALUES(c1, c2);
    ->     INSERT INTO B(col3, col4) VALUES(c3, c4);
    -> END//
Query OK, 0 rows affected (0.04 sec)

Now use sqoop export, but don't specify a table name; instead, use the --call option to invoke the stored procedure created above:

$ sqoop export --connect jdbc:mysql://localhost/sqoop_export --username xyz --password test --call insert_two_tables --export-dir sqoop_export

The export process completes successfully:

14/03/24 17:52:53 INFO mapred.JobClient:     Physical memory (bytes) snapshot=668643328
14/03/24 17:52:53 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=7584153600
14/03/24 17:52:53 INFO mapred.JobClient:     Total committed heap usage (bytes)=1175584768
14/03/24 17:52:53 INFO mapreduce.ExportJobBase: Transferred 691 bytes in 16.8329 seconds (41.0506 bytes/sec)
14/03/24 17:52:53 INFO mapreduce.ExportJobBase: Exported 3 records

Now verify that the two tables have the data we are looking for:

mysql> select * from A;
+------+------+
| col1 | col2 |
+------+------+
| W3   | X3   |
| W2   | X2   |
| W1   | X1   |
+------+------+
3 rows in set (0.00 sec)

mysql> select * from B;
+------+------+
| col3 | col4 |
+------+------+
| Y3   | Z3   |
| Y2   | Z2   |
| Y1   | Z1   |
+------+------+
3 rows in set (0.00 sec)

So, using a stored procedure, one flat file on HDFS can be exported to multiple tables in the database.

If you don't want to use a stored procedure, the alternative would be to create two flat files on HDFS using Pig: one with col1 and col2, and the other with col3 and col4. Then you can run a separate sqoop export on each flat file into the corresponding table in the database, as sketched below.
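
For example, if the Pig job writes the two projections to hypothetical HDFS directories cols12_dir and cols34_dir, the two exports into tables A and B might look roughly like this (same connection details as above; the directory names and field delimiter are assumptions):

# PigStorage writes tab-delimited output by default, so tell Sqoop about the field separator
$ sqoop export --connect jdbc:mysql://localhost/sqoop_export --username xyz --password test \
    --table A --export-dir cols12_dir --input-fields-terminated-by '\t'

$ sqoop export --connect jdbc:mysql://localhost/sqoop_export --username xyz --password test \
    --table B --export-dir cols34_dir --input-fields-terminated-by '\t'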

Sqoop export offers a --columns option, but that is useful only when the database table has more columns than the flat file on HDFS; for the other direction a stored procedure seems to be necessary. As per SQOOP-749, the ability to invoke a stored procedure instead of specifying a table is available in Sqoop 1.4.3 and above. I used Hadoop 2.0.0 and Sqoop 1.4.3 in the above example.
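
For reference, a minimal sketch of the --columns form, assuming a hypothetical table A_wide whose extra columns are nullable or have defaults, and a hypothetical two-column input directory two_col_data:

# populate only col1 and col2 of the target table; the columns not listed
# must be nullable or have default values (table and directory names are illustrative)
$ sqoop export --connect jdbc:mysql://localhost/sqoop_export --username xyz --password test \
    --table A_wide --columns "col1,col2" --export-dir two_col_data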

OTHER TIPS

Easy! Just use Pig to split your data into two relations:

-- given a relation A with columns col1..col4, e.g. loaded from your data
-- (the input path and schema here are illustrative):
A = LOAD 'input_data' AS (col1, col2, col3, col4);

-- split into two relations:
Acols1_and_2 = FOREACH A GENERATE col1, col2;
Acols3_and_4 = FOREACH A GENERATE col3, col4;

-- store those relations in HDFS (output directories are illustrative):
STORE Acols1_and_2 INTO 'pig_out_cols12';
STORE Acols3_and_4 INTO 'pig_out_cols34';

Then run sqoop export twice, once for each relation.

Licensed under: CC-BY-SA with attribution