Is Apache Sqoop really necessary for Apache Hadoop? Is there any alternate way to use database inputs for processing in hadoop?

StackOverflow https://stackoverflow.com//questions/22068153

  •  23-12-2019

Question

It is a known fact that Hadoop works on the MapReduce concept. But it is not logically possible to split a database into blocks of data. For this purpose we have Apache Sqoop, which imports the contents of a database table into HDFS.

My question is: is it really that advantageous to use Sqoop with Hadoop? If yes, can anyone explain with a real-world example where Hadoop has been implemented to run MapReduce over database data?

It would be really good to know how MapReduce is implemented in database-related processing.

Thanks in advance.

Was it helpful?

Solution

BigSQL combines PostgreSQL and Hadoop. MongoDB's MapReduce is a pure MapReduce implementation on a "database".

Is that what you're asking?

Otherwise, Sqoop is great and widely adopted. Example domains: manufacturing, healthcare.

OTHER TIPS

Sqoop greatly simplifies importing and exporting data between Hadoop and MySQL. But if we look at the parallelism it offers through multiple map tasks, I would say it still consumes more time than the traditional export tools each database provides (e.g. mysqldump).
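As a sketch of that comparison, a traditional single-stream export and the equivalent parallel Sqoop import might look like this (the host, database, table, and credentials are hypothetical):

```shell
# Traditional export: one connection, one stream (hypothetical host/db/table)
mysqldump --host=db.example.com --user=etl --password \
          salesdb orders > orders.sql

# Sqoop import of the same table into HDFS with 10 parallel map tasks
sqoop import \
  --connect jdbc:mysql://db.example.com/salesdb \
  --username etl -P \
  --table orders \
  --target-dir /data/salesdb/orders \
  -m 10
```

Even with ten mappers, the Sqoop run pays the extra cost of the split computation described below before any data moves.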

That is because, if we configure the number of maps as 10 with -m 10, Sqoop does the job in two stages.

  1. It runs a query on the table to find the MIN and MAX values of the --split-by column (the primary key, if nothing is configured).

  2. Once the MIN and MAX values are known, it splits that range into smaller sub-ranges, one per map task; each map task then goes back to the database, fetches its slice of the data, and writes it into HDFS.
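The split in stage 2 is essentially range arithmetic. Assuming a numeric --split-by column with MIN 1, MAX 1000, and 10 maps (all hypothetical values; Sqoop's real splitter also handles remainders and non-numeric types), the per-mapper WHERE clauses can be sketched as:

```shell
# Sketch of how Sqoop partitions [MIN, MAX] into one range per map task
min=1; max=1000; maps=10
step=$(( (max - min + 1) / maps ))
for i in $(seq 0 $(( maps - 1 ))); do
  lo=$(( min + i * step ))
  hi=$(( lo + step - 1 ))
  echo "map task $i: WHERE id >= $lo AND id <= $hi"
done
```

Each generated range becomes the bounding condition of the query one map task issues against the database.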

So I would say it consumes X + Y time in total, where X is the time taken by the MIN/MAX query of the first stage (roughly the cost of a traditional full-table query) and Y is the time the map tasks spend fetching the data and writing it to HDFS.

Summary: Sqoop makes importing and exporting data between Hadoop and an RDBMS very simple, but it will not necessarily complete the task in less time than the database's native tools.

Each chapter of the Sqoop documentation provides multiple examples of how to use it; see, for example, the sqoop import example invocations.
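For instance, a table import with an explicit split column and the reverse export back into an RDBMS look roughly like this (connection string, table, column, and directory names are hypothetical):

```shell
# Import a whole table, splitting the work on a chosen column
sqoop import \
  --connect jdbc:mysql://db.example.com/salesdb \
  --username etl -P \
  --table orders \
  --split-by order_id \
  -m 4

# Export HDFS files back into an RDBMS table
sqoop export \
  --connect jdbc:mysql://db.example.com/salesdb \
  --username etl -P \
  --table orders_summary \
  --export-dir /data/reports/orders_summary
```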

Generally speaking, Sqoop is the simplest way to move your data between HDFS (via MapReduce) and SQL databases.

This presentation provides a very good introduction to Sqoop usage and internals.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow