Question

I am currently working on a project that compiles using JDK1.7, creates and runs Hadoop jobs using Cascading 1.2 (soon to be upgraded to 2.1) and uses a Cloudera distribution of Hadoop (0.20.2-cdh3u3).

I'm looking at how to modify my Cascading/Hadoop jobs to read and write all data to/from a MySQL database. It looks like Sqoop may be able to handle that.

However, from what I've seen so far there is little information or documentation on how to do this in Java (I understand Sqoop is mainly intended for batch jobs invoked from a shell), and the Java examples I have followed haven't worked for me. I tried using Sqoop 1.4 and switching my project to JDK 1.6, as I believe that is required (although it will break other parts of my project), but I still couldn't get it to work.

Does anyone know if what I'm trying to achieve is even possible? How are other people dealing with this problem? Will the release of SQOOP2 help at all?

The kinds of errors I'm seeing when I try to run an org.apache.sqoop.tool.ExportTool to export a CSV to a table are:

Can't initialize javac processor due to (most likely) a class loader problem: java.lang.NoClassDefFoundError: com/sun/tools/javac/processing/JavacProcessingEnvironment

Note: \tmp\sqoop-my.name\compile\9031edc8e43167c10f9f895b64aa79d5\MyTableName.java uses or overrides a deprecated API.

Encountered IOException running export job: java.io.IOException: Could not load jar \tmp\sqoop-my.name\compile\9031edc8e43167c10f9f895b64aa79d5\MyTableName.jar into JVM. (Could not find class MyTableName.)


Solution 3

Thanks Charles and Vikas. This certainly put me on the right track. I ended up using https://github.com/cwensel/cascading.jdbc, which uses the Hadoop classes DBInputFormat/DBOutputFormat and makes it easy to set up Cascading jobs that read from and write to a database.

To write, I just changed the output tap of my flow to:

import cascading.jdbc.JDBCScheme;
import cascading.jdbc.JDBCTap;
import cascading.jdbc.TableDesc;
import cascading.tap.Tap;

String url = "jdbc:mysql://localhost:3306/mydb?user=myusername&password=mypassword";
String driver = "com.mysql.jdbc.Driver";
String tableName = "mytable";
String[] columnNames = { "col1", "col2", "col3" }; // Columns I want to write to
TableDesc tableDesc = new TableDesc( tableName );

JDBCScheme dbScheme = new JDBCScheme( columnNames );
Tap dbOutputTap = new JDBCTap( url, driver, tableDesc, dbScheme );

And to read from the db I just made a tap that looked like this:

String url = "jdbc:mysql://localhost:3306/mydb?user=myusername&password=mypassword";
String driver = "com.mysql.jdbc.Driver";
String tableName = "mytable";
String[] columnNames = { "col1", "col2", "col3" }; // Columns I want to read from
TableDesc tableDesc = new TableDesc( tableName );

JDBCScheme dbScheme = new JDBCScheme( columnNames, "col1<40" );
Tap dbInputTap = new JDBCTap( url, driver, tableDesc, dbScheme );
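
To tie it together, here is a minimal sketch (not from the original answer; the pipe name is a hypothetical placeholder) of wiring the two taps into a Cascading 1.2 flow:

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.Pipe;

// Hypothetical wiring: read rows from dbInputTap, pass them through an
// identity pipe, and insert them via dbOutputTap. A real job would attach
// Each/Every operations to the pipe assembly.
Properties properties = new Properties();
Pipe pipe = new Pipe( "mysql-copy" );
Flow flow = new FlowConnector( properties ).connect( dbInputTap, dbOutputTap, pipe );
flow.complete();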

I came across Cascading-DBMigrate as well, but it seems it is only for reading from databases, not writing to them.

OTHER TIPS

Sqoop is designed for exporting/importing data between MySQL (and other relational databases) and Hadoop/HBase. A very good tutorial on Sqoop, which explains its various features, can be found here. I'm not sure if this is what you want to do.
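
That said, if you do want to drive Sqoop from Java rather than a shell, one approach (a sketch only, untested against the poster's CDH3 setup; connection details and paths are placeholders) is to hand the usual command-line arguments to Sqoop's runTool entry point. Note that Sqoop compiles a generated record class at runtime, so it needs a full JDK with tools.jar on the classpath; the NoClassDefFoundError for com/sun/tools/javac shown above is the classic symptom of running on a plain JRE.

import org.apache.sqoop.Sqoop;

// Equivalent to running `sqoop export ...` from the command line.
String[] args = new String[] {
    "export",
    "--connect", "jdbc:mysql://localhost:3306/mydb",
    "--username", "myusername",
    "--password", "mypassword",
    "--table", "mytable",
    "--export-dir", "/path/to/csv"
};
int ret = Sqoop.runTool( args ); // returns 0 on success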

If you need to read/write data from/to MySQL in MapReduce jobs, the Hadoop classes DBInputFormat/DBOutputFormat can be used, as suggested by @Charles.

If you just want to write your job output to MySQL, I would recommend using a different output format called DBOutputFormat as described here:

A companion class, DBOutputFormat, will allow you to write results back to a database. When setting up the job, call conf.setOutputFormat(DBOutputFormat.class); and then call DBConfiguration.configureDB() as before.

The DBOutputFormat.setOutput() method then defines how the results will be written back to the database. Its three arguments are the JobConf object for the job, a string defining the name of the table to write to, and an array of strings defining the fields of the table to populate. e.g., DBOutputFormat.setOutput(job, "employees", "employee_id", "name");.

The same DBWritable implementation that you created earlier will suffice to inject records back into the database. The write(PreparedStatement stmt) method will be invoked on each instance of the DBWritable that you pass to the OutputCollector from the reducer. At the end of reducing, those PreparedStatement objects will be turned into INSERT statements to run against the SQL database.

Where "as before" refers to this instruction:

DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/mydatabase");

To read from MySQL it's all the same with DBInputFormat.
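
Putting the quoted instructions together, a minimal sketch for the old (JobConf) API in Hadoop 0.20 might look like this; the table and field names are the ones from the example above, and the record class itself is hypothetical:

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// Each record the reducer emits as a key becomes one INSERT into "employees".
public class EmployeeRecord implements DBWritable {
    private int employeeId;
    private String name;

    public EmployeeRecord() {} // no-arg constructor required by Hadoop

    public EmployeeRecord( int employeeId, String name ) {
        this.employeeId = employeeId;
        this.name = name;
    }

    // Writing: bind one value per column named in DBOutputFormat.setOutput().
    public void write( PreparedStatement stmt ) throws SQLException {
        stmt.setInt( 1, employeeId );
        stmt.setString( 2, name );
    }

    // Reading (DBInputFormat): populate fields from the current row.
    public void readFields( ResultSet rs ) throws SQLException {
        this.employeeId = rs.getInt( 1 );
        this.name = rs.getString( 2 );
    }

    public static void configureJob( JobConf conf ) {
        conf.setOutputFormat( DBOutputFormat.class );
        DBConfiguration.configureDB( conf,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://localhost/mydatabase" );
        DBOutputFormat.setOutput( conf, "employees", "employee_id", "name" );
    }
}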

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow