Question

I have a few questions:

1. Why is there a MapReduce process in Sqoop to load data from HDFS to MySQL?

e.g.

Data is in HDFS on directory: /foo/bar

To load the data into the MySQL bar table, why is there a MapReduce process?

sqoop export --connect jdbc:mysql://localhost/hduser --table foo -m 1 --export-dir /foo/bar

After I enter the above command, a MapReduce process executes.

2. How can I enable/disable key in MySQL using Sqoop/Oozie?

Since a huge amount of data is being loaded into MySQL, we need to disable and re-enable keys around the load. How do I achieve this?

3. How to run multiple Oozie jobs in parallel? 

4. How to run Oozie jobs in Cron?

You can answer 1 or more questions.

Thank you.


Solution

I'll go through your questions one by one. Feel free to ask more questions in the comments and I will elaborate on the things that are unclear to you.

1. Why is there a MapReduce process in Sqoop to load data from HDFS to MySQL?

This is because Sqoop is built on MapReduce. Files in HDFS are split into blocks, and those blocks are distributed across the cluster (several blocks may sit on the same node). So it makes sense to run a MapReduce job whose map tasks read these blocks in parallel and write the rows to MySQL. In your command, -m 1 limits the job to a single map task, so the export still runs as a MapReduce job, just without any parallelism.

2. How can I enable/disable key in MySQL using Sqoop/Oozie?

I don't know the answer to this one. However, I feel that your question is a little ambiguous. Please add some more details, and if I find something I'll get back to you on this.

3. How to run multiple Oozie jobs in parallel?

Each Oozie job is defined by a workflow.xml and a job.properties.

  • If you're talking about manually running multiple Oozie workflows (jobs), submit each one with its own run command. The command returns as soon as the job is submitted, so the jobs run side by side on the Oozie server. Sample command: oozie job -config job.properties -run

  • If you're talking about running multiple actions within a single Oozie workflow in parallel, you can use a fork to start the actions in parallel and a join where the parallel paths meet once they complete. Example:

    <fork name = 'sampleFork'>
       <path start = 'sampleAction1'/>
       <path start = 'sampleAction2'/>
    </fork>
    
    <action name = 'sampleAction1'>
      ..
      ..
      ..
      <ok to = 'joinActions'/>
      <error to = 'fail'/>
    </action>

    <!-- sampleAction2 is defined the same way and also transitions to 'joinActions' -->
    
    <join name = 'joinActions' to = 'seqAction3'/>
    

4. How to run Oozie jobs in Cron?

If you want to automate the execution of Oozie jobs, I suggest you look into the Oozie coordinator. With a coordinator, you can schedule a workflow to trigger at a fixed interval (every 10 minutes, every hour, every day, and so on).
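
To make the coordinator idea concrete, here is a minimal sketch of a coordinator.xml that triggers one workflow every hour. The coordinator name, the start and end dates, and the workflowAppUri property are placeholders I made up for illustration, and the schema version is an assumption, so check it against what your Oozie release supports.

```xml
<!-- Minimal sketch: run one workflow every hour.
     The name, dates and ${workflowAppUri} are hypothetical placeholders. -->
<coordinator-app name="sampleCoordinator"
                 frequency="${coord:hours(1)}"
                 start="2024-01-01T00:00Z"
                 end="2024-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- HDFS directory that contains the workflow.xml to run -->
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Other intervals follow the same pattern, for example ${coord:minutes(10)} or ${coord:days(1)}. The coordinator is submitted with the same oozie job -config job.properties -run command, except that job.properties points at the coordinator through oozie.coord.application.path instead of oozie.wf.application.path. Newer Oozie releases should also accept a cron-like expression as the frequency, but verify that in the documentation for your version before relying on it.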

Licensed under: CC-BY-SA with attribution