Question

Scenario:

I have imported data from SQL Server into HDFS. The data is stored in the HDFS directory as multiple files:

part-m-00000
part-m-00001
part-m-00002
part-m-00003

Question:

My question is: when reading this stored data from the HDFS directory, do we have to read all the files (part-m-00000, 00001, 00002, 00003) or just part-m-00000? When I read the data, a little bit of it seemed to be missing. Does that happen, or did I miss something?


Solution

You need to read all the files, not just part-m-00000. The reason there are multiple files is that Sqoop works in a map-reduce fashion, splitting the import work across multiple map tasks, and the output of each task is written to a separate file.
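
As a quick sanity check, you can list and read every part file straight from HDFS. Below is a minimal sketch, assuming the import landed in an HDFS directory such as /user/hadoop/mytable (a hypothetical path; substitute your own target directory):

# List the part files written by the map tasks
$ hdfs dfs -ls /user/hadoop/mytable

# Concatenate all of them; the glob picks up part-m-00000 through part-m-00003
$ hdfs dfs -cat /user/hadoop/mytable/part-m-*

Reading only part-m-00000 gives you just the slice of rows handled by the first mapper, which is why the data looked incomplete.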


OTHER TIPS

Sqoop runs the import with no reducers. As a result, there is no consolidation of the part files produced by the mappers, so you will see one part file per mapper, depending on the number of mappers you set in the sqoop command with -m 4 or --num-mappers 4. So if you run sqoop import --connect jdbc:mysql://localhost/db --username <> --table <> -m 1, it will create only one part file.
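
For illustration, here is a hedged sketch of both variants; db, myuser, and mytable are placeholder names rather than values from the question:

# Four mappers -> four output files, part-m-00000 through part-m-00003
$ sqoop import --connect jdbc:mysql://localhost/db --username myuser --table mytable --num-mappers 4

# One mapper -> a single output file, part-m-00000
$ sqoop import --connect jdbc:mysql://localhost/db --username myuser --table mytable -m 1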

If your result set is huge, Hive will store the result in chunks. If you want to read all of those files from the CLI, execute the command below.

$ hdfs dfs -cat <target-dir>/part-m-*

This will give you the final result without any missing parts.
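
If you would rather end up with one local file instead of streaming the parts to the terminal, hdfs dfs -getmerge concatenates them for you. A small sketch, again using <target-dir> as a placeholder for the import directory and result.txt as a hypothetical output name:

# Merge every file under the import directory into a single local file
$ hdfs dfs -getmerge <target-dir> result.txt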

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow