Question

I am currently attempting to write a map-reduce job where the input data is not in HDFS and cannot be loaded into HDFS, essentially because the programs that use the data cannot read from HDFS and there is too much of it to copy in, at least 1 TB per node.

So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths of these 4 local directories and read them using something like file:///var/mydata/..., so that one mapper works on each directory, i.e. 16 mappers in total.

However, to be able to do this I need to ensure that I get exactly 4 mappers per node, and exactly the 4 mappers that have been assigned the paths local to that machine. The paths are static and so can be hard-coded into my FileInputFormat and RecordReader, but how do I guarantee that a given split ends up on a given node with a known hostname? If the data were in HDFS I could use a variant of FileInputFormat with isSplitable set to false and Hadoop would take care of it, but since all the data is local this causes issues.

Basically, all I want is to crawl the local directory structures on every node in my cluster exactly once, process a collection of SSTables in those directories and emit rows (in the mapper), then reduce the results (in the reduce step) into HDFS for further bulk processing.

I noticed that InputSplit provides a getLocations method, but I believe this does not guarantee locality of execution, it only optimises for it, and clearly if I try to use file:///some_path in each mapper I need to ensure exact locality, otherwise I may end up reading some directories repeatedly and others not at all.
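To make this concrete, here is a rough sketch of the kind of custom InputFormat I am imagining, with the hostnames and directory names hard-coded as placeholders. As I understand it, the value returned by getLocations() is only a scheduling hint, which is exactly the guarantee I am missing:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class LocalDirInputFormat extends InputFormat<Text, Text> {

    // One split per (host, local directory) pair; 4 nodes x 4 directories = 16 splits.
    public static class LocalDirSplit extends InputSplit implements Writable {
        private String host;
        private String dir;

        public LocalDirSplit() { }  // no-arg constructor needed for deserialization
        public LocalDirSplit(String host, String dir) { this.host = host; this.dir = dir; }

        @Override public long getLength() { return 0; }  // size unknown up front
        @Override public String[] getLocations() { return new String[] { host }; }  // locality hint only

        public String getHost() { return host; }
        public String getDir()  { return dir; }

        @Override public void write(DataOutput out) throws IOException {
            out.writeUTF(host);
            out.writeUTF(dir);
        }
        @Override public void readFields(DataInput in) throws IOException {
            host = in.readUTF();
            dir = in.readUTF();
        }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        String[] hosts = { "node1", "node2", "node3", "node4" };  // placeholder hostnames
        for (String host : hosts) {
            for (int i = 1; i <= 4; i++) {
                splits.add(new LocalDirSplit(host, "/var/mydata/dir" + i));  // placeholder directories
            }
        }
        return splits;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // The real reader would crawl split.getDir() and emit SSTable rows; omitted here,
        // since only the split/locality mechanics are being illustrated.
        throw new UnsupportedOperationException("RecordReader omitted in this sketch");
    }
}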

Any help would be greatly appreciated.


Solution

I see there are three ways you can do it.

1.) Simply load the data into HDFS, which you do not want to do. But it is worth trying, as it will be useful for future processing.

2.) You can make use of NLineInputFormat. Create four different files, one per node, each listing the URLs of the input files that are local to that node, for example:

file://192.168.2.3/usr/rags/data/DFile1.xyz
.......

You load these files into HDFS and point your job at them; your program then accesses the data through these URLs and processes it. If you use NLineInputFormat with 1 line per split, you will get 16 mappers, each map processing an exclusive file. The only issue here is that there is a high possibility that the data on one node may be processed on another node; however, there will not be any duplicate processing.
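Here is a rough sketch of how this could be wired up with the new (org.apache.hadoop.mapreduce) API. The class names are illustrative, args[0] is assumed to be the HDFS directory holding the four URL-list files and args[1] the job output path, and reading a file:// URL of course only works on a node that actually has that local file:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlListDriver {

    // Each call to map() receives one line of a URL-list file, i.e. one file:// URL.
    public static class UrlMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text urlLine, Context context)
                throws IOException, InterruptedException {
            URI uri = URI.create(urlLine.toString().trim());
            FileSystem fs = FileSystem.get(uri, context.getConfiguration());
            try (FSDataInputStream in = fs.open(new Path(uri))) {
                BufferedReader reader = new BufferedReader(new InputStreamReader(in));
                String line;
                while ((line = reader.readLine()) != null) {
                    // Replace with the real SSTable processing; this just echoes rows.
                    context.write(new Text(uri.toString()), new Text(line));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-url-lists");
        job.setJarByClass(UrlListDriver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);         // one URL line per mapper
        FileInputFormat.addInputPath(job, new Path(args[0])); // HDFS dir with the 4 URL-list files

        job.setMapperClass(UrlMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}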

3.) You can further optimize the above method by loading the four URL files separately. While loading any one of them, you can take the other three nodes down to ensure that the file goes exactly to the node where its data files are locally present. While loading, set the replication factor to 1 so that the blocks are not replicated. This process greatly increases the probability that the launched maps will process the local files.
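A small sketch of the replication-1 load via the FileSystem API (the paths are illustrative, and taking the other datanodes offline is an operational step not shown here). With the default placement policy, a write issued from a node that runs a datanode puts the first, and here only, replica on that same datanode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadUrlFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "1");  // keep a single replica so the block is not copied elsewhere
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("file:///usr/rags/urls/node1-urls.txt");  // URL list prepared on this node
        Path dest  = new Path("/user/rags/urls/node1-urls.txt");        // destination in HDFS

        fs.copyFromLocalFile(local, dest);
        // For a file already in HDFS, fs.setReplication(dest, (short) 1) achieves the same effect.
    }
}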

Cheers, Rags

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow