HDFS FileSplit locations

https://stackoverflow.com/questions/18863799

29-06-2022

Question

I have a cluster with an installation of hadoop-2.1.0-beta. Is there a way to learn where each filesplit is located in my cluster? What I am looking for is a list such as the following:

filesplit_0001 node1
filesplit_0002 node4
...

Edit: I know that such a list is available in Microsoft Azure.

Solution

The fsck tool provides an easy way to find out which blocks make up a particular file and, with the -locations option, which datanodes hold each block. For example:

% hadoop fsck <path> -files -blocks -locations -racks

Reference: Hadoop Command Line Guide.

Edit:

An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical, whereas HDFS blocks are physical. This is why fsck reports block locations rather than split locations.
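To make the logical/physical distinction concrete, here is a minimal sketch that computes which HDFS block indices a split overlaps, given the split's byte offset and length. The class name SplitBlockMath, the method blocksForSplit, and the 128 MB block size are assumptions for illustration only; the real block size depends on how the file was written on your cluster.

```java
import java.util.ArrayList;
import java.util.List;

class SplitBlockMath {
    // Assumed block size for illustration only (128 MB).
    public static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Returns the zero-based indices of the blocks overlapped by bytes [start, start + length). */
    public static List<Long> blocksForSplit(long start, long length, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        if (length <= 0) {
            return blocks; // an empty split touches no block
        }
        long first = start / blockSize;
        long last = (start + length - 1) / blockSize;
        for (long b = first; b <= last; b++) {
            blocks.add(b);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A split aligned with the second block stays inside that block: prints [1]
        System.out.println(blocksForSplit(BLOCK_SIZE, BLOCK_SIZE, BLOCK_SIZE));
        // A split that starts mid-block spills into the next block: prints [0, 1]
        System.out.println(blocksForSplit(BLOCK_SIZE / 2, BLOCK_SIZE, BLOCK_SIZE));
    }
}
```

A split that lines up with a block boundary maps to exactly one block, which is the common case for plain file input; a misaligned split touches two blocks that may live on different nodes.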

An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings. A split doesn’t contain the input data; it is just a reference to the data.

You can get the InputSplit instance inside the map method:

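// Inside Mapper.map(); InputSplit is org.apache.hadoop.mapreduce.InputSplit (new MapReduce API).
// getLocations() can throw IOException and InterruptedException.
InputSplit inputSplit = context.getInputSplit();      // the split this map task is processing
String[] splitLocations = inputSplit.getLocations();  // hostnames where the split's data is stored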
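A split typically reports several hosts, since its data is replicated, so a listing like the one in the question has to join them per split. The sketch below shows only that formatting step and has no Hadoop dependency. The class name SplitLocationReport, the split names, and the host names are hypothetical; in a real job each array would be the value returned by inputSplit.getLocations().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitLocationReport {
    /** Formats one line per split: "<split name> <host1>,<host2>,...". */
    public static List<String> format(Map<String, String[]> locationsBySplit) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String[]> entry : locationsBySplit.entrySet()) {
            lines.add(entry.getKey() + " " + String.join(",", entry.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical data standing in for the arrays returned by getLocations().
        Map<String, String[]> locations = new LinkedHashMap<>();
        locations.put("filesplit_0001", new String[] {"node1", "node2"});
        locations.put("filesplit_0002", new String[] {"node4"});
        for (String line : format(locations)) {
            System.out.println(line);
        }
    }
}
```

A LinkedHashMap is used so the lines come out in the order the splits were recorded.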

Licensed under: CC-BY-SA with attribution
