Question

I have a problem with the DistributedCache in the new Hadoop 2.x API. I found people working around this issue, but their example does not solve my problem.

That solution does not work for me, because I get a NullPointerException when trying to retrieve the data from the DistributedCache.

My Configuration is as follows:

Driver

    public int run(String[] arg) throws Exception {
        Configuration conf = this.getConf();
        Job job = new Job(conf, "job Name");
        ...
        job.addCacheFile(new URI(arg[1]));

Setup

    protected void setup(Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        URI[] cacheFiles = context.getCacheFiles();
        // NullPointerException is thrown here: cacheFiles is null
        BufferedReader dtardr = new BufferedReader(new FileReader(cacheFiles[0].toString()));

When it starts creating the buffered reader it throws the NullPointerException, because context.getCacheFiles() always returns null. How can I solve this problem, and where are the cache files stored (HDFS, or the local file system)?

Solution

If you run the job with the local JobRunner in Hadoop (non-distributed mode, as a single Java process), then no local data directory is created, and the getLocalCacheFiles() or getCacheFiles() calls return an empty result (here, null). Make sure you are running the job in distributed or pseudo-distributed mode.
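
If you need to keep developing locally, a defensive check in setup() turns the bare NullPointerException into an explicit error. A minimal sketch (the message text is only illustrative):

    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles == null || cacheFiles.length == 0) {
        // under the local job runner the cache is never populated
        throw new IOException("Distributed cache is empty; "
                + "run the job in (pseudo-)distributed mode.");
    }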

The Hadoop framework copies the files placed in the distributed cache to the local working directory of each task in the job. Copies of all cached files are placed on the local file system of each worker machine (in a subdirectory of mapred.local.dir).
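
That also answers the second question: the URIs returned by getCacheFiles() point at the HDFS originals, while the readable copies are local. Each cached file is symlinked into the task's working directory under its base name, so setup() can open it like a plain local file. A minimal sketch, assuming the file was added without a URI fragment:

    protected void setup(Context context)
            throws IOException, InterruptedException {
        URI[] cacheFiles = context.getCacheFiles();
        // take the base name of the cached file; a symlink with that
        // name exists in the task's local working directory
        String linkName = new Path(cacheFiles[0].getPath()).getName();
        BufferedReader dtardr = new BufferedReader(new FileReader(linkName));
    }

(Path here is org.apache.hadoop.fs.Path.) If you prefer a fixed name, add the file with a fragment in the driver, e.g. new URI(arg[1] + "#cachefile"), and open "cachefile" directly.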

See the DistributedCache documentation for more about how it works.

Licensed under: CC-BY-SA with attribution