Question

I have a 1.5 GB file that contains a serialized HashMap.

I have a setup() method in the Mapper class where I am reading this into a HashMap variable.

It looks like it reaches the read call, but then the tasks immediately fail with a Java heap space error.

I have read in many discussions that the mapred.child.java.opts parameter may need to be set, and I am doing that in the main program code.

I am using: conf.set("mapred.child.java.opts.", "-Xmx1024M");

I even tried increasing that number. Why does it still throw the same error when it tries to read the serialized file into the HashMap variable?

Here is the code in my setup() method:

    try {
        test = "hello";
        Path pt = new Path("hdfs://localhost:9000/user/watsonuser/topic_dump.tsv");
        FileSystem fs = FileSystem.get(new Configuration());
        // deserialize the HashMap straight from HDFS
        InputStream is = fs.open(pt);
        ObjectInputStream s = new ObjectInputStream(is);
        nameMap = (HashMap<String, String>) s.readObject();
        s.close();
    } catch (Exception e) {
        System.out.println("Exception while reading the nameMap file.");
        e.printStackTrace();
    }

Solution

Since you're loading a serialized version of the hash map, and the serialized file is 1.5 GB, I'm guessing the JVM is going to need at least 1.5 GB of memory to hold it.

You should be able to test this with a small program that loads the file (much as you already do), increasing the -Xmx value until you no longer see the memory error. That value is your baseline; you'll probably still need to add some headroom when running inside a Hadoop mapper, since the mapper also needs buffer space for sorting, spills, etc.
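For example, a minimal standalone loader along these lines (the class name, the local copy of the file, and the printed statistics are just placeholders for this sketch) lets you bisect the -Xmx value outside of Hadoop:

    // HeapBaseline.java -- a rough sketch for finding the heap baseline.
    // It assumes the serialized map has been copied to the local filesystem
    // (e.g. via hadoop fs -get). Run with: java -Xmx2g HeapBaseline topic_dump.tsv
    // and raise -Xmx until the OutOfMemoryError goes away.
    import java.io.FileInputStream;
    import java.io.ObjectInputStream;
    import java.util.HashMap;

    public class HeapBaseline {
        public static void main(String[] args) throws Exception {
            String path = args.length > 0 ? args[0] : "topic_dump.tsv";
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(path))) {
                @SuppressWarnings("unchecked")
                HashMap<String, String> nameMap = (HashMap<String, String>) in.readObject();
                long used = Runtime.getRuntime().totalMemory()
                          - Runtime.getRuntime().freeMemory();
                System.out.println("Entries loaded: " + nameMap.size());
                System.out.println("Approx. heap used (MB): " + used / (1024 * 1024));
            }
        }
    }

Whatever -Xmx you settle on there is roughly what mapred.child.java.opts needs to allow, plus the mapper's own working memory.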

Do you also know how many bins and entries this hash map holds? HashMap is implemented as an array of bins, each holding a linked list of the entries that hash to that bin. The number of bins must be a power of two, so as you add more and more entries, the memory required for the backing array doubles each time the map crosses its resize threshold (capacity times the load factor, 0.75 by default). With that in mind, I imagine the problem you're seeing is that such a large hash map (1.5 GB serialized) will need at least as large a memory footprint, if not larger, once deserialized into memory.
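To make the doubling concrete, here is a back-of-the-envelope sketch (the entry count and per-entry overhead figures are assumptions for a 64-bit JVM, not measurements) of how the backing table grows to the next power of two above size / 0.75:

    // MapGrowthSketch.java -- rough arithmetic only, no actual HashMap is built.
    public class MapGrowthSketch {
        public static void main(String[] args) {
            long entries = 50_000_000L;          // hypothetical number of entries
            long capacity = 16;                  // HashMap's default initial table size
            while (capacity * 0.75 < entries) {  // resize threshold = capacity * load factor
                capacity <<= 1;                  // the table doubles on each resize
            }
            // Ballpark: ~8 bytes per table slot (a reference) plus ~48 bytes of
            // node and object-header overhead per entry, before counting the
            // actual key and value strings.
            long approxOverhead = capacity * 8L + entries * 48L;
            System.out.println("Table capacity: " + capacity);
            System.out.println("Approx. structural overhead (MB): "
                + approxOverhead / (1024 * 1024));
        }
    }

The point is that the deserialized structure carries all of this per-bin and per-entry overhead on top of the raw key and value data, which is why the in-memory footprint tends to exceed the 1.5 GB on disk.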

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow