Problem

I have a 1.5 GB file that contains a serialized HashMap.

I have a setup() method in the Mapper class where I am reading this into a HashMap variable.

It looks like it gets as far as the read call, but the tasks then immediately fail with a Java heap space error.

I have read in many discussions that we may need to set the mapred.child.java.opts parameter, and I am doing that inside the main program code.

I am using: conf.set("mapred.child.java.opts.", "-Xmx1024M");

I even tried increasing that value. Why does it still throw the same error at the point where it tries to read the serialized file into the HashMap variable?

Here is the code in my setup() method:

try {
    test = "hello";
    Path pt = new Path("hdfs://localhost:9000/user/watsonuser/topic_dump.tsv");
    FileSystem fs = FileSystem.get(new Configuration());
    InputStream is = fs.open(pt);
    ObjectInputStream s = new ObjectInputStream(is);
    nameMap = (HashMap<String, String>) s.readObject();
    s.close();
} catch (Exception e) {
    System.out.println("Exception while reading the nameMap file.");
    e.printStackTrace();
}

Solution

As you're loading the serialized version of the hash map, and the file itself is 1.5 GB, I'm guessing the amount of memory your JVM is going to need is at least 1.5 GB.
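
If the mapper heap really is too small, you can raise it before the job is submitted. Below is a minimal sketch, assuming the classic mapred.child.java.opts property (Hadoop 2.x / YARN uses mapreduce.map.java.opts and mapreduce.map.memory.mb instead); the driver class name and the -Xmx value are placeholders to be tuned against your own baseline:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TopicJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Heap for the child task JVMs; note the property name has no trailing dot.
        conf.set("mapred.child.java.opts", "-Xmx3072m");
        // On Hadoop 2.x / YARN the per-task equivalents would be:
        // conf.set("mapreduce.map.java.opts", "-Xmx3072m");
        // conf.set("mapreduce.map.memory.mb", "4096");

        // Set these before creating the Job: the Job takes a copy of the
        // configuration, so later changes are not picked up.
        Job job = Job.getInstance(conf, "topic job");
        // ... mapper class, input/output paths, etc. ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}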

You should be able to test this with a small program that loads your file (much as you already do), increasing the -Xmx value until you no longer see the memory error. That gives you a baseline; you'll probably still need some extra headroom when running inside a Hadoop mapper, since it has its own buffer requirements for spills, sorting, and so on.
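
A stand-alone loader along these lines (the path and types are copied from the setup() code above; the class name is just for illustration) can be run repeatedly with java -Xmx1g, -Xmx2g, ... until the OutOfMemoryError disappears:

import java.io.InputStream;
import java.io.ObjectInputStream;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameMapLoadTest {
    public static void main(String[] args) throws Exception {
        Path pt = new Path("hdfs://localhost:9000/user/watsonuser/topic_dump.tsv");
        FileSystem fs = FileSystem.get(new Configuration());

        // Deserialize the whole map, exactly as the mapper's setup() does.
        try (InputStream is = fs.open(pt);
             ObjectInputStream s = new ObjectInputStream(is)) {
            @SuppressWarnings("unchecked")
            HashMap<String, String> nameMap = (HashMap<String, String>) s.readObject();
            System.out.println("Loaded " + nameMap.size() + " entries");
        }
    }
}

Whatever -Xmx value succeeds here, plus some margin for the framework's own buffers, is roughly what the map tasks will need.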

Do you also know how many bins and items are represented in this hash map? A HashMap is essentially an array of bins, each holding a linked list of the entries that hash to that bin number. The number of bins has to be a power of two, so as you put more and more items into your map, the memory requirement for the backing array doubles every time the map exceeds its threshold (capacity × the 0.75 load factor). With this in mind, I imagine the problem you're seeing is that such a large hash map (1.5 GB serialized) will require an equally large, if not larger, memory footprint once it's deserialized into memory.
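
As a rough back-of-the-envelope sketch of why the deserialized map can outgrow the 1.5 GB file (the entry count and string lengths below are invented purely for illustration, and the per-object overheads are approximate figures for a 64-bit, pre-Java-9 JVM):

public class HashMapFootprintEstimate {
    public static void main(String[] args) {
        // Invented figures, purely for illustration.
        long entries = 10_000_000L;              // key/value pairs
        long avgKeyChars = 30, avgValueChars = 100;

        // Backing array: smallest power of two >= entries / 0.75 (the load factor).
        long buckets = Long.highestOneBit((long) Math.ceil(entries / 0.75) - 1) << 1;
        long tableBytes = buckets * 8;           // one reference per bucket

        // Per entry: an entry node (~48 bytes) plus two Strings, each with
        // roughly 40 bytes of object/array overhead plus 2 bytes per char.
        long entryBytes = entries
                * (48 + 40 + 2 * avgKeyChars + 40 + 2 * avgValueChars);

        System.out.printf("~%.1f GB estimated in-memory footprint%n",
                (tableBytes + entryBytes) / (1024.0 * 1024 * 1024));
    }
}

With those made-up numbers the estimate comes out around 3-4 GB, which is why a 1.5 GB serialized file can comfortably blow through a 1 GB heap.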

License: CC-BY-SA with attribution
Not affiliated with StackOverflow