Hadoop Job fails with Native SimString C code on Large Data

https://stackoverflow.com/questions/21713531

10-10-2022

Question

I am having issues running a job on large data (~15 GB) on a Hadoop cluster using the native SimString library. The job runs fine on small and medium datasets (~200 MB). During the job, SimString first creates a file-based database of strings and then matches a given string against the strings in that database. After the job completes, it deletes the file-based database. The job runs in a multi-threaded fashion (100 threads).

Around 22 mappers are created for the job, each running 100 threads. The machine has 4 GB of RAM in total.

The error log looks like this:

14/02/12 00:15:53 INFO mapred.JobClient:  map 0% reduce 0%
14/02/12 00:16:13 INFO mapred.JobClient:  map 4% reduce 0%
14/02/12 00:16:24 INFO mapred.JobClient: Task Id : attempt_201402091522_0059_m_000001_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # A fatal error has been detected by the Java Runtime Environment:
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: #  SIGSEGV (0xb) at pc=0x00007f6f1cd8827b, pid=21146, tid=140115055609600
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # JRE version: 6.0_45-b06
attempt_201402091522_0059_m_000001_0: # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.45-b01 mixed mode linux-amd64 compressed oops)
attempt_201402091522_0059_m_000001_0: # Problematic frame:
attempt_201402091522_0059_m_000001_0: # C  [libSimString.so+0x6c27b][thread 140115045103360 also had an error]
attempt_201402091522_0059_m_000001_0:   cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # An error report file with more information is saved as:
attempt_201402091522_0059_m_000001_0: # /app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201402091522_0059/attempt_201402091522_0059_m_000001_0/work/hs_err_pid21146.log
attempt_201402091522_0059_m_000001_0: [thread 140115070318336 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140114919028480 also had an error]
attempt_201402091522_0059_m_000001_0: [thread 140115089229568 also had an error]
attempt_201402091522_0059_m_000001_0: #
attempt_201402091522_0059_m_000001_0: # If you would like to submit a bug report, please visit:
attempt_201402091522_0059_m_000001_0: #   http://java.sun.com/webapps/bugreport/crash.jsp
attempt_201402091522_0059_m_000001_0: # The crash happened outside the Java Virtual Machine in native code.
attempt_201402091522_0059_m_000001_0: # See problematic frame for where to report the bug.

The problem appears to originate in native code, as seen here:

cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f

But I do not understand why this causes no issue on the small dataset. I am running the following hadoop command:

hadoop jar hadoopjobs/job.jar Job -D mapred.child.java.opts=-Xss500k -D mapred.reduce.child.java.opts=-Xmx200m -files file1,file2,/home/hduser/libs/libSim/x64/libSimString.so -libjars /home/hduser/libs/Simstring.jar /datasources/XXX/spool/input datasources/XXX/spool/output

References: SimString library: http://www.chokkan.org/software/simstring/

Source Code of cdbpp::cdbpp_base::get(void const*, unsigned long, unsigned long*) const+0x16f: https://gitorious.org/copy-paste/copy-paste/commit/5d9c6b5b29fb2b1b8dd571260e7d50d9c42db9f9

Solution 2

As I identified earlier, the problem was in calling the method below from Java:

 cdbpp::cdbpp_base<cdbpp::murmurhash2>::get(void const*, unsigned long, unsigned long*) const+0x16f

I was using 100 threads per mapper, and of the 22 mappers in total, 2 ran in parallel. A static reader called the above method without synchronization, which caused the problem. Surrounding this method call with a synchronized block solved the problem.
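A minimal sketch of that fix, not the actual job code: the Reader interface below is a hypothetical stand-in for the native SimString reader (its real Java binding is not shown in the question), and the class and method names are made up for illustration. The point is only that every call into the shared reader goes through one lock, so a single thread is inside the native lookup at any time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class SynchronizedLookup {
    /** Hypothetical stand-in for the native reader; NOT the real SimString API. */
    interface Reader {
        List<String> retrieve(String query);
    }

    // One lock guards the shared reader for all worker threads in the JVM.
    private static final Object LOCK = new Object();
    private final Reader reader;

    SynchronizedLookup(Reader reader) {
        this.reader = reader;
    }

    /** Only one thread at a time may enter the (non-thread-safe) reader. */
    List<String> lookup(String query) {
        synchronized (LOCK) {
            return reader.retrieve(query);
        }
    }

    /** Runs concurrent lookups on a fake reader and reports the peak number of threads inside it. */
    static int maxConcurrentCalls(final int threads, final int callsPerThread) {
        final AtomicInteger inside = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();
        Reader fake = query -> {
            peak.accumulateAndGet(inside.incrementAndGet(), Math::max);
            Thread.yield(); // give other threads a chance to race in
            inside.decrementAndGet();
            return Collections.singletonList(query);
        };
        final SynchronizedLookup lookup = new SynchronizedLookup(fake);
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    lookup.lookup("q" + i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak threads inside reader: " + maxConcurrentCalls(8, 200));
    }
}
```

The trade-off is that the lookups themselves no longer run in parallel within a mapper; the 100 threads only overlap on work done outside the lock.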

OTHER TIPS

The problem is likely not with your MurmurHash2 hashing, but rather with the native library and how it allocates memory.

I'm not experienced with JNI calls, but they can be problematic when it comes to memory use (every such call allocates stack and heap space). One cannot be sure that the GC will trigger correctly (read the horror stories about GZIPInputStream).

You say you have 22*100 threads created, each one likely allocating some stack for JNI calls, and just 4 GB of memory in the box. The machine seems quite crowded, and I guess CPU/memory access is the constraint here, not long external waits (where only a few threads are really active in parallel)?
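As a rough back-of-the-envelope check, the thread stacks alone can be estimated from the numbers in the question. The sketch below assumes every thread gets the 500 KB stack set by -Xss500k in the posted command, and it ignores heap and native allocations entirely; the class and method names are made up for illustration.

```java
class ThreadStackEstimate {
    /** Stack space reserved if every thread gets stackKb kilobytes (1 KB = 1024 bytes). */
    static long stackBytes(int mappers, int threadsPerMapper, long stackKb) {
        return (long) mappers * threadsPerMapper * stackKb * 1024L;
    }

    public static void main(String[] args) {
        long twoMappers = stackBytes(2, 100, 500);  // 2 mappers running at once
        long allMappers = stackBytes(22, 100, 500); // all 22 mappers at once
        System.out.println("2 mappers:  " + twoMappers + " bytes");
        System.out.println("22 mappers: " + allMappers + " bytes");
    }
}
```

That comes to about 98 MiB for 200 threads and a little over 1 GiB for 2200 threads, so stack space by itself would only become a large share of 4 GB if all mappers ran at the same time.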

What happens when you lower the number of threads radically? How is the SimString library meant to be used? Does it have an internal threading model that should be respected (i.e. letting just one thread make its queries at a time)?
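One way to try a much lower thread count is to submit the work to a fixed-size pool instead of starting a thread per task. This is only a sketch under assumptions: the task body is a placeholder for the real matching work, and the pool size of 4 in main is an arbitrary value to experiment with, not a recommendation from the SimString documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedWorkers {
    /** Runs `tasks` placeholder jobs on at most `poolSize` threads and returns the peak concurrency seen. */
    static int peakConcurrency(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            futures.add(pool.submit(() -> {
                peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                Thread.yield(); // placeholder for the real string-matching work
                running.decrementAndGet();
            }));
        }
        try {
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface any task failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + peakConcurrency(4, 1000));
    }
}
```

With a pool like this, the number of threads that can be inside the native library at once is capped by the pool size, which makes it easy to test whether the crash disappears as the count goes down.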

I'm afraid the JNI is quite single-threaded.

Read more about how native calls allocate memory.

Licensed under: CC-BY-SA with attribution
