Question

I want to implement the ID3/C4.5 decision tree algorithms on Hadoop. Can anyone suggest how to go about it?

I am clear about the algorithms but I need to know how to parallelize them.

No correct solution

OTHER TIPS

I would consider making each iteration of attribute selection a single MapReduce job. Following this idea, you can assign each mapper one attribute for which it computes the information gain, and in the reduce phase (with a single reducer) select the best attribute.
I would consider this approach practical only if computing a single iteration on one machine (over all attributes) takes somewhat longer than the job start overhead, which is about 20-40 seconds.
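
To make the idea concrete, here is a minimal sketch of one such iteration, simulated in plain Python rather than written against the Hadoop API. The function names (`map_attribute`, `reduce_best`, `select_split`) and the row layout (a list of dicts with a label key) are assumptions for illustration only; a real job would read its input from HDFS and would have to get the relevant data to each mapper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def map_attribute(attr, rows, label_key):
    """Map step (hypothetical): information gain of a single attribute."""
    labels = [row[label_key] for row in rows]
    # Group the class labels by the value this attribute takes in each row.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attr]].append(row[label_key])
    # Weighted entropy that remains after splitting on this attribute.
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return attr, entropy(labels) - remainder


def reduce_best(pairs):
    """Reduce step (single reducer): keep the attribute with the highest gain."""
    return max(pairs, key=lambda pair: pair[1])


def select_split(rows, attributes, label_key):
    """One iteration, i.e. one simulated job: map over attributes, then reduce."""
    return reduce_best(map_attribute(a, rows, label_key) for a in attributes)
```

Calling `select_split(rows, ["outlook", "windy"], "play")` returns the chosen attribute together with its gain. To grow the tree, you would then partition the data on that attribute and run the same job again for each child node, so the per-job start overhead is paid once per iteration.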

Licensed under: CC-BY-SA with attribution