I'm trying to understand this code for canopy clustering. The purpose of these two classes (one map, one reduce) is to find canopy centers. My problem is that I don't understand the difference between the map and reduce functions. They are nearly the same.
So is there a difference? Or am I just repeating the same process again in the reducer?
My guess is that the difference lies in how the map and reduce phases handle the data: even with nearly identical code, they act on different inputs.
So can someone please explain the process of the map and reduce when we try to find the canopy centers?
I know, for example, that a map output might look like this: (joe, 1) (dave, 1) (joe, 1) (joe, 1)
and then the reduce will go like this: (joe, 3) (dave, 1)
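For reference, this is the pattern I mean, written in plain Java rather than Hadoop (just a sketch to show the map phase emitting pairs and the reduce phase aggregating them; `WordCountSketch` is my own throwaway name):

```java
import java.util.*;

public class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every word seen
    static List<Map.Entry<String, Integer>> map(String[] words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) pairs.add(Map.entry(w, 1));
        return pairs;
    }

    // "reduce" phase: sum the 1s for each distinct key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {"joe", "dave", "joe", "joe"};
        System.out.println(reduce(map(input))); // prints {dave=1, joe=3}
    }
}
```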
Does the same type of thing happen here?
Or maybe I'm performing the same task twice?
Thanks so much.
map function:
package nasdaq.hadoop;
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
public class CanopyCentersMapper extends Mapper<LongWritable, Text, Text, Text> {
//A list with the centers of the canopy
private ArrayList<ArrayList<String>> canopyCenters;
@Override
public void setup(Context context) {
this.canopyCenters = new ArrayList<ArrayList<String>>();
}
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//Separate the stock name from the values to create a key of the stock and a list of values - what is the list of values?
//What exactly are we splitting here?
ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(",")));
//Remove the stock name to use as the key; the remaining values are the candidate canopy center
String stockKey = stockData.remove(0);
//?
String stockValue = StringUtils.join(",", stockData);
//Check whether the stock is available for use as a new canopy center
boolean isClose = false;
for (ArrayList<String> center : canopyCenters) { //Run over the centers
//I think...let's say at this point we have a few centers. Then we have our next point to check.
//We have to compare that point with EVERY center already created. If the distance to every center is larger than T1,
//then that point becomes a new center! But the more canopies we have, the better the chance it falls within
//the radius of one of them...
//Measure the distance between this center and the current point
if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
//Center is too close
isClose = true;
break;
}
}
//The point is not within T1 of any existing center
if (!isClose) {
//Add the current point as a new canopy center
canopyCenters.add(stockData);
//Prepare hadoop data for output
Text outputKey = new Text();
Text outputValue = new Text();
outputKey.set(stockKey);
outputValue.set(stockValue);
//Output the stock key and values to reducer
context.write(outputKey, outputValue);
}
}
}
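If I understand the mapper correctly, the center-selection part is just this greedy pass (a plain-Java sketch with made-up `double[]` points and a simple Euclidean distance standing in for `ClusterJob.measureDistance`; `CanopySketch` and `pickCenters` are names I invented for illustration):

```java
import java.util.*;

public class CanopySketch {
    // Plain Euclidean distance between two points of equal dimension
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Greedy pass: a point becomes a new center only if it is farther
    // than t1 from EVERY center chosen so far
    static List<double[]> pickCenters(List<double[]> points, double t1) {
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean isClose = false;
            for (double[] c : centers) {
                if (dist(c, p) <= t1) { isClose = true; break; }
            }
            if (!isClose) centers.add(p);
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> pts = Arrays.asList(
            new double[]{0, 0}, new double[]{1, 0}, new double[]{10, 0});
        // (0,0) starts a canopy, (1,0) is within t1=3 of it, (10,0) starts a new one
        System.out.println(pickCenters(pts, 3.0).size()); // prints 2
    }
}
```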
Reduce function:
package nasdaq.hadoop;
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public class CanopyCentersReducer extends Reducer<Text, Text, Text, Text> {
//The canopy centers list
private ArrayList<ArrayList<String>> canopyCenters;
@Override
public void setup(Context context) {
//Create a new list for the canopy centers
this.canopyCenters = new ArrayList<ArrayList<String>>();
}
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
//Convert the key and value back into the stock name and its data list
String stockValue = value.toString();
ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(stockValue.split(",")));
String stockKey = key.toString();
//Check whether the stock is available for use as a new canopy center
boolean isClose = false;
for (ArrayList<String> center : canopyCenters) { //Run over the centers
//Measure the distance between this center and the current point
if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
//Center is too close
isClose = true;
break;
}
}
//The point is not within T1 of any existing center
if (!isClose) {
//Add the current point as a new canopy center
canopyCenters.add(stockData);
//Prepare hadoop data for output
Text outputKey = new Text();
Text outputValue = new Text();
outputKey.set(stockKey);
outputValue.set(stockValue);
//Write the surviving center as final output
context.write(outputKey, outputValue);
}
}
}
}
**Edit** -- more code and explanation
stockKey is the key identifying the stock (NASDAQ tickers and the like).
ClusterJob.measureDistance():
public static double measureDistance(ArrayList<String> origin, ArrayList<String> destination)
{
double deltaSum = 0.0;
//Run over all points in the origin vector and calculate the sum of the squared deltas
for (int i = 0; i < origin.size(); i++) {
if (destination.size() > i) //Only add to sum if there is a destination to compare to
{
deltaSum = deltaSum + Math.pow(Math.abs(Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i))),2);
}
}
//Return the square root of the sum
return Math.sqrt(deltaSum);
}
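To make sure I'm reading measureDistance correctly, here is a standalone copy I can run locally (same logic, just wrapped in a class of my own; the T1 value below is made up since I haven't shown the real `ClusterJob.T1`):

```java
import java.util.*;

public class DistanceCheck {
    // Standalone copy of the Euclidean distance over stringified coordinates
    static double measureDistance(List<String> origin, List<String> destination) {
        double deltaSum = 0.0;
        // Only compare dimensions that exist in both vectors
        for (int i = 0; i < origin.size() && i < destination.size(); i++) {
            double delta = Double.parseDouble(origin.get(i)) - Double.parseDouble(destination.get(i));
            deltaSum += delta * delta;
        }
        return Math.sqrt(deltaSum);
    }

    public static void main(String[] args) {
        double d = measureDistance(Arrays.asList("0", "3"), Arrays.asList("4", "0"));
        System.out.println(d); // prints 5.0 (a 3-4-5 triangle)
        double t1 = 4.0; // hypothetical threshold, not the real ClusterJob.T1
        System.out.println(d <= t1); // prints false: this point would become a new center
    }
}
```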