Question

I am very new to M/R programs. I have a file in HDFS with data in this structure:

EmpId,EmpName,Dept,Salary

1231,userName1,Dept1,5000
1232,userName2,Dept2,6000
1233,userName3,Dept3,7000
.
.
.........................

Now I want to find the name of the employee who earns the highest salary.

I have written a MapReduce job to find the highest salary. In my mapper class I emit the output like this:

output.collect("max value", salary of the employee);

In the reducer I find the maximum of the values for the key "max value". Now I want to use this value in a mapper and find the names of the employees who earn the maximum salary. How can I send the reducer output to a mapper as input? Is this a good approach to accomplish my task? Any other suggestions?


Solution

I would make the map emit the full tuple for the max salary. For that, create a class (for the value) that implements the Writable interface (http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/io/Writable.html). Maybe TupleWritable suits your needs (it is not very complex).

Since only one value is emitted per map, network traffic is not an issue, and it is fine to receive the full tuple data in the reducer. Your reducer then just has to pick the top record from the "max" values.
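To make the idea concrete, here is a minimal sketch in plain Java that simulates it locally, without any Hadoop dependency. The class and method names (MaxSalarySketch, mapLocalMax, reduceGlobalMax) are made up for illustration, and it assumes the header line has already been skipped and that every record has the salary as its fourth comma-separated field. In a real job the map side would emit a Writable value instead of a String.

```java
import java.util.Arrays;
import java.util.List;

// Local, Hadoop-free simulation: each map task emits one full record
// (its local maximum) and a single reducer keeps the overall top record.
class MaxSalarySketch {

    // Parses the salary from a line such as "1231,userName1,Dept1,5000".
    static int salaryOf(String line) {
        return Integer.parseInt(line.split(",")[3].trim());
    }

    // Stands in for one map task: returns the record with the highest
    // salary in its input split, or null if the split is empty.
    static String mapLocalMax(List<String> split) {
        String best = null;
        for (String line : split) {
            if (best == null || salaryOf(line) > salaryOf(best)) {
                best = line;
            }
        }
        return best;
    }

    // Stands in for the reducer: receives one candidate per map task and
    // keeps the one with the highest salary (empty maps are skipped).
    static String reduceGlobalMax(List<String> candidates) {
        String best = null;
        for (String candidate : candidates) {
            if (candidate == null) {
                continue;
            }
            if (best == null || salaryOf(candidate) > salaryOf(best)) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> split1 = Arrays.asList(
                "1231,userName1,Dept1,5000",
                "1232,userName2,Dept2,6000");
        List<String> split2 = Arrays.asList("1233,userName3,Dept3,7000");

        String top = reduceGlobalMax(
                Arrays.asList(mapLocalMax(split1), mapLocalMax(split2)));
        // The full record is available, so the name is just the second field.
        System.out.println(top.split(",")[1]);
    }
}
```

Note that this sketch keeps only the first record it sees when several employees share the top salary; if all of them are needed, each stage would have to carry a list of tied records instead of a single one.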

For more complex problems, you will have to think about chaining jobs (http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining)

Other tips

I can suggest the following solution:

1. Find the max salary using your MapReduce job.

2. Read the max salary from HDFS (it should be in a file in the output folder of your job).

3. Save the max salary to the configuration, say `configuration.set("max.salary", maxSalary);`

4. Create a new mapper-only job. The mapper of this job should read the maxSalary value from the configuration in the setup method and, in the map method, keep only the employees whose salary is equal to maxSalary. Pass your data to this job.
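The filtering logic of step 4 can be sketched in plain Java as below. This is only a local simulation: a java.util.Map stands in for the Hadoop job configuration, the class and method names (MaxSalaryFilterSketch, namesWithSalary) are hypothetical, and the input is assumed to contain data records only, with the salary as the fourth comma-separated field. The "max.salary" key is the one used in step 3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local, Hadoop-free simulation of the mapper-only filtering job.
class MaxSalaryFilterSketch {

    // Returns the names of all employees whose salary equals the value
    // stored under "max.salary" in the (simulated) configuration.
    static List<String> namesWithSalary(Map<String, String> conf, List<String> lines) {
        // Corresponds to setup(): read the value once, before any record.
        int maxSalary = Integer.parseInt(conf.get("max.salary"));

        List<String> names = new ArrayList<>();
        // Corresponds to map(): one call per input record.
        for (String line : lines) {
            String[] fields = line.split(",");
            int salary = Integer.parseInt(fields[3].trim());
            if (salary == maxSalary) {
                names.add(fields[1]); // emit the employee name
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("max.salary", "7000"); // value found by the first job

        List<String> lines = Arrays.asList(
                "1231,userName1,Dept1,5000",
                "1232,userName2,Dept2,6000",
                "1233,userName3,Dept3,7000");
        System.out.println(namesWithSalary(conf, lines));
    }
}
```

Unlike a single-record maximum, this pass returns every employee tied at the top salary, at the cost of reading the data a second time.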

As the result, you'll

P.S. A better way, though, is to use Hive or Pig for this kind of task: when a task does not involve complicated math or business logic, it is much easier to implement in high-level tools such as Hive and Pig (and some others).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow