Question

I have two MapReduce jobs; the output of the first reducer is the input of the second mapper:

Map1 -> Reduce1 -> Map2 -> Reduce2

For now, Map2 reads from the files output by Reduce1, so Map1 -> Reduce1 and Map2 -> Reduce2 are two independent jobs.

It works, but it would be simpler and, I think, more efficient if the output of Reduce1 were directly the input of Map2.

Is there a way to do that? In this case Map2 would just be an identity mapper, so it would be even better if I could do:

Map1 -> Reduce1 -> Reduce2

Reduce1, Map2 and Reduce2 have the same input and output types.
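
Schematically, my driver currently looks something like this (a simplified sketch; Map1, Reduce1, Map2 and Reduce2 are my own classes, and the Text/IntWritable types just stand in for the real ones):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Driver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            Path intermediate = new Path(args[1]);   // written by Reduce1, read back by Map2
            Path output = new Path(args[2]);

            // Job 1: Map1 -> Reduce1
            Job job1 = Job.getInstance(conf, "stage-1");
            job1.setJarByClass(Driver.class);
            job1.setMapperClass(Map1.class);
            job1.setReducerClass(Reduce1.class);
            job1.setOutputKeyClass(Text.class);
            job1.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job1, input);
            FileOutputFormat.setOutputPath(job1, intermediate);
            if (!job1.waitForCompletion(true)) System.exit(1);

            // Job 2: Map2 -> Reduce2, where Map2 only re-reads Reduce1's files from HDFS
            Job job2 = Job.getInstance(conf, "stage-2");
            job2.setJarByClass(Driver.class);
            job2.setMapperClass(Map2.class);
            job2.setReducerClass(Reduce2.class);
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job2, intermediate);
            FileOutputFormat.setOutputPath(job2, output);
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }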

Thanks!


Solution

From my understanding, I can offer these points (they may or may not help; correct me if I'm wrong):

1) Map1 -> Reduce1 -> directly into Map2: this kind of optimization is addressed by the Spark cluster computing framework (using in-memory computation and avoiding unnecessary reads/writes to HDFS); see the first sketch after this list.

2) If you want something like Reducer1 -> Reducer2, you have to think about how to write the logic in a single reducer, but whether that is possible depends entirely on your requirements, i.e. on which keys you want to perform each aggregation (in more detail: Reducer1 receives one set of keys, and it is only on that set that it could carry out the next aggregation).

3) In Hadoop the protocol is strictly map --> then aggregation; if there is any further aggregation, it has to come through another mapper, either a user-defined mapper or an identity mapper (see the second sketch below).
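
For point 1, here is a rough sketch of what the same pipeline can look like in Spark (Java API): both aggregations are chained in memory and nothing is written to HDFS between them. The tab-split keying and the regrouping by first character are only made-up placeholder logic:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ChainedAggregation {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ChainedAggregation");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile(args[0]);

            // "Map1 -> Reduce1": first keying and aggregation, kept in memory
            JavaPairRDD<String, Integer> firstAgg = lines
                    .mapToPair(line -> new Tuple2<>(line.split("\t")[0], 1))
                    .reduceByKey(Integer::sum);

            // "-> Reduce2": second aggregation chained directly on the first result,
            // with no intermediate write to HDFS (here: regroup by first character)
            JavaPairRDD<String, Integer> secondAgg = firstAgg
                    .mapToPair(kv -> new Tuple2<>(kv._1().substring(0, 1), kv._2()))
                    .reduceByKey(Integer::sum);

            secondAgg.saveAsTextFile(args[1]);
            sc.stop();
        }
    }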
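
For point 3, a minimal sketch of the second job, assuming the new org.apache.hadoop.mapreduce API: the stock Mapper class already behaves as an identity mapper, so you do not have to write Map2 at all. This assumes job 1 wrote its output with SequenceFileOutputFormat, so the typed pairs can be read back as-is; Reduce2 and the Text/IntWritable types are the ones from the question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SecondStageDriver {
        public static void main(String[] args) throws Exception {
            // requires job 1 to have set: job1.setOutputFormatClass(SequenceFileOutputFormat.class);
            Job job2 = Job.getInstance(new Configuration(), "stage-2");
            job2.setJarByClass(SecondStageDriver.class);

            // The stock Mapper passes every (key, value) pair through unchanged,
            // which is exactly the identity mapper mentioned in point 3.
            job2.setMapperClass(Mapper.class);
            job2.setReducerClass(Reduce2.class);

            // Read the typed (Text, IntWritable) pairs back exactly as Reduce1 wrote them.
            job2.setInputFormatClass(SequenceFileInputFormat.class);
            job2.setOutputKeyClass(Text.class);
            job2.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job2, new Path(args[0]));   // Reduce1's output directory
            FileOutputFormat.setOutputPath(job2, new Path(args[1]));
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }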

hope this helps :)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow