Question

AFAIK, Hadoop Streaming only supports text input, which means the data is organized by lines. But the mapper code becomes messy if we want backward compatibility, i.e. supporting different versions of log lines in the same mapper program written in C++.

I have considered Avro and Protobuf, but it seems that they are not supported in streaming mode. Is that true?

And is there any other solution?

Solution

Other input/output formats can also be used along with Hadoop Streaming.

Avro support has been added to Hadoop Streaming; see AVRO-808 and AVRO-830. This thread might also be useful.

I could not find InputFormat and OutputFormat classes for Protobuf, so they would need to be written as custom classes.

OTHER TIPS

Just for information, Hadoop Streaming supports binary input/output.

Look for the -io rawbytes option.

I created a prototype which was able to consume a SequenceFile (I think; it was long ago).

I abandoned the idea because I had to deserialize Java Hadoop *Writables from the stream, and C#'s BinaryReader uses little-endian encoding while Java uses big-endian. So the mapper became more complicated than it should be.
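
The same byte-order issue would apply to a C++ mapper on a little-endian host. As a rough sketch, fixed-width integers written by Java's DataOutput (writeInt, writeLong) can be decoded byte by byte, which works regardless of the host's endianness. The helper names below are hypothetical, and this only covers plain 4-byte and 8-byte integers, not variable-length encodings.

```cpp
#include <cstddef>
#include <cstdint>

// Assembles a big-endian unsigned value from sizeof(U) bytes,
// most significant byte first, independent of host byte order.
template <typename U>
U load_be(const unsigned char* p) {
    U v = 0;
    for (std::size_t i = 0; i < sizeof(U); ++i) {
        v = static_cast<U>((v << 8) | p[i]);
    }
    return v;
}

// Decodes a 4-byte signed integer as written by Java's writeInt.
// The unsigned-to-signed cast assumes two's complement, which holds
// on mainstream compilers and is guaranteed since C++20.
std::int32_t decode_java_int(const unsigned char* p) {
    return static_cast<std::int32_t>(load_be<std::uint32_t>(p));
}

// Decodes an 8-byte signed integer as written by Java's writeLong.
std::int64_t decode_java_long(const unsigned char* p) {
    return static_cast<std::int64_t>(load_be<std::uint64_t>(p));
}
```

Shifting bytes in one at a time avoids reinterpreting the buffer as an integer, so no byte swap or alignment handling is needed.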

Anyway, it is possible.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow