Backward compatibility of Hadoop Streaming
Question
AFAIK, Hadoop Streaming only supports text input, which means the data is organized by lines. But the mapper code becomes messy if we want backward compatibility, i.e. supporting different versions of log lines in the same mapper program written in C++.
I considered Avro or Protobuf, but it seems they are not supported in streaming mode. Is that true?
And is there any other solution?
OTHER TIPS
Just for information, Hadoop Streaming supports binary input/output.
Look for the -io rawbytes option.
I created a prototype which was able to consume a SequenceFile (I think; it was long ago).
I abandoned the idea because I had to deserialize Java Hadoop *Writables from the stream. And C# BinaryReader uses little-endian encoding, while Java uses big-endian. So the mapper became more complicated than it should be.
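For a C++ mapper the endianness part is manageable if you decode byte by byte instead of casting the buffer. The sketch below covers only fixed-width integers as Java's DataOutput.writeInt and writeLong emit them (which is what IntWritable and LongWritable use); other Writables have their own layouts and are not handled here. The function names are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Decodes an unsigned big-endian integer of n bytes (n <= 8) starting at p.
// Shifting byte by byte gives the same result on little- and big-endian hosts.
std::uint64_t decode_be(const unsigned char* p, std::size_t n) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v = (v << 8) | p[i];
    }
    return v;
}

// Java int as written by DataOutput.writeInt: 4 bytes, big-endian.
std::int32_t decode_java_int(const unsigned char* p) {
    const std::uint32_t u = static_cast<std::uint32_t>(decode_be(p, 4));
    std::int32_t s;
    std::memcpy(&s, &u, sizeof s);  // reinterpret the bits as signed
    return s;
}

// Java long as written by DataOutput.writeLong: 8 bytes, big-endian.
std::int64_t decode_java_long(const unsigned char* p) {
    const std::uint64_t u = decode_be(p, 8);
    std::int64_t s;
    std::memcpy(&s, &u, sizeof s);  // reinterpret the bits as signed
    return s;
}
```

For example, the bytes 00 00 01 02 decode to 258 and FF FF FF FE decode to -2 with decode_java_int.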
Anyway, it is possible.