I figured out the solution to this problem thanks to a very helpful hint on the hadoop-user mailing list.
In short, we need to override how Hadoop streaming writes data to, and reads data from, the standard streams of the streaming process. To do this:
- Extend `InputWriter` and `OutputReader`, and also provide your own `InputFormat` and `OutputFormat`, so that you completely control how bytes are written to and read from the stream (see the sketch after this list).
- Extend the `IdentifierResolver` class to tell Hadoop to use your own `InputWriter` and `OutputReader`.
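For illustration, here is a minimal sketch of the `InputWriter` side, assuming the `org.apache.hadoop.streaming.io` API; the `my.own` package, the class name, and the length-prefixed byte framing are hypothetical examples, not part of the original answer:

```java
package my.own;

import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.streaming.PipeMapRed;
import org.apache.hadoop.streaming.io.InputWriter;

// Sends each key and value to the streaming process's stdin as a
// 4-byte length prefix followed by the raw payload, instead of the
// default tab/newline-delimited text protocol. Assumes the custom
// InputFormat produces BytesWritable keys and values.
public class CustomInputWriter extends InputWriter<BytesWritable, BytesWritable> {

  private DataOutput clientOut;

  @Override
  public void initialize(PipeMapRed pipeMapRed) throws IOException {
    super.initialize(pipeMapRed);
    // Stream connected to the stdin of the mapper/reducer process.
    clientOut = pipeMapRed.getClientOutput();
  }

  @Override
  public void writeKey(BytesWritable key) throws IOException {
    writeLengthPrefixed(key);
  }

  @Override
  public void writeValue(BytesWritable value) throws IOException {
    writeLengthPrefixed(value);
  }

  private void writeLengthPrefixed(BytesWritable w) throws IOException {
    clientOut.writeInt(w.getLength());
    clientOut.write(w.getBytes(), 0, w.getLength());
  }
}
```

A custom `OutputReader` is the mirror image: it parses the process's stdout in `readKeyValue()` and hands the results back through `getCurrentKey()` and `getCurrentValue()`.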
Use your `IdentifierResolver`, `InputFormat`, and `OutputFormat` as follows:
```
hadoop jar <streaming jar location> \
    -D stream.io.identifier.resolver.class=my.own.CustomIdentifierResolver \
    -libjars <my input format jar> \
    -mapper /bin/cat \
    -inputformat my.own.CustomInputFormat \
    -outputformat my.own.CustomOutputFormat \
    <other options ...>
```
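For completeness, a minimal sketch of the resolver that the `-D stream.io.identifier.resolver.class` option above points to might look like this (it reuses the `CustomInputWriter` sketched earlier and, to keep the example self-contained, the built-in `RawBytesOutputReader`; in a real job you would plug in your own `OutputReader` subclass):

```java
package my.own;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.streaming.io.IdentifierResolver;
import org.apache.hadoop.streaming.io.RawBytesOutputReader;

// Tells Hadoop streaming which InputWriter/OutputReader classes to use.
public class CustomIdentifierResolver extends IdentifierResolver {

  @Override
  public void resolve(String identifier) {
    // Keep the built-in identifiers ("text", "rawbytes", "typedbytes")
    // working as defaults, then swap in the custom classes.
    super.resolve(identifier);
    setInputWriterClass(CustomInputWriter.class);
    setOutputReaderClass(RawBytesOutputReader.class); // or your own OutputReader
    setOutputKeyClass(BytesWritable.class);
    setOutputValueClass(BytesWritable.class);
  }
}
```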
The patch attached to the (unmerged) feature request MAPREDUCE-5018 is a great example of how to do this and can be customized to fit one's needs.