Hadoop - textoutputformat.separator use ctrlA ( ^A )

https://stackoverflow.com/questions/13465795

30-11-2021

Question

I'm trying to use ^A as the separator between Key and Value in my reduce output files. I found that the config setting "mapred.textoutputformat.separator" is what I want and this correctly switches the separator to ",":

conf.set("mapred.textoutputformat.separator", ",");

But it can't handle the ^A character:

conf.set("mapred.textoutputformat.separator", "\u0001");

throws this error:

ERROR security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 68; columnNumber: 94; Character reference "&#

I found this ticket, https://issues.apache.org/jira/browse/HADOOP-7542, and see that they tried to fix this but reverted the patch due to XML 1.1 concerns.
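The parse failure can be reproduced without a cluster. The sketch below assumes the job configuration is serialized as XML 1.0, which is what the SAXParseException and the ticket suggest; the class name ControlCharXmlCheck and the value element are made up for illustration. It feeds the JDK's default XML parser one value containing a comma and one containing a character reference to U+0001.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Hadoop: checks whether a small XML
// document is accepted by the JDK's default (XML 1.0) parser.
final class ControlCharXmlCheck {

    // Returns true if the document parses, false if it is not well-formed.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) is raised for invalid characters.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<value>,</value>"));     // true
        System.out.println(parses("<value>&#1;</value>"));  // false
    }
}
```

XML 1.0 does not allow U+0001 even as a character reference, so any workaround has to keep the raw character out of the serialized configuration.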

So I'm wondering if anyone has had success setting the separator to ^A (which seems like a pretty common choice) using an easy workaround, or if I should just settle for the tab separator.

Thanks!

I'm running Hadoop 0.20.2-cdh3u5 on CentOS 6.2

Solution

Looking around, I found three possible options for solving this problem, described in these related questions:

Character reference “&#1” is an invalid XML character - similar SO question
Unicode characters/Ctrl G or Ctrl A as TextOutputFormat (Hadoop) delimiter

The possible solutions, as detailed in the links above, are:

Base64-encode the separator character before putting it in the configuration. You then need to create a custom TextOutputFormat that overrides the getRecordWriter method and decodes the Base64-encoded separator.
Create a custom TextOutputFormat, as above, but change its default separator character from a tab to the one you need.
Provide the delimiter through an XML resource file. You can specify a custom resource file using the addResource() method of the job's Configuration.
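The encode/decode half of the first option can be sketched with the standard library alone. This is a minimal illustration, not the linked answer's code: the property name is hypothetical, a plain Map stands in for Hadoop's Configuration, and java.util.Base64 assumes Java 8 or later (on an older JVM, Apache Commons Codec's Base64 class could play the same role). In a real job, the decode step would live inside the custom TextOutputFormat's getRecordWriter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Sketch of option 1: keep the control character out of the job XML by
// storing it Base64-encoded, then decode it where the separator is needed.
final class SeparatorCodec {

    // Hypothetical property name, not a built-in Hadoop key.
    static final String KEY = "example.textoutputformat.separator.base64";

    // Encode the raw separator so only XML-safe ASCII reaches the config.
    static String encode(String separator) {
        return Base64.getEncoder()
                .encodeToString(separator.getBytes(StandardCharsets.UTF_8));
    }

    // Decode the stored value back to the raw separator.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A plain Map stands in for Hadoop's Configuration in this sketch.
        Map<String, String> conf = new HashMap<>();
        conf.put(KEY, encode("\u0001"));

        String separator = decode(conf.get(KEY));
        System.out.println(conf.get(KEY));              // prints AQ==
        System.out.println((int) separator.charAt(0));  // prints 1
    }
}
```

Because the stored value contains only Base64 characters, it is serialized into the job XML without an invalid character reference.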

Licensed under: CC-BY-SA with attribution
