Looking around, I've found what look like three options for solving this problem. They come from these related questions:
- Character reference “” is an invalid XML character - similar SO question
- Unicode characters/Ctrl G or Ctrl A as TextOutputFormat (Hadoop) delimiter
The possible solutions, as detailed in the link above, are:
- Base64-encode the separator character. You then need to create a custom TextOutputFormat that overrides the getRecordWriter method and decodes the Base64-encoded separator.
- Create a custom TextOutputFormat as above, but instead of decoding a value, change its default separator from a tab to the character you need.
- Provide the delimiter through an XML resource file. You can specify a custom resource file using the addResource() method of the job's Configuration.
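
For the first option, the encode/decode step could look roughly like the sketch below. It uses only the JDK's java.util.Base64 and leaves the Hadoop wiring out. The class name SeparatorCodec and both method names are made up for illustration, and Ctrl A (U+0001) is assumed as the separator. The idea is that the driver stores the encoded string in the job configuration and the custom getRecordWriter decodes it again before writing records.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Hadoop: round-trips a separator
// through Base64 so that a control character such as Ctrl A never
// appears literally in the configuration value.
public class SeparatorCodec {

    // Called in the driver before the value is put into the configuration.
    static String encode(String separator) {
        byte[] raw = separator.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(raw);
    }

    // Called inside the custom getRecordWriter to recover the real separator.
    static String decode(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ctrlA = "\u0001";            // assumed separator: Ctrl A
        String encoded = encode(ctrlA);     // safe to store as plain text
        String decoded = decode(encoded);   // what the record writer would use

        System.out.println(encoded);                 // prints "AQ=="
        System.out.println(decoded.equals(ctrlA));   // prints "true"
    }
}
```

The encoded value is plain ASCII, so it should not trip the XML parser the way the raw control character does.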