How to specify tab as a record separator for hadoop input text file?

https://stackoverflow.com/questions/7271641

18-01-2021
Question

The input file to my Hadoop M/R job is a text file in which the records are separated by the tab character '\t' instead of the newline '\n'. How can I instruct Hadoop to split on the tab character? By default it splits on newlines, so each line in the text file is taken as a record.

One way to do it is to use a custom input format class that uses a filter stream to convert all tabs in the original stream to newlines. But this does not look elegant.

Another way would be to use java.util.Scanner with tab as the separator. But I cannot figure out how to use the java.util.Scanner class in the input format classes.

What is the best approach, and what are the alternatives?

Solution

The values '\r' and '\n' are hard-coded in the org.apache.hadoop.util.LineReader class, so you can't use TextInputFormat with tab-separated records. It is not difficult, however, to implement your own InputFormat with a custom LineReader class. The simplest solution is to copy the TextInputFormat, LineRecordReader and LineReader classes into your own package and change the LineReader implementation so that it splits on '\t'.
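To illustrate the kind of change involved, here is a minimal sketch of a reader that ends a record at '\t' rather than at '\n'. It is plain Java and deliberately does not depend on Hadoop: the class name TabRecordReader and its readRecord/readAll methods are hypothetical, not part of any Hadoop API. A real copy of LineReader would additionally have to fill a Text object, report the number of bytes consumed, and cooperate with LineRecordReader so that records spanning input split boundaries are handled correctly.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a modified LineReader: records end at '\t', not '\n'. */
final class TabRecordReader {
    // The one value that differs from a newline-based reader.
    private static final int DELIMITER = '\t';

    private final InputStream in;

    TabRecordReader(InputStream in) {
        this.in = new BufferedInputStream(in);
    }

    /** Returns the next record without its delimiter, or null at end of stream. */
    String readRecord() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        boolean sawAnyByte = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAnyByte = true;
            if (b == DELIMITER) {
                // Delimiter reached: emit what was collected (possibly empty).
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
            buffer.write(b);
        }
        // End of stream: emit a final unterminated record, if there is one.
        return sawAnyByte ? new String(buffer.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    /** Convenience helper: reads every record from the stream into a list. */
    static List<String> readAll(InputStream in) {
        TabRecordReader reader = new TabRecordReader(in);
        List<String> records = new ArrayList<>();
        try {
            String record;
            while ((record = reader.readRecord()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\tbeta\tgamma".getBytes(StandardCharsets.UTF_8);
        for (String record : readAll(new ByteArrayInputStream(data))) {
            System.out.println(record);
        }
    }
}
```

Splitting on the raw byte 0x09 is safe for UTF-8 input because that byte never occurs inside a multi-byte sequence. As with newline handling, a trailing tab at the end of the stream does not produce an extra empty record in this sketch.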

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow