I am using a zero-reducer (map-only) approach to my problem. I want to preprocess data from one file and write it out as another file, but without the extra newlines and tab delimiters. How can I make my map job write the processed data in the same format it came in, minus whatever the preprocessing removed?
That is, I have something like this:
Input (before preprocessing):
<TITLE> Herp derp </Title> I am a major general
Current output (after processing):
Herp
Derp
I
am
a
major
general
What I want it to output is this:
Herp Derp I am a major general
I believe the issue is with this line of code:
job.setOutputFormatClass(TextOutputFormat.class);
However, when I tried, quite naively, to do something like:
job.setOutputFormatClass(null);
It obviously did not work. Is there a provided output format class that I can use to do this? If not, how could I write my own class to output everything the way I want? I am new to Hadoop and MapReduce.
I have included my map function below. I do not want to use a reducer, since that would introduce a sort between the map and reduce phases.
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        //Did preprocessing here, irrelevant to my problem
        context.write(word, null);
    }
}
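To make the desired behavior concrete, here is a minimal plain-Java sketch (no Hadoop involved) of the per-line result I am after: the tokens of one input line joined back into a single space-separated line. The class name LineJoiner, the method joinTokens, and the keep filter are all hypothetical; keep is only a stand-in for my real preprocessing and simply drops tokens that look like tags.

```java
import java.util.StringTokenizer;

public class LineJoiner {

    // Hypothetical stand-in for the real preprocessing:
    // keeps every token except ones that look like <tags>.
    static boolean keep(String token) {
        return !(token.startsWith("<") && token.endsWith(">"));
    }

    // Splits the line on whitespace, filters the tokens, and joins the
    // survivors with single spaces so the result is one output line.
    static String joinTokens(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        StringBuilder out = new StringBuilder();
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (!keep(token)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinTokens("<TITLE> Herp derp </Title> I am a major general"));
    }
}
```

Presumably the mapper would then need to write this joined string once per input line, rather than calling context.write once per token as my map function does now, but I am not sure whether that is the right approach or whether a different output format is needed.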
I have also googled this and read the Apache Hadoop API documentation to see if I could glean an answer.