MapReduce: How can I output key/value pair without newlines?

https://stackoverflow.com/questions/19070640

29-06-2022

Question

I am using a zero-reducer (map-only) approach to my problem. I want to preprocess data from one file and write it out as another file, but without the extra newlines and tab delimiters. How can I make my map job write the processed data in the same layout it came in, minus whatever the preprocessing removed? That is, I have something like this:

Input (before preprocessing):

<TITLE> Herp derp </Title> I am a major general

Current output (after processing):

Herp 
Derp 
I 
am 
a
major
general

What I want it to do is this:

Herp Derp I am a major general

I believe the issue is with this line of code:

job.setOutputFormatClass(TextOutputFormat.class);

However, when I tried, quite naively, something like:

job.setOutputFormatClass(null);

It obviously did not work. Is there a provided output format class that I can use to do this? If not, how could I write my own class to output everything the way I want? I am new to Hadoop and MapReduce.

I have included my map function below. I do not want to use a reducer, because the data would be sorted between the map and reduce phases.

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);


            while (tokenizer.hasMoreTokens()) {

                word.set(tokenizer.nextToken());

                //Did preprocessing here, irrelevant to my problem

                context.write(word, null);
            }
        }

I have also googled this and read the Apache Hadoop API documentation to see if I could glean an answer.

Solution

In your mapper class, instead of splitting the line into individual words and writing each one out separately, try passing the entire processed line to a single call:

context.write(word, null);

That keeps the string you started with together. TextOutputFormat writes each record on its own line, so one write per input line gives one output line, instead of one line per token.

So, cut your string apart for the preprocessing work, then put it back together before you send it out with context.write.
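
A minimal sketch of that split, process, and rejoin step, written as plain Java so it runs without a Hadoop cluster. The class name LineRejoiner and the helper preprocessToken are hypothetical. The question does not show its actual preprocessing, so dropping tokens that look like markup tags is only an assumed stand-in for it.

```java
import java.util.StringJoiner;
import java.util.StringTokenizer;

public class LineRejoiner {

    // Hypothetical stand-in for the per-token preprocessing in the question:
    // drops tokens that look like markup tags and keeps everything else as-is.
    public static String preprocessToken(String token) {
        if (token.startsWith("<") && token.endsWith(">")) {
            return null; // signal "discard this token"
        }
        return token;
    }

    // Splits the line on whitespace, preprocesses each token, and joins the
    // surviving tokens back into one space-separated string.
    public static String rejoin(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        StringJoiner joined = new StringJoiner(" ");
        while (tokenizer.hasMoreTokens()) {
            String processed = preprocessToken(tokenizer.nextToken());
            if (processed != null) {
                joined.add(processed);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // One input line produces one output string.
        System.out.println(rejoin("<TITLE> Herp derp </Title> I am a major general"));
    }
}
```

Inside the mapper, the loop body would then shrink to a single write per input line, for example word.set(LineRejoiner.rejoin(value.toString())) followed by context.write(word, NullWritable.get()). NullWritable is the usual way to say "no value"; with TextOutputFormat only the key text is then written, with no tab separator.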

OTHER TIPS

If your mapper is writing multiple records containing the individual tokens from a single input line, then you will absolutely need a reducer to group those tokens back together into a single line for output. You can't do this without a reducer.

Licensed under: CC-BY-SA with attribution
