Is it possible to associate an instance of an object with one file while it's being mapped by a map-only mapred Job?

StackOverflow https://stackoverflow.com/questions/19014834

Question

I want to use a HashSet that exists/works against one file while it's being mapped, and then is reset/recreated when the next file is being mapped. I have modified TextInputFormat to override isSplitable to return false, so that the file is not split up and is processed as a whole by Mappers. Is it possible to do something like this? Or is there another way to do fewer writes to the Accumulo table?

Let me start out by saying that I do not believe I want a global variable. I just want to ensure uniqueness and thus write fewer mutations to my Accumulo table.

My project is to convert the functionality of the Index.java file from the shard example from a linear Accumulo client program to one that uses mapreduce functionality, while still creating the same table in Accumulo. It needs to be mapreduce because that's the buzzword, and in essence it would run faster than a linear program against terabytes of data.

Here is the Index code for reference: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.accumulo/examples-simple/1.4.0/org/apache/accumulo/examples/simple/shard/Index.java

This program uses a BatchWriter to write Mutations to Accumulo and does so on a per-file basis. To avoid writing more mutations than necessary and to ensure uniqueness (though I do believe Accumulo eventually merges identical keys through compaction), Index.java has a HashSet that is used to determine whether a word has been encountered before. This is all relatively simple to understand.
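As a side note, the seen-before check and the remembering step can be collapsed into a single call, because `HashSet.add` returns `true` only the first time a value is inserted. A minimal illustration of that idiom (this is my own sketch, not the actual Index.java code; the class and method names are invented):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the HashSet dedup idiom used by Index.java.
public class SeenCheck {
    // Returns each word the first time it appears, lowercased;
    // repeats are skipped.
    public static List<String> firstOccurrences(String[] words) {
        Set<String> tokensSeen = new HashSet<>();
        List<String> fresh = new ArrayList<>();
        for (String word : words) {
            String w = word.toLowerCase();
            if (tokensSeen.add(w)) {   // add() returns false for duplicates
                fresh.add(w);
            }
        }
        return fresh;
    }
}
```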

Moving to a map-only mapreduce job is more complex.

This was my attempt at mapping, which seems to kind of work judging from the partial output I've seen in the Accumulo table, but runs really slowly compared to the linear Index.java program:

public static class MapClass extends Mapper<LongWritable,Text,Text,Mutation> {
        private HashSet<String> tokensSeen = new HashSet<String>();

        @Override
        public void map(LongWritable key, Text value, Context output)
                throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) output.getInputSplit();
            String filePath = fileSplit.getPath().toString();
            filePath = filePath.replace("unprocessed", "processed");

            String[] words = value.toString().split("\\W+");

            for (String word : words) {
                word = word.toLowerCase();
                // Only build and write a mutation the first time this
                // mapper sees the word; add() returns false for duplicates.
                // (The original version created and wrote a Mutation for
                // every word, including empty ones for already-seen words.)
                if (tokensSeen.add(word)) {
                    Mutation mutation = new Mutation(genPartition(filePath.hashCode() % 10));
                    mutation.put(new Text(word), new Text(filePath), new Value(new byte[0]));
                    output.write(null, mutation);
                }
            }
        }
    }

And the slowness might come from the fact that I'm running all of this on a test instance: a single-node instance of Hadoop with ZooKeeper and Accumulo on top. If that's the case, I just need to find a solution for uniqueness.

Any help or advice provided is greatly appreciated.


Solution

Mapper has setup and cleanup methods that you can override to handle this kind of thing more cleanly. setup is called once, then map is called many times (once for every record), then cleanup is called once at the end. The idea would be that you create the HashSet in the setup method, build it up in map, and commit everything in cleanup, or periodically flush from within some of the map calls if necessary.
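To make the lifecycle concrete, here is a Hadoop-free sketch of that pattern. The class and method names (`PerFileDeduper`, `commit`, the `committed` list standing in for `context.write`) are invented for illustration; in a real job, `setup`, `map`, and `cleanup` would be the overridden Mapper methods and the commit step would write buffered Mutations to the Context:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the setup/map/cleanup lifecycle described above,
// with no Hadoop dependencies so the flow is easy to follow.
public class PerFileDeduper {
    private Set<String> tokensSeen;
    private List<String> buffered;   // stand-in for pending Mutations
    private List<String> committed;  // stand-in for context.write output

    // Called once, before any record is processed.
    public void setup() {
        tokensSeen = new HashSet<>();
        buffered = new ArrayList<>();
        committed = new ArrayList<>();
    }

    // Called once per record; only first occurrences are buffered.
    public void map(String word) {
        String w = word.toLowerCase();
        if (tokensSeen.add(w)) {     // add() returns false for duplicates
            buffered.add(w);
        }
    }

    // Called once, after the last record; commit everything here
    // (or call this periodically from map to bound memory use).
    public void cleanup() {
        committed.addAll(buffered);
        buffered.clear();
    }

    public List<String> committed() {
        return committed;
    }
}
```

The same shape drops straight into the MapClass above: move the HashSet creation into setup, keep the dedup check in map, and do the final writes in cleanup.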

However, you will almost certainly not see any improvement in runtime until you move to a real cluster. A single-node test instance has almost no benefit over a simple linear program, except that the same code will run much faster once you get a real Hadoop cluster.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow