Question

I'm wondering what the best practice is for implementing an M/R filter that does the following:

Let's say there are key-value pairs of this form:

Key: integer ID. Value: n integer values. For example:

1 | 1 2 2 3 3 0 6

2 | 0 3 4 5 6 7 8

3 | 1 5 2 6 2 2 6

I would like to filter out (exclude) every column that contains a 0.

Desired output:

1 | 2 2 3 3 6

2 | 3 4 5 6 8

3 | 5 2 6 2 6

Thanks


Solution

This doesn't look like a good fit for M/R at all, since a reducer needs to see the values from all rows before it can make a decision about a column.

I'd be interested to see what the actual problem is and why you decided to go with M/R in the first place.
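For comparison, here is a minimal sketch of the same filter without M/R, assuming the whole table fits in memory and every row has the same number of values. It is plain Python, and the name `drop_zero_columns` is made up for illustration.

```python
from typing import Dict, List


def drop_zero_columns(rows: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Remove every column in which at least one row holds a 0."""
    # Pass 1: look at each column across all rows; keep it only if it has no 0.
    keep = [all(v != 0 for v in column) for column in zip(*rows.values())]
    # Pass 2: rewrite each row using only the kept columns.
    return {
        key: [v for v, kept in zip(values, keep) if kept]
        for key, values in rows.items()
    }


rows = {
    1: [1, 2, 2, 3, 3, 0, 6],
    2: [0, 3, 4, 5, 6, 7, 8],
    3: [1, 5, 2, 6, 2, 2, 6],
}
filtered = drop_zero_columns(rows)
for key, values in filtered.items():
    print(key, "|", " ".join(map(str, values)))
```

The first pass is the part that needs a view of every row at once, which is what makes the M/R version awkward.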

If I had to do this in M/R, I'd have the mapper split each row into ([col#,rowkey],value) pairs. The col# ensures that all data from one column ends up in one reducer, which can then decide whether to drop the column. The rowkey is used later to combine the results from all the reducers back into a single row.

For example, the first row from your example will be sent from the mapper to the reducer as:

([0,1],1)

([1,1],2)

([2,1],2)

([3,1],3)

([4,1],3)

([5,1],0)

([6,1],6)
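A rough sketch of that mapper, written as plain Python rather than against the Hadoop API. The function name `map_row` and the tuple layout are assumptions for illustration only.

```python
from typing import Iterator, List, Tuple

# One map-output record: ((column index, row key), value)
Record = Tuple[Tuple[int, int], int]


def map_row(row_key: int, values: List[int]) -> Iterator[Record]:
    """Emit one ((col, row_key), value) record per cell of the row."""
    for col, value in enumerate(values):
        yield (col, row_key), value


records = list(map_row(1, [1, 2, 2, 3, 3, 0, 6]))
for record in records:
    print(record)
```

Run on the first row, this produces the seven records listed above, one per column.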

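Then you'll need a partitioner that distributes the map output across reducers based on the column number (i.e. the first element of the [col#,rowkey] pair). Also write a custom comparator, so that the map results arrive at the reducer sorted by value. Note that Hadoop sorts map output by key only, so sorting by value means carrying the value in the composite key as well (the usual secondary-sort pattern).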

This way the reducer only needs to look at the first value, assuming an ascending sort and no negative values (as in your example). If it's 0, we know the column contains a 0 and the reducer can exit without doing anything else. If it's not 0, it should act as an identity reducer and output all the results from the mapper as is.
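The first job can be simulated in plain Python as below. This is only a sketch: the shuffle is faked with dictionaries, the names `partition`, `reduce_column` and `run_job1` are hypothetical, and the "first value is 0" shortcut assumes values are non-negative and sorted in ascending order.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[Tuple[int, int], int]  # ((col, row_key), value)


def partition(col: int, num_reducers: int) -> int:
    """Route every record of a given column to the same reducer."""
    return col % num_reducers


def reduce_column(sorted_records: List[Record]) -> Iterator[Record]:
    """Records arrive sorted by value, so a 0 (if present) comes first."""
    if sorted_records and sorted_records[0][1] == 0:
        return  # the column contains a 0: emit nothing
    yield from sorted_records  # otherwise behave like an identity reducer


def run_job1(map_output: Iterable[Record], num_reducers: int = 3) -> List[Record]:
    # Simulated shuffle: the partitioner picks a reducer, grouping is by column.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for (col, row_key), value in map_output:
        reducers[partition(col, num_reducers)][col].append(((col, row_key), value))
    out: List[Record] = []
    for groups in reducers:
        for col in sorted(groups):
            # Simulated sort comparator: ascending by value within a column.
            group = sorted(groups[col], key=lambda record: record[1])
            out.extend(reduce_column(group))
    return out


rows = {
    1: [1, 2, 2, 3, 3, 0, 6],
    2: [0, 3, 4, 5, 6, 7, 8],
    3: [1, 5, 2, 6, 2, 2, 6],
}
map_output = [((c, k), v) for k, vals in rows.items() for c, v in enumerate(vals)]
survivors = run_job1(map_output)
surviving_cols = sorted({col for (col, _), _ in survivors})
print(surviving_cols)
```

For your example data, columns 0 and 5 are dropped because rows 2 and 1 have a 0 there.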

Now you need a second M/R job to put everything back together in the original format. The mapper does nothing (it is an identity mapper). A custom partitioner sends all the results with the same rowkey to the same reducer. You can use a total-order partitioner if preserving row order in the final result set is important. A custom comparator orders the data in each partition by rowkey and then by col#.

The reducer concatenates all the values for the same row, in column order, into one string and outputs it as a line.
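The second job can be sketched the same way, again as a plain Python simulation rather than real Hadoop code. The name `run_job2` and the `key | values` output format are assumptions, and the input here is built by hand to mimic what the first job would emit for your example (columns 0 and 5 removed).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[Tuple[int, int], int]  # ((col, row_key), value)


def run_job2(job1_output: Iterable[Record]) -> List[str]:
    """Group surviving cells by row key and rebuild one line per row."""
    # Simulated partition/shuffle: everything with the same row key together.
    by_row: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for (col, row_key), value in job1_output:
        by_row[row_key].append((col, value))
    lines = []
    for row_key in sorted(by_row):  # rows in key order
        cells = sorted(by_row[row_key], key=lambda cell: cell[0])  # by col#
        lines.append(f"{row_key} | " + " ".join(str(v) for _, v in cells))
    return lines


rows = {
    1: [1, 2, 2, 3, 3, 0, 6],
    2: [0, 3, 4, 5, 6, 7, 8],
    3: [1, 5, 2, 6, 2, 2, 6],
}
dropped = {0, 5}  # the columns that contain a 0 in this example
job1_output = [
    ((c, k), v)
    for k, vals in rows.items()
    for c, v in enumerate(vals)
    if c not in dropped
]
# Reverse the input to show that arrival order does not matter.
lines = run_job2(reversed(job1_output))
print("\n".join(lines))
```

Sorting by rowkey and then by col# is what restores the original layout, regardless of the order in which the first job's reducers wrote their output.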

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow