By default, mrjob writes each output record as key[tab]value.

This happens even if the key (or the value) is empty, null, or otherwise not interesting. Suppose my key, value pair is None, {"a":1, "b":2}. Then I get this:

None    {"a":1, "b":2}

Is there a way to suppress the key or the value? I just want this:

{"a":1, "b":2}

BTW, I've already tried this. Am I missing something...?

from mrjob.job import MRJob
import mrjob.protocol


class MyMrJobClass(MRJob):
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

    def step1_mapper(self, _, line):
        ...
        yield my_key, my_value

    def step1_reducer(self, key, values):
        for v in values:
            ...
        yield None, my_data

    def steps(self):
        return [
            self.mr(
                mapper=self.step1_mapper,
                reducer=self.step1_reducer,
            ),
        ]

NB: I know that I don't need to override steps for a single-step job. This will eventually be a multistep job, so it's important to build the class that way.

Thanks!


Solution

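Use mrjob.protocol.JSONValueProtocol (note the Value in the name) as the output protocol instead of mrjob.protocol.JSONProtocol. JSONProtocol writes the JSON-encoded key, a tab, and the JSON-encoded value; JSONValueProtocol discards the key and writes only the JSON-encoded value.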
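To make the difference between the two output formats concrete, here is a minimal sketch that uses only the standard library. The helpers write_json_pair and write_json_value are hypothetical names made up for illustration. They imitate the line format each protocol produces and are not mrjob's actual implementation.

```python
import json


def write_json_pair(key, value):
    # Hypothetical stand-in for the JSONProtocol line format:
    # JSON-encoded key, a tab, then the JSON-encoded value.
    return json.dumps(key) + "\t" + json.dumps(value)


def write_json_value(key, value):
    # Hypothetical stand-in for the JSONValueProtocol line format:
    # the key is ignored and only the JSON-encoded value is written.
    return json.dumps(value)


record = {"a": 1, "b": 2}

# A None key is encoded as JSON null, so the key column is still there.
pair_line = write_json_pair(None, record)

# Only the value survives, which is the output the question asks for.
value_line = write_json_value(None, record)

print(pair_line)
print(value_line)
```

In the job class from the question, the corresponding change would be the single line OUTPUT_PROTOCOL = mrjob.protocol.JSONValueProtocol. The reducer can keep yielding None as the key, since the key is dropped on output.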

The documentation has more information on using custom protocols.
