Question

During development of our map-reduce jobs, our MR code generates useful diagnostic data structures independently of the data being map-reduced. Is there an easy way to get this data out to the code that called mapReduce, or to persist it in Mongo? Just writing to the log file is turning out to be very sub-optimal because (a) there is a lot of data there already and (b) our diagnostic info is highly structured and, in fact, we'd like to run queries against it.

My investigation so far suggests that MR data structures are passed by value (via serialization) so any in-memory data structures are lost, including those hooked to the "global" scope. The namespaces are isolated from the main JS server-side namespace so dbeval can't seem to reach them (or, at least, I don't know where to look). Last but not least, although all the database objects and functions are present, 10gen is generating (confusing) error messages to prevent their use, e.g., about coll.insert not being a function while typeof coll.insert === 'function' is true.

To be clear, I'm interested in doing this for development on a single node, because the logging/debugging support in MongoDB is pretty limited. Side effects of this kind are not good in production environments.

Solution

As surmised, it is not possible (as of MongoDB 2.2) to access another DB from within the Map/Reduce functions. Aside from the potential performance impact, there is also the possibility of creating deadlocks and other unwanted side effects.

Unfortunately, that leaves print() to the mongo log as the only "out of band" output option.

Depending on your diagnostic output, one approach to try would be:

  • add a unique marker that would allow you to identify the output (or even the output run) in the log output

  • serialize your output using tojson() so it is logged with some parseable structure and ideally emitted on a single line when you print()

  • write a script to tail the mongod log for lines matching your unique marker and insert those into another collection for reporting

Example of code that will run from within an M/R function:

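// diagrun and z are assumed to be in scope here, e.g. a run identifier
// and the key being emitted.
var diag = {
    'run'  : diagrun,
    'phase': 'map',
    'key'  : z
};
// Emit the diagnostic on one line, prefixed with a greppable marker.
print("MAPDIAG:" + tojson(diag));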
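The snippet above leaves open where diagrun comes from. One possibility is the scope option of mapReduce, which makes its fields available as globals inside the map and reduce functions. The following is only a sketch under assumptions: the animal field, the pets collection and the run value are hypothetical, and print(), tojson() and emit() are expected to be supplied by the server-side JavaScript environment rather than defined here.

```javascript
// Hypothetical run identifier; any value unique per job run would do.
var diagScope = { diagrun: "20120824" };

// Map function: `this` is the current document. The `animal` field is a
// made-up example key, not something taken from the original question.
var mapFn = function () {
    var diag = { run: diagrun, phase: "map", key: this.animal };
    print("MAPDIAG:" + tojson(diag));
    emit(this.animal, 1);
};

// Reduce function: sums the counts and logs one diagnostic line per call.
var reduceFn = function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    var diag = { run: diagrun, phase: "reduce", key: key, total: total };
    print("MAPDIAG:" + tojson(diag));
    return total;
};

// Options object: `scope` exposes diagrun to both functions,
// and `out: { inline: 1 }` returns the results without writing a collection.
var mrOptions = { out: { inline: 1 }, scope: diagScope };

// In the mongo shell, assuming a hypothetical collection named "pets":
// db.pets.mapReduce(mapFn, reduceFn, mrOptions);
```

Note that MongoDB does not call the reduce function for a key that has only one emitted value, so such a key produces map lines but no reduce line in the log.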

Example output:

$ tail -f mongo.log | grep "^MAPDIAG"
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "dog" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "cat" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "cat" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "mouse" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "cat" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "dog" }
MAPDIAG:{ "run" : "20120824", "phase" : "reduce", "key" : "cat", "total" : 3 }
MAPDIAG:{ "run" : "20120824", "phase" : "reduce", "key" : "dog", "total" : 2 }
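The last step of the approach (turning matching log lines back into documents) could look roughly like the sketch below. It assumes the payload after the marker is valid JSON, which holds for plain strings and numbers as in the example output but not for values that tojson() renders in shell syntax, such as ObjectId(...). The function names and the sample text are illustrative only.

```javascript
// Marker prefix used by the print() calls; matches the example output.
var MARKER = "MAPDIAG:";

// Parse one log line. Returns the diagnostic object, or null when the line
// does not start with the marker or its payload is not valid JSON.
function parseDiagLine(line) {
    if (line.indexOf(MARKER) !== 0) {
        return null;
    }
    try {
        return JSON.parse(line.slice(MARKER.length));
    } catch (e) {
        return null;
    }
}

// Parse a block of log text, keeping only the diagnostic documents.
function parseDiagLog(text) {
    var docs = [];
    var lines = text.split("\n");
    for (var i = 0; i < lines.length; i++) {
        var doc = parseDiagLine(lines[i]);
        if (doc !== null) {
            docs.push(doc);
        }
    }
    return docs;
}

// Made-up sample: one unrelated line plus two lines in the format above.
var sample = [
    "some unrelated log line",
    'MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "dog" }',
    'MAPDIAG:{ "run" : "20120824", "phase" : "reduce", "key" : "cat", "total" : 3 }'
].join("\n");

var docs = parseDiagLog(sample);
```

Each element of docs could then be inserted into a reporting collection from a normal client connection (for example db.mrdiag.insert(doc) in the mongo shell, where mrdiag is a hypothetical collection name) and queried by run, phase or key.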
Licensed under: CC-BY-SA with attribution