Question

The output collection of an elaborate aggregation pipeline over a very large dataset contains documents similar to the following:

{
    "_id" : {
        "clienta" : NumberLong(460011766),
        "clientb" : NumberLong(2886729962)
    },
    "states" : [
        [
            "fixed",
            "fixed.rotated",
            "fixed.rotated.off"
        ]
    ],
    "VBPP" : [
        244,
        182,
        184,
        11,
        299
    ],
    "PPF" : 72.4
}

The intuitive, albeit slow, way to replace these fields with values computed from them (the length of one array and the variance of another) using PyMongo, before converting the records to arrays, is as follows:

import numpy as np

# `db` is an existing PyMongo database handle.
cursor = db.clientAgg.find({}, {'_id': 0,
                                'states': 1,
                                'VBPP': 1,
                                'PPF': 1})

# Pull every document into memory.
records_list = list(cursor)

# Replace each array with a scalar summary of it.
for record in records_list:
    record['states'] = len(record['states'])
    record['VBPP'] = np.var(record['VBPP'])

I have written various forms of this basic flow to optimize for speed, but bringing 500k dictionaries into memory to modify them before converting them to arrays for a machine learning estimator is costly. I have tried, without success, to update the records directly through a cursor using variants of the following:

cursor = db.clientAgg.find().skip(0).limit(50000)

def iter_records():
    for item in cursor:
        yield item

for x in iter_records():
    # This only changes the local dict, not the stored document.
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id':x['_id']},{'$set':{'x.VBPS': somefunction as above }},upsert=False, multi=True)

I also tried, unsuccessfully, to use MongoDB's own operators, since the variance is simply the average of the squared deviations from the mean: subtract the mean from each element of the array, square the result, then average the results.
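To make that arithmetic concrete, here is a minimal Python sketch of the calculation on the sample VBPP array, using only the standard library. It computes the population variance (dividing by n), which is also what np.var returns with its default ddof=0. The helper name population_variance is hypothetical.

```python
from statistics import pvariance

def population_variance(values):
    """Mean of squared deviations from the mean (divides by n, not n - 1)."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / len(squared_deviations)

vbpp = [244, 182, 184, 11, 299]  # sample VBPP array from the document above
result = population_variance(vbpp)

# Cross-check against the standard library's population variance.
print(result, pvariance(vbpp))
```

Both values should agree, which confirms that the three steps described above are all the calculation needs.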

If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.

Thank you for your time

Solution

MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.

See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.

In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip, which will improve your performance.
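As a rough sketch of what such batching could look like from Python, assuming a PyMongo version that provides UpdateOne and Collection.bulk_write (PyMongo 3.0 and later, which had not been released when this answer was written). The helper names build_updates and apply_updates are hypothetical, and the variance is computed with the standard library's pvariance, which matches np.var at its default settings.

```python
from statistics import pvariance

def build_updates(docs):
    """Turn documents into (filter, update) pairs carrying the computed values."""
    pairs = []
    for doc in docs:
        new_values = {
            "states": len(doc["states"]),
            "VBPP": pvariance(doc["VBPP"]),
        }
        pairs.append(({"_id": doc["_id"]}, {"$set": new_values}))
    return pairs

def apply_updates(collection, pairs, batch_size=1000):
    """Send the updates in batches, one bulk_write call per batch."""
    # Imported here so build_updates can be used without PyMongo installed.
    from pymongo import UpdateOne
    for start in range(0, len(pairs), batch_size):
        batch = [UpdateOne(f, u) for f, u in pairs[start:start + batch_size]]
        collection.bulk_write(batch, ordered=False)

# Example with an in-memory document shaped like the sample in the question.
sample = [{"_id": {"clienta": 460011766, "clientb": 2886729962},
           "states": [["fixed", "fixed.rotated", "fixed.rotated.off"]],
           "VBPP": [244, 182, 184, 11, 299]}]
pairs = build_updates(sample)
print(pairs[0])
```

With a real collection, something like apply_updates(db.clientAgg, build_updates(db.clientAgg.find())) would write the values back. The calculation still happens in Python, but the writes no longer cost one round-trip each.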

Addendum:

I have two other ideas. First, run some JavaScript server-side. E.g., to set all documents' b fields to 2 * a:

db.eval(function() {
    var collection = db.test_collection;
    collection.find().forEach(function(doc) {
        var b = 2 * doc.a;
        collection.update({_id: doc._id}, {$set: {b: b}});
    });
});

The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:

db.test_collection.aggregate({
    $project: {
        a: '$a',
        b: {$multiply: [2, '$a']}
    }
}, {
    $out: 'test_collection2'
});

Note that $project must explicitly include all the fields you want; only _id is included by default.

For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
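Applied to the collection in the question, the same approach might look like the following from PyMongo. This is a sketch under assumptions: it relies on the $size, $stdDevPop, and $pow operators, the latter two of which exist only in MongoDB 3.2 and later, and the target name clientAgg2 and the helper names are hypothetical. The variance is obtained by squaring the population standard deviation, so it can differ from np.var in the last few floating-point digits.

```python
def build_pipeline(target):
    """Pipeline that replaces each array with a scalar and writes to `target`."""
    return [
        {"$project": {
            "states": {"$size": "$states"},                  # array length
            "VBPP": {"$pow": [{"$stdDevPop": "$VBPP"}, 2]},  # population variance
            "PPF": 1,                                        # carried over unchanged
        }},
        {"$out": target},
    ]

def transform(collection, target="clientAgg2"):
    """Run the pipeline server-side; `collection` is a PyMongo Collection."""
    # Drain the cursor so the command completes; $out itself returns no documents.
    list(collection.aggregate(build_pipeline(target)))

pipeline = build_pipeline("clientAgg2")
print(pipeline)
```

Calling transform(db.clientAgg) would then leave the computed fields in clientAgg2, ready to be loaded or renamed as described above.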

My final thought on this is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.

Licensed under: CC-BY-SA with attribution