Question

I have a collection of ~5 million documents. They look like the following:

 fr to weight

There are 6000 possible values for fr and to. I wrote the following code to sum the weights over every possible (fr, to) pair, e.g.

1 1 sum(all documents with fr=1 to=1)
1 2 sum(all documents with fr=1 to=2)
...

I used PyMongo to write the script, where cells is an array of all possible fr/to values, and calls is the collection holding my ~5 million documents.

for _from in cells:  # pair up each cell with every other cell
    for _to in cells:
        s = calls.aggregate([
            {
                '$match': {
                    "fr": _from,
                    "to": _to
                }
            },
            {
                '$group': {'_id': None, 'sum': {'$sum': '$weight'}}
            }
        ])
        if s['result']:  # PyMongo 2.x: aggregate returns {'result': [...], 'ok': 1}
            fr_to_sum = s['result'][0]['sum']
            _id = entropy.insert({
                'to': _to,
                'fr': _from,
                'sum': fr_to_sum
            })

I am using a single machine.

Would another database be better, e.g. some SQL database?

UPDATE

The fields fr and to contain integers in the range 1-100000, and weight is a small float. I am generating all possible combinations of fr and to and summing the weights per combination, because there may be duplicate documents in the database that share the same fr and to (e.g. fr=1 to=2 weight=0.004 and fr=1 to=2 weight=0.01, in which case I would like my entropy table to hold fr=1 to=2 weight=0.014).

And just as a sample:

 >> db.week_calls.count({'fr':10, 'to':102})
 4

After calling db.week_calls.getIndexes() I got:

{
    "v" : 1,
    "key" : {
        "_id" : 1
    },
    "ns" : "db.week_calls",
    "name" : "_id_"
}

This seems odd, since I created an index on the whole collection with db.calls.ensure_index([("fr",1), ("to",1)]).
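To double-check from PyMongo, a minimal sketch using the standard create_index and index_information calls (assuming calls is the Collection object from the script above) would be:

# create the compound index on the collection the aggregation actually runs against
calls.create_index([("fr", 1), ("to", 1)])

# index_information() lists every index on that exact collection
print(calls.index_information())

Note that getIndexes() above was run on week_calls, while ensure_index targeted calls, so it is worth confirming that both names refer to the same collection.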


Solution

The basic form of what you are trying to do, without iterating, is this:

db.collection.aggregate([
    { "$group": {
        "_id": {
            "fr": "$fr",
            "to": "$to"
        },
        "sum": { "$sum": "$weight" }
    }}
])
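Translated back to PyMongo, a minimal sketch in the PyMongo 2.x style used in the question (where aggregate returns a dict with a 'result' list) might look like:

s = calls.aggregate([
    {"$group": {
        "_id": {"fr": "$fr", "to": "$to"},
        "sum": {"$sum": "$weight"}
    }}
])
for doc in s["result"]:  # one document per distinct (fr, to) pair
    print(doc["_id"]["fr"], doc["_id"]["to"], doc["sum"])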

Now that may well blow up on you (before MongoDB 2.6, the aggregation result had to fit in a single 16MB BSON document), so you might need to do some looping and "range" the values, as in:

db.collection.aggregate([
    { "$match": {
        "fr": { "$gt": 0, "$lte": 10000 },
        "to": { "$gt": 0, "$lte": 10000 }
    }},
    { "$group": {
        "_id": {
            "fr": "$fr",
            "to": "$to"
        },
        "sum": { "$sum": "$weight" }
    }}
])

That handles one segment of the documents; then you get the next segment:

db.collection.aggregate([
    { "$match": {
        "fr": { "$gt": 10000, "$lte": 20000 },
        "to": { "$gt": 10000, "$lte": 20000 }
    }},
    { "$group": {
        "_id": {
            "fr": "$fr",
            "to": "$to"
        },
        "sum": { "$sum": "$weight" }
    }}
])
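If you go that route, the whole job could be driven from PyMongo along these lines. This is a sketch only, written against the PyMongo 2.x API from the question, with a hypothetical band size of 10000; note the fr and to bands are iterated independently, so pairs whose fr and to fall in different bands are not missed:

step = 10000
top = 100000  # per the update, fr/to are integers in 1-100000

for fr_lo in range(0, top, step):
    for to_lo in range(0, top, step):
        s = calls.aggregate([
            {"$match": {
                "fr": {"$gt": fr_lo, "$lte": fr_lo + step},
                "to": {"$gt": to_lo, "$lte": to_lo + step}
            }},
            {"$group": {
                "_id": {"fr": "$fr", "to": "$to"},
                "sum": {"$sum": "$weight"}
            }}
        ])
        for doc in s["result"]:  # PyMongo 2.x: aggregate returns {'result': [...]}
            entropy.insert({
                "fr": doc["_id"]["fr"],
                "to": doc["_id"]["to"],
                "sum": doc["sum"]
            })

That is 100 aggregate calls in total, versus the 36 million (6000 x 6000) issued by the original nested loop.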

Use whichever approach works out best for you.

The general point is to let $group do the work for you, and to break the job into manageable chunks when the results are too big for one fetch.

In future releases, aggregate will return a cursor (this landed in MongoDB 2.6).
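For the record, with MongoDB 2.6+ and PyMongo 3.x, aggregate returns a cursor you can iterate directly, which removes the single-document result limit entirely. A sketch, assuming the same calls and entropy collections:

# PyMongo 3.x: aggregate returns a CommandCursor, so results stream back
cursor = calls.aggregate(
    [{"$group": {
        "_id": {"fr": "$fr", "to": "$to"},
        "sum": {"$sum": "$weight"}
    }}],
    allowDiskUse=True  # let the server spill the $group stage to disk if needed
)
for doc in cursor:
    entropy.insert_one({
        "fr": doc["_id"]["fr"],
        "to": doc["_id"]["to"],
        "sum": doc["sum"]
    })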

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow