Question

I am using MongoDB and have (unintentionally) ended up with two collections.

The first Collection (sample) has 100 million records (Tweets) with the following structure:

{
    "_id" : ObjectId("515af34297c2f607b822a54b"),
    "text" : "bla bla ",
    "id" : NumberLong("314965680476803072"),
    "user" : {
        "screen_name" : "TheFroooggie",
        "time_zone" : "Amsterdam"
    }
}

The second collection (users) has 30 million records, one per unique user in the tweet collection, and looks like this:

{ "_id" : "000000_n", "target" : 1, "value" : { "count" : 5 } }

Here the _id in the users collection is the user.screen_name from the tweets collection, target is the user's status (spammer or not), and value.count is the number of times the user appears in the first collection (sample), i.e. the number of captured tweets.

Now I'd like to make the following query:

I'd like to return all the documents from the sample collection (tweets) where the user has the target value = 1

In other words, I want to return all the tweets of all the spammers for example.


Solution

As you receive the tweets, you could upsert them into a collection, using the author information as the key in the "query" document portion of the update. The update document could use the $addToSet operator to put the tweet into a tweets array. You'll end up with a collection that holds one document per author, each with an array of that author's tweets. You can then run your spammer classification for each author and have their associated tweets at hand.

So, you would end up doing something like this:

db.samples.update({"author":"joe"},{$addToSet:{"tweets":{"tweet_id":2}}},{upsert:true})

This approach has a likely drawback: each author document will grow past its initially allocated size on disk, which means it has to be moved and expanded on disk. You would likely incur some penalty for index updates as well.
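To connect that upsert to the tweet structure shown in the question, here is a minimal sketch. The helper names authorUpsert and storeTweet are hypothetical, db is assumed to be the mongo shell database handle, and copying text into the array entry is only an assumption about what you would want to keep per tweet.

```javascript
// Sketch only: helper names are hypothetical, `db` is the mongo shell handle.

// Turn one incoming tweet (shaped like the question's sample document)
// into the three arguments of the upsert shown above.
function authorUpsert(tweet) {
  return {
    // One document per author, keyed by screen name.
    query: { author: tweet.user.screen_name },
    // $addToSet appends the entry only if an identical one is not already present.
    update: { $addToSet: { tweets: { tweet_id: tweet.id, text: tweet.text } } },
    // Create the author document when this is the author's first tweet.
    options: { upsert: true }
  };
}

// Apply the upsert for a single tweet.
function storeTweet(db, tweet) {
  var op = authorUpsert(tweet);
  db.samples.update(op.query, op.update, op.options);
}
```

With this layout, fetching a spammer's tweets becomes a single read of that author's document once the classification has been stored on it.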

You could also take an approach of storing a spam rating with each tweet document and later pulling those based on user id.
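One way this could look with the two existing collections is sketched below: copy each user's target onto their tweets, then query the tweets directly. The field name user.target and the helper names are assumptions made for illustration, and db is assumed to be the mongo shell database handle.

```javascript
// Sketch only: "user.target" and the helper names are hypothetical choices.

// Build the arguments of a multi-document update that stamps one user's
// status on every tweet that user wrote.
function spamFlagUpdate(user) {
  return {
    filter: { "user.screen_name": user._id },
    update: { $set: { "user.target": user.target } },
    options: { multi: true } // without multi, only the first match is updated
  };
}

// Walk the users collection and push each status down to the tweets.
function copyTargetsToTweets(db) {
  db.users.find({}, { target: 1 }).forEach(function (user) {
    var op = spamFlagUpdate(user);
    db.sample.update(op.filter, op.update, op.options);
  });
}

// Afterwards, all spammer tweets come from a single query:
//   db.sample.find({ "user.target": 1 })
```

This is a one-time pass over 30 million users, so an index on user.screen_name in the sample collection would probably be needed to keep each update from scanning all the tweets.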

As others have pointed out, there is nothing wrong with setting up the appropriate indexes and using a cursor to loop through your users pulling their tweets.
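A minimal sketch of that cursor loop follows, assuming db is the mongo shell database handle and using the collection and field names from the question. The helper names and the idea of grouping screen names into $in batches (rather than issuing one query per user) are illustrative assumptions, not something the original answers prescribe.

```javascript
// Sketch only: helper names and the batching scheme are hypothetical.

// Filter that selects the tweets written by any of the given screen names.
function tweetsByAuthorsFilter(screenNames) {
  return { "user.screen_name": { $in: screenNames } };
}

// Iterate over the users flagged with target = 1, collect their screen
// names in batches, and hand every matching tweet to `onTweet`.
function forEachSpammerTweet(db, batchSize, onTweet) {
  var batch = [];

  // Query the tweets for the names collected so far, then start a new batch.
  function flush() {
    if (batch.length === 0) return;
    db.sample.find(tweetsByAuthorsFilter(batch)).forEach(onTweet);
    batch = [];
  }

  db.users.find({ target: 1 }, { _id: 1 }).forEach(function (user) {
    batch.push(user._id); // _id in users is the screen name
    if (batch.length >= batchSize) flush();
  });
  flush(); // last, possibly partial, batch
}

// Possible use in the shell, after indexing the lookup field:
//   db.sample.createIndex({ "user.screen_name": 1 })
//   forEachSpammerTweet(db, 500, printjson)
```

The batch size of 500 in the usage comment is arbitrary; it only trades the number of round trips against the size of each $in list.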

The approach you choose should be based on your intended access pattern. It sounds like you are in a good place where you can experiment with several different possible solutions.

Licensed under: CC-BY-SA with attribution