Question

I have 5 million entries in a MongoDB collection that look like this:

{
    "_id" : ObjectId("525facace4b0c1f5e78753ea"),
    "productId" : null,
    "name" : "example name",
    "time" : ISODate("2013-10-17T09:23:56.131Z"),
    "type" : "hover",
    "url" : "www.example.com",
    "userAgent" : "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 openssl/0.9.8r zlib/1.2.5"
}

I need to add a new field called device to every entry, with the value either desktop or mobile. The goal is to end up with entries like this:

{
    "_id" : ObjectId("525facace4b0c1f5e78753ea"),
    "productId" : null,
    "device" : "desktop",
    "name" : "example name",
    "time" : ISODate("2013-10-17T09:23:56.131Z"),
    "type" : "hover",
    "url" : "www.example.com",
    "userAgent" : "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 openssl/0.9.8r zlib/1.2.5"
}

I am working with the MongoDB Java driver and so far I am doing the following:

DBObject query = new BasicDBObject();
query.put("device", new BasicDBObject("$exists", false)); //some entries already have such field
DBCursor cursor = resource.find(query);
cursor.addOption(Bytes.QUERYOPTION_NOTIMEOUT);
Iterator<DBObject> iterator = cursor.iterator();
int size = cursor.count();

Then I iterate with while(iterator.hasNext()), run an if-else against a huge regular expression I found online, and depending on the result of that if-else I execute something like:

BasicDBObject newDocument = new BasicDBObject("$set", new BasicDBObject().append("device", "desktop")); // or "mobile", depending on the if-else
BasicDBObject searchQuery = new BasicDBObject("_id", id);
resource.getCollection(DatabaseConfiguration.WEBSITE_STATISTICS).update(searchQuery, newDocument);
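
For illustration, the if-else check can be sketched as a small helper. This is only a minimal sketch, not the regular expression actually used: the class name DeviceClassifier, the method classify, the short pattern, and the choice to treat a missing user agent as desktop are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a User-Agent string to "mobile" or "desktop". */
final class DeviceClassifier {

    // Deliberately small, illustrative pattern; NOT the large regex mentioned above.
    private static final Pattern MOBILE_HINTS = Pattern.compile(
            "mobi|android|iphone|ipod|ipad|blackberry|windows phone|opera mini");

    private DeviceClassifier() { }

    static String classify(String userAgent) {
        if (userAgent == null) {
            return "desktop"; // assumption: entries without a user agent default to desktop
        }
        // Lower-case once so the pattern can stay case-sensitive and simple.
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return MOBILE_HINTS.matcher(ua).find() ? "mobile" : "desktop";
    }

    public static void main(String[] args) {
        System.out.println(classify(
                "curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 openssl/0.9.8r zlib/1.2.5"));
        System.out.println(classify(
                "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) Mobile/11A465 Safari/9537.53"));
    }
}
```

The returned string would then be the value passed to the $set document in the update above.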

However, because of the large amount of data (more than 5 million entries), this takes forever.

Is there a way of doing this with map reduce? So far I've only used MapReduce for counting, so I am not sure if it can be used for other matters.

Solution

I found a way, although it was somewhat tricky because of all the configuration involved.

After installing Hadoop following this link, I did the following:

  1. Created a class called MongoUpdate with a run method, where I set up the configuration (such as the input and output URIs), create the job, and configure its settings. Among those settings is job.setMapperClass(MongoMapper.class).

  2. Created MongoMapper, whose map method receives a BSONObject. There I perform the if-else check and, at the very end, do:

    Text id = new Text(pValue.get("_id").toString());
    pContext.write(id, new BSONWritable(pValue));

  3. Created a Main class whose main method simply instantiates MongoUpdate and calls its run method.

  4. Exported the jar with all the libraries and ran it from the terminal: hadoop jar NameOfTheJar.jar
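
The per-document work done in the map method of step 2 can be sketched without Hadoop on the classpath. This is a rough illustration only: java.util.Map stands in for BSONObject, a BiConsumer stands in for pContext.write, and the class name MapStepSketch and the placeholder "Mobile" check are assumptions, not the code actually used in the job.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Stand-in for the map step; java.util.Map plays the role of BSONObject. */
final class MapStepSketch {

    private MapStepSketch() { }

    /**
     * Adds the "device" field when it is absent and emits (id as text, updated document),
     * mirroring the pContext.write(id, value) call in step 2.
     */
    static void map(Map<String, Object> doc, BiConsumer<String, Map<String, Object>> emit) {
        Map<String, Object> updated = new LinkedHashMap<>(doc); // copy, so the input is untouched
        if (!updated.containsKey("device")) {
            Object ua = updated.get("userAgent");
            // Placeholder rule (an assumption): the real job applies the full if-else here.
            boolean mobile = ua != null && ua.toString().contains("Mobile");
            updated.put("device", mobile ? "mobile" : "desktop");
        }
        // The key is the string form of _id, as in new Text(pValue.get("_id").toString()).
        emit.accept(String.valueOf(updated.get("_id")), updated);
    }
}
```

In the actual job, the same logic would sit inside MongoMapper.map, with the emitted pair written through the Hadoop context instead of the callback.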

Licensed under: CC-BY-SA with attribution