Question

I am working on a new project and I have to develop an inverted index that can be stored in a file database (such as CouchDB). I am coding in Ruby 1.8.7.

This is the format of the inverted index:

{
    "en": {
        "#linux": {
            "re": 144,
            "patch": 142,
            "1": 55,
            "to": 53
        },
        "#something": {
            "word": 20
        }
    },
    "fr": {},
    "es": {}
}

I want a way, using something like CouchDB, to create the entries through a series of checks like the following:

  • If the second-level hash key (e.g. #linux) has not been created, then create it
  • If the third-level hash key (e.g. patch) has not been created, then create it and set its value to 1
  • Otherwise, increase the count (the rightmost values) by one every time the same word appears again under ['en']['#linux'], or whatever the variables happen to be.

I've solved the problem fine using basic hashes, but holding them in memory isn't going to work well when I set my script to go through about 1TB or more of text.
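For reference, the in-memory version of the checks above might look roughly like the following. This is only a minimal sketch of the basic-hash approach, not the actual script; the names `new_index` and `record` are hypothetical.

```ruby
# In-memory inverted index: language => tag => word => count.
# The nested default blocks create missing language and tag hashes
# on first access; missing word counts start at 0.
def new_index
  Hash.new do |langs, lang|
    langs[lang] = Hash.new do |tags, tag|
      tags[tag] = Hash.new(0)
    end
  end
end

# Hypothetical helper: count one occurrence of word under lang/tag.
def record(index, lang, tag, word)
  index[lang][tag][word] += 1
end

index = new_index
record(index, "en", "#linux", "patch")
record(index, "en", "#linux", "patch")
record(index, "en", "#linux", "re")
```

Everything lives in one Ruby process here, which is exactly what stops scaling once the input reaches the terabyte range.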

Selected Answer

The selected answer works perfectly for this. The only difference in Ruby is a few slight changes to the syntax; it works as follows:

@db.collection.update({"_id" => lang}, {"$inc" => {"#{tag}.#{word}" => 1}}, { :upsert => true })
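Since `$inc` accepts several fields in one update document, the counts for a chunk of text could be accumulated locally and sent in a single call per language instead of one call per word. The following is a minimal sketch under that assumption; `batched_inc_args` is a hypothetical helper that only builds the arguments and does not talk to the database.

```ruby
# Build the arguments for one upserted update that increments several
# word counts for a single language document in one round trip.
# counts maps [tag, word] pairs to the number of occurrences seen.
def batched_inc_args(lang, counts)
  inc = {}
  counts.each do |pair, n|
    tag, word = pair
    inc["#{tag}.#{word}"] = n
  end
  [{ "_id" => lang }, { "$inc" => inc }, { :upsert => true }]
end

# Tally a small batch of (tag, word) occurrences in memory first.
counts = Hash.new(0)
[["#linux", "patch"], ["#linux", "re"], ["#linux", "patch"]].each do |pair|
  counts[pair] += 1
end

selector, update, options = batched_inc_args("en", counts)
# These could then be passed on as:
#   @db.collection.update(selector, update, options)
```

The batch size is a trade-off between memory use and the number of round trips, and would need tuning against the real data.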

Solution

CouchDB is not going to be the best tool for this job. In particular, it is not suited to frequent updates that don't grow the document (your increments). Each update writes a new revision of the document to disk, so your database is going to grow very large and the disk is going to be busy.

I would recommend looking at MongoDB. It has fast in-place updates, indexes, and a richer query language. Example:

// The third argument (true) enables upsert: create the document if it is missing.
db.collection.update({_id: 'en'},
                     {$inc: {'linux.re': 1}},
                     true);

This finds the document with _id 'en' and increments its ['linux']['re'] field. If the document isn't found, or ['linux'] doesn't exist, or ['linux']['re'] doesn't exist, it is created automatically. This is one of my favourite features of this DB.
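To make those semantics concrete, here is a plain-Ruby imitation of what an upserted `$inc` on a dotted path does to a document. It is only a sketch and does not use MongoDB at all: `store` is an ordinary Hash standing in for a collection, and `inc_upsert` is a hypothetical name.

```ruby
# Plain-Ruby imitation of an upserted $inc; no MongoDB involved.
# store is a Hash standing in for a collection, keyed by _id.
def inc_upsert(store, id, dotted_path, amount)
  # Upsert: create the document if it is missing.
  doc = (store[id] ||= { "_id" => id })
  keys = dotted_path.split(".")
  leaf = keys.pop
  # Walk down the path, creating intermediate sub-documents as needed.
  parent = keys.inject(doc) { |node, key| node[key] ||= {} }
  # A missing field is treated as 0 before incrementing.
  parent[leaf] = (parent[leaf] || 0) + amount
  doc
end

store = {}
inc_upsert(store, "en", "linux.re", 1)
inc_upsert(store, "en", "linux.re", 1)
inc_upsert(store, "en", "linux.patch", 1)
```

Because the dot separates path segments, a tag or word that itself contains a dot would be read as an extra level of nesting, so such keys would need to be escaped or replaced before being used this way.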

Licensed under: CC-BY-SA with attribution