Question


I am new to CouchDB and have been stuck on a small (probably) problem with a CouchDB map-reduce function. I could not find any relevant information online, so I am asking for help here.
Basically the scenario is this: I am using a map function to count how many times a certain word appears in a certain doc, and the emit is simply:

emit(word,1)

This way, if I need the sum for each word, to figure out how many times each word appears across all the docs, I can simply write the reduce function like this:

function(key, values, rereduce)
{
    return sum(values);
}
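
Put together, the map and reduce pair could be sketched as follows. The `text` field on the doc and the `emit`/`sum` stand-ins are assumptions added so the sketch runs outside CouchDB; inside a view, CouchDB provides both helpers itself.

```javascript
// Stand-ins for CouchDB's emit() and sum() so the sketch runs outside CouchDB.
var emitted = [];
function emit(key, value) { emitted.push([key, value]); }
function sum(values) {
  return values.reduce(function (acc, v) { return acc + v; }, 0);
}

// Map: emit (word, 1) for every word in the assumed `text` field.
function map(doc) {
  if (typeof doc.text !== 'string') return;
  doc.text.toLowerCase().split(/\W+/).forEach(function (word) {
    if (word) emit(word, 1);
  });
}

// Reduce: total occurrences per word.
function reduce(key, values, rereduce) {
  return sum(values);
}

// Run the map over one sample doc, then reduce the values emitted for "couch".
map({ text: 'Couch couch db' });
var couchValues = emitted
  .filter(function (entry) { return entry[0] === 'couch'; })
  .map(function (entry) { return entry[1]; });
var total = reduce(null, couchValues, false);
```

CouchDB also ships a built-in `_sum` reduce that can be used in place of a hand-written summing function.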

But what I really need is to return sum(values) only when it is larger than 3000 (to find the words that appear more than 3000 times across all the docs). So I tried this:

function(key, values, rereduce)
{
    if(sum(values)>3000)
    return sum(values);
}

But this way, all the words that appear fewer than 3000 times are still returned, just with a value of null. I know this is because a reduce function must return something, so when the 'if' condition does not match, it returns null instead. Can anyone give me a useful suggestion on how to return sum(values) only when it meets a certain condition?


Solution

Likely Impossible

I do not think what you are trying to do is possible. All the reduce function does is aggregate (sum) the word counts across multiple documents with the same key; it will always return something for every key you have generated in your map function.
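
Since the view returns a row for every key, one way around this is to keep the plain summing reduce and drop the infrequent words after querying. Below is a minimal sketch of that client-side filter, assuming the view was queried with `group=true`; the sample rows and the `frequentWords` helper are made up for illustration.

```javascript
// Shape of a grouped CouchDB view response; the rows themselves are sample data.
var response = {
  rows: [
    { key: 'couch', value: 4200 },
    { key: 'db', value: 120 },
    { key: 'the', value: 9001 }
  ]
};

// Hypothetical helper: keep only the rows whose count exceeds the threshold.
function frequentWords(rows, threshold) {
  return rows.filter(function (row) { return row.value > threshold; });
}

var frequent = frequentWords(response.rows, 3000);
```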

Consider reduce/rereduce

Even if you can accept output containing 'null', you have a potential bug. Have a read of: https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Reduce_vs_rereduce

Assuming you have a few thousand emits for a key, subsets of these emits will likely be reduced in smaller segments, and the results for all the segments will then be revisited in a rereduce call.

Unless these segments (the size of which is managed by CouchDB) contain more than 3000 elements, your reduce will likely generate a lot of 'null's and then rereduce them. If anything, your code should read:

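    function(key, values, rereduce)
    {
        var total = sum(values);
        // First pass (rereduce false): always return the segment's partial sum.
        // Rereduce pass: apply the threshold to the combined partial sums,
        // assuming this pass sees every segment for the key.
        if(rereduce && total<3000){return 0;}
        return total;
    }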
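
The difference between the two reduce functions can be checked outside CouchDB with a small simulation. This is only a sketch: the `simulate` driver, the fixed chunk size of 1000, and the single rereduce pass are assumptions, since CouchDB chooses the segment sizes itself.

```javascript
// Stand-in for CouchDB's built-in sum() helper.
function sum(values) {
  return values.reduce(function (acc, v) { return acc + v; }, 0);
}

// The thresholded reduce from the question.
function naiveReduce(key, values, rereduce) {
  if (sum(values) > 3000) return sum(values);
}

// The variant that applies the threshold only on rereduce.
function rereduceAwareReduce(key, values, rereduce) {
  if (rereduce && sum(values) < 3000) { return 0; }
  return sum(values);
}

// Hypothetical driver: reduce `count` ones in chunks, then rereduce the partials once.
function simulate(reduceFn, count, chunkSize) {
  var partials = [];
  for (var start = 0; start < count; start += chunkSize) {
    var chunk = [];
    for (var i = start; i < Math.min(start + chunkSize, count); i++) chunk.push(1);
    var partial = reduceFn(null, chunk, false);
    // JSON has no undefined, so a missing return value is stored as null.
    partials.push(partial === undefined ? null : partial);
  }
  var result = reduceFn(null, partials, true);
  return result === undefined ? null : result;
}

var naiveResult = simulate(naiveReduce, 5000, 1000);         // every partial is null
var awareResult = simulate(rereduceAwareReduce, 5000, 1000); // partial sums survive
var rareResult = simulate(rereduceAwareReduce, 2000, 1000);  // total stays below 3000
```

With these assumed chunk sizes, the question's version loses a word that occurs 5000 times because no single chunk crosses the threshold, while the rereduce-aware version keeps the full count.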

Alternative Setup

I assume you simply have too many words in your documents to be able to query all of them. I'd test whether you can use part of the word as the key. For instance, if you have the words "couch" and "couchdb", you'd emit them as part of a document under the key "co" or "cou" and the like:

    { "couch" :  1, "couchdb" : 15 } 

You'd then have a limited number of keys that you could parse, applying the 3000 rule on the rereduce. You are, however, at risk of falling foul of the following rule of thumb on the size of values after the reduce call:

https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#reduced_value_sizes
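
The prefix-key idea can be sketched roughly as below. All three helper names are hypothetical, and the threshold is applied here as a separate step on the merged counts rather than inside a CouchDB reduce.

```javascript
// Hypothetical helper: the first `length` characters of a word form the view key.
function prefixKey(word, length) {
  return word.substring(0, length);
}

// Merge several { word: count } objects into one, adding up the counts per word.
function mergeCounts(objects) {
  var merged = {};
  objects.forEach(function (counts) {
    Object.keys(counts).forEach(function (word) {
      var seen = Object.prototype.hasOwnProperty.call(merged, word);
      merged[word] = (seen ? merged[word] : 0) + counts[word];
    });
  });
  return merged;
}

// Keep only the words whose total exceeds the threshold.
function overThreshold(counts, threshold) {
  var kept = {};
  Object.keys(counts).forEach(function (word) {
    if (counts[word] > threshold) kept[word] = counts[word];
  });
  return kept;
}

var key = prefixKey('couchdb', 3);
var merged = mergeCounts([{ couch: 1, couchdb: 2000 }, { couchdb: 1500 }]);
var frequent = overThreshold(merged, 3000);
```

Because each reduced value is an object holding one entry per word, it grows with the number of distinct words sharing a prefix, which is where the size rule of thumb linked above comes in.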

Disclaimer

For this type of full-text search problem, you may want to look at couchdb-lucene. (I have not used it, so I don't know whether it would solve your issue.)

Licensed under: CC-BY-SA with attribution