Is it possible to prevent fetching of the remote design document in CouchDB?

https://stackoverflow.com/questions/22311868

12-06-2023

Question

Update

As @AkshatJiwanSharma suggested, I tried a few things while replicating locally. Very instructive! I have renamed the question because the problem is not that the design document gets replicated (in fact it isn't). Rather, it is fetched via an HTTP GET as part of the initial replication "negotiation" phase.

I've moved the original question to the bottom to make the new question clearer. The new question is:

It seems inefficient (particularly in the case of CouchApps) to fetch the entire design document - i.e. the entire remote app - when initiating a replication with a remote source. Can this be avoided?

It is particularly problematic in our case, on slow, high-latency links (less than 7.2 Kbps), with relatively large design documents (3 MB).
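To put rough numbers on this, here is a back-of-the-envelope estimate. It assumes that "3MB" means 3,000,000 bytes and "7.2Kbps" means 7,200 bits per second, and it ignores protocol overhead and retransmissions. The function name is made up for illustration.

```javascript
// Rough transfer-time estimate for fetching a design document over a slow link.
// Assumptions (not from CouchDB): 1 MB = 1,000,000 bytes, 1 Kbps = 1,000 bits/s,
// no HTTP/TCP overhead, no retransmissions.
function transferSeconds(sizeBytes, linkBitsPerSecond) {
  return (sizeBytes * 8) / linkBitsPerSecond;
}

const designDocBytes = 3000000; // "3MB" design document
const linkBps = 7200;           // "7.2Kbps" link, the best case described

const seconds = transferSeconds(designDocBytes, linkBps);
console.log(`${Math.round(seconds)} s, about ${Math.round(seconds / 60)} minutes`);
// prints "3333 s, about 56 minutes"
```

Under these assumptions a single GET of the design document takes close to an hour even in the best case, which is far longer than the connection_timeout of 5000 in the replicator configuration listed further down (if that value is in milliseconds).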

Remote Target

I first tried a "remote" target by setting the replication target to http://127.0.0.1:5984/emr_replica.

[Fri, 08 Aug 2014 08:36:20 GMT] [info] [<0.18947.7>] Document `88fa1b1a1315d27ded663466c6003578` triggered replication `e8e66a554d198b88b6263a572a072fd3+continuous`
[Fri, 08 Aug 2014 08:36:20 GMT] [info] [<0.18946.7>] starting new replication `e8e66a554d198b88b6263a572a072fd3+continuous` at <0.18947.7> (`emr_demo` -> `http://127.0.0.1:5984/emr_replica/`)
[Fri, 08 Aug 2014 08:36:20 GMT] [info] [<0.18928.7>] 127.0.0.1 - - POST /emr_replica/_revs_diff 200
[Fri, 08 Aug 2014 08:36:20 GMT] [info] [<0.18915.7>] y.y.y.y - - GET /_utils/_sidebar.html 200
[Fri, 08 Aug 2014 08:36:20 GMT] [info] [<0.18916.7>] y.y.y.y - - GET /_replicator/88fa1b1a1315d27ded663466c6003578?revs_info=true 200

In that case the design document doesn't seem to be fetched.

Remote Source

Then I set the source as "remote", like this:

{
   "_id": "88fa1b1a1315d27ded663466c6003a4a",
   "_rev": "3-b6408e98acafe729da0153c35d9df113",
   "source": "http://127.0.0.1:5984/emr_demo",
   "target": "emr_replica",
   "continuous": true,
   "filter": "emr/user_data",
   "owner": "jun"
}

In this case the server fetches the remote design document before starting the replication (GET /emr_demo/_design/emr 200):

[Fri, 08 Aug 2014 08:42:17 GMT] [info] [<0.19687.7>] Document `88fa1b1a1315d27ded663466c6003a4a` triggered replication `bd8f6288970bca974dba36dbc6e5353b+continuous`
[Fri, 08 Aug 2014 08:42:17 GMT] [info] [<0.19686.7>] starting new replication `bd8f6288970bca974dba36dbc6e5353b+continuous` at <0.19687.7> (`http://127.0.0.1:5984/emr_demo/` -> `emr_replica`)
[Fri, 08 Aug 2014 08:42:17 GMT] [info] [<0.19648.7>] 127.0.0.1 - - HEAD /emr_demo/ 200
[Fri, 08 Aug 2014 08:42:17 GMT] [info] [<0.19648.7>] 127.0.0.1 - - GET /emr_demo/_design/emr 200
[Fri, 08 Aug 2014 08:42:18 GMT] [info] [<0.19656.7>] 127.0.0.1 - - GET /emr_demo/5cc2db69a32a84091b96c244273fda0e?revs=true&open_revs=%5B%221-ef8967557f2e99eb137f963daccddb3f%22%5D&latest=true 200

Further testing shows that the design document is fetched only once. Subsequent replications (including after restarting the server) only fetch the changes, using the appropriate filter:

[Fri, 08 Aug 2014 09:06:36 GMT] [info] [<0.520.0>] Document `88fa1b1a1315d27ded663466c6003a4a` triggered replication `bd8f6288970bca974dba36dbc6e5353b+continuous`
[Fri, 08 Aug 2014 09:06:36 GMT] [info] [<0.519.0>] starting new replication `bd8f6288970bca974dba36dbc6e5353b+continuous` at <0.520.0> (`http://127.0.0.1:5984/emr_demo/` -> `emr_replica`)
[Fri, 08 Aug 2014 09:06:36 GMT] [info] [<0.335.0>] 127.0.0.1 - - GET /emr_demo/_changes?filter=emr%2Fuser_data&feed=continuous&style=all_docs&since=1607&heartbeat=1666 200
[Fri, 08 Aug 2014 09:06:36 GMT] [info] [<0.334.0>] 127.0.0.1 - - GET /emr_demo/5cc2db69a32a84091b96c24427560310?atts_since=%5B%2218-b613d3160bd09c45ac07a5485c9c7bce%22%5D&revs=true&open_revs=%5B%2219-d50438143337a3a0af5ed8ceb75b42f5%22%5D&latest=true 200
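For reference, the filtered _changes request in the log above can be rebuilt as follows. This is only a sketch: the parameter values are copied from that log line, and the helper name changesPath is made up for this example.

```javascript
// Rebuilds the path and query string of the filtered _changes request seen
// in the log. Note that the "/" in the filter name ends up encoded as %2F.
function changesPath(dbUrl, params) {
  const url = new URL(dbUrl.replace(/\/+$/, "") + "/_changes");
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.pathname + url.search;
}

const path = changesPath("http://127.0.0.1:5984/emr_demo", {
  filter: "emr/user_data",
  feed: "continuous",
  style: "all_docs",
  since: 1607,
  heartbeat: 1666
});
console.log(path);
```

The filter parameter still names the design document (emr) in which the filter function lives, even though the design document itself is not fetched again.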

Former question

We're trying to use CouchDB replication over a very poor link (slow, high latency, frequent disconnections). We want to avoid replicating the design document, which is heavy. We have a filter in place, and when using the following curl command the design document doesn't appear, as expected:

curl 'http://x.x.x.x:5984/emr/_changes?filter=emr/user_data'

Our replication document is:

{
   "_id": "e0e38be8cc0b11356dfb03bc8400074d",
   "_rev": "1-d77117f03d63099e1e505b9f9de3371d",
   "source": "http://x.x.x.x:5984/emr",
   "target": "emr",
   "continuous": true,
   "filter": "emr/user_data",
   "create_target": true,
   "owner": "jun"
}

We have deactivated authentication while we're debugging. When using an existing database and removing create_target, the same problem occurs.

The server running the replication (the local target) outputs the following:

[Mon, 10 Mar 2014 21:22:03 GMT] [info] [<0.135.0>] Retrying HEAD request to http://x.x.x.x:5984/emr/ in 0.25 seconds due to error {conn_failed,{error,etimedout}}
[Mon, 10 Mar 2014 21:23:47 GMT] [info] [<0.135.0>] Retrying GET request to http://x.x.x.x:5984/emr/_design/emr in 0.25 seconds due to error req_timedout
[Mon, 10 Mar 2014 21:24:14 GMT] [error] [<0.135.0>] Replicator, request GET to "http://x.x.x.x:5984/emr/_design/emr" failed due to error {error,req_timedout}
[Mon, 10 Mar 2014 21:24:14 GMT] [error] [<0.135.0>] Replication manager, error processing document `e0e38be8cc0b11356dfb03bc8400074d`: Couldn't open document `_design/emr` from source database `http://x.x.x.x:5984/emr/`: {'EXIT',{http_request_failed,"GET","http://x.x.x.x:5984/emr/_design/emr",
                         {error,{error,req_timedout}}}}

When using tcpdump, it's clear that the replication fails because the replication manager attempts to download the heavy design document (http://x.x.x.x:5984/emr/_design/emr).

FYI the replicator's configuration is:

replicator  connection_timeout          5000    
            db                          _replicator 
            http_connections            1   
            max_replication_retry_count 3   
            retries_per_request         1   
            socket_options              [{keepalive, true}, {nodelay, true}]    
            ssl_certificate_max_depth   3   
            verify_ssl_certificates     false   
            worker_batch_size           1   
            worker_processes            1

EDIT: The user_data function (which correctly hides the design document when run through curl as above) is:

exports.user_data = function(doc, req) {
    if (doc.collection == "visits" || doc.collection == "patients" || doc.collection == "reports") {
        return true;
    }
    return false;
}
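The reason the filter hides the design document can be checked outside CouchDB. The sketch below restates the same logic as a plain function and runs it against a few sample documents; the sample documents and their ids are hypothetical, and strict equality is used instead of == (which makes no difference for these string comparisons).

```javascript
// Same logic as the user_data filter, written as a plain function so it can
// be run outside CouchDB (for example with Node.js).
function userData(doc, req) {
  return doc.collection === "visits" ||
         doc.collection === "patients" ||
         doc.collection === "reports";
}

// Hypothetical sample documents, for illustration only.
const samples = [
  { _id: "patient-1", collection: "patients" },
  { _id: "settings-1", collection: "settings" },
  { _id: "_design/emr", filters: {}, views: {} } // no "collection" field
];

for (const doc of samples) {
  console.log(doc._id, "->", userData(doc, {}));
}
```

A design document normally has no collection field, so the filter returns false for it and it never shows up in the filtered _changes feed.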

Hope someone can help!

Solution

Suggestion

Try defining a filter function in another, small, dedicated design document and see if that fixes your problem.

// replicator document:
{
   "_id": "e0e38be8cc0b11356dfb03bc8400074d",
   "_rev": "1-d77117f03d63099e1e505b9f9de3371d",
   "source": "http://x.x.x.x:5984/emr",
   "target": "emr",
   "continuous": true,
   "filter": "small-design-doc/user_data",
   "create_target": true,
   "owner": "jun"
}

// _design/small-design-doc
// -- will be fetched by the replicator, but is quite small:
{
  "_id": "_design/small-design-doc",
  "_rev": "1-...",
  "filters": {
    "user_data": "function(doc, req) { ... }"
  }
}
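As a minimal sketch of how such a dedicated design document could be assembled, the snippet below serializes the filter from the question into a string and reports the size of the resulting JSON body. It assumes Node.js (for Buffer), and the variable names are made up for this example. No _rev is included because the document is new.

```javascript
// Filter logic taken from the question, kept as an anonymous function so the
// serialized source starts with "function (doc, req)".
const userData = function (doc, req) {
  if (doc.collection == "visits" || doc.collection == "patients" || doc.collection == "reports") {
    return true;
  }
  return false;
};

// CouchDB stores design-document functions as strings of source code.
const smallDesignDoc = {
  _id: "_design/small-design-doc",
  filters: {
    user_data: userData.toString()
  }
};

const body = JSON.stringify(smallDesignDoc);
console.log(body);
console.log(Buffer.byteLength(body, "utf8"), "bytes");
```

The resulting body could then be sent with a PUT to /emr/_design/small-design-doc on the source server. It is a few hundred bytes, compared with the 3 MB of the full application design document.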

Explanation

Judging from a current snapshot of the source code, it seems the replicator fetches the design document (_design/emr) from the source database simply because the filter function (emr/user_data) is defined there.

If you specify a filter function in another design document, the replicator should download only that document before starting replication. So you cannot avoid downloading a design document altogether, but you can choose which one.
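The relationship between the filter name and the document that gets fetched can be sketched like this. It only mirrors the behaviour described above and is not CouchDB's actual implementation; both function names are hypothetical.

```javascript
// Illustration only: splits a "designdoc/filtername" string into the design
// document id and the filter name.
function designDocForFilter(filter) {
  const slash = filter.indexOf("/");
  if (slash <= 0 || slash === filter.length - 1) {
    throw new Error(`expected "designdoc/filtername", got "${filter}"`);
  }
  return {
    designDocId: "_design/" + filter.slice(0, slash),
    filterName: filter.slice(slash + 1)
  };
}

// URL of the design document that would be requested from the source database.
function designDocUrl(source, filter) {
  return source.replace(/\/+$/, "") + "/" + designDocForFilter(filter).designDocId;
}

console.log(designDocUrl("http://x.x.x.x:5984/emr", "emr/user_data"));
// prints "http://x.x.x.x:5984/emr/_design/emr"
console.log(designDocUrl("http://x.x.x.x:5984/emr", "small-design-doc/user_data"));
// prints "http://x.x.x.x:5984/emr/_design/small-design-doc"
```

The first URL is the one that times out in the log from the question; the second is the much smaller document that would be requested instead.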

Great question by the way. And very thoroughly investigated!

Licensed under: CC-BY-SA with attribution