Question

I understand that CouchDB hashes the source code of each design document to derive the name of its index file. Whenever I change the source code, the index needs to be rebuilt. CouchDB does this when a view in the document is requested for the first time.

What I'd expect to happen and want to happen

Each time I change a design doc, the first call to a view will take significantly longer than usual and may time out. The index will continue to build. Once this is completed, the view will only process changes and will be very fast.

What actually happens

  1. When running an amended view for the first time, I see the process in the status window slowly reach 100%. This takes about 2 hours, during which all CPUs are fully utilized.
  2. Once the process reaches 99%, it remains there for about an hour and then disappears. CPU utilization drops to just one CPU.
  3. After the process has disappeared, the data file for the view keeps growing for another half hour to an hour. CPU utilization is near 0%.
  4. The index file suddenly stops growing.

If I request the view again once I've reached stage 4, the characteristics of stage 3 start again. I have to repeat this process 5 to 50 times until I can finally retrieve the view values.

If the view gets requested a second time while still in stage 1 or 2, it will almost certainly run out of memory and I have to restart the CouchDB service. This is despite the DB rarely using more than 2 GByte when running just one job, and more than 4 GByte being free in normal operation.

I have tried tweaking configuration settings and adding more memory, but nothing seems to have an impact.

My Question

Do I misunderstand the concept of running views or is something wrong with my setup? If this is expected, is there anything I can tweak to reduce the number of reruns?

Context

My documents are pretty large (1 to 20 MByte). The data they contain is well structured; they are usually web-analytics reports that would, in a relational database, be stored as several tens of thousands of rows.

My map function extracts these rows and emits the dimensions as the key array. The key array sometimes exceeds 20 columns, though most views have fewer than 10.
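A minimal sketch of a map function in this style — the field names (`rows`, `dimensions`, `metrics`) are hypothetical stand-ins for whatever the actual report documents contain, and `emit` is stubbed so the sketch can run outside CouchDB:

```javascript
// Sketch of a map function that emits one row per report line,
// with the dimension values as the key array and the metrics as the value.
// Field names ("rows", "dimensions", "metrics") are assumptions.
function map(doc) {
  if (!doc.rows) return;
  doc.rows.forEach(function (row) {
    emit(row.dimensions, row.metrics);
  });
}

// Outside CouchDB there is no global emit(); stub it to inspect the output.
var emitted = [];
function emit(key, value) { emitted.push([key, value]); }

map({
  rows: [
    { dimensions: ["2014-01-01", "organic"], metrics: { visits: 3 } }
  ]
});
```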

The reduce function aggregates (sums) all values in rows with identical keys. The metrics are stored in a dictionary and may contain different keys; the reduce function identifies keys missing from one document and adds them to the aggregate as 0.
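A sketch of a reduce function with that behaviour — summing metric dictionaries key-wise, so a key absent from one value effectively contributes 0. Because the partial sums have the same shape as the original values, the rereduce case can use the same code path (an assumption about the value shapes, not the author's actual function):

```javascript
// Sum an array of metric dictionaries key-wise; missing keys count as 0.
// Works unchanged for rereduce, since partial sums look like ordinary values.
function reduce(keys, values, rereduce) {
  var acc = {};
  values.forEach(function (metrics) {
    Object.keys(metrics).forEach(function (k) {
      acc[k] = (acc[k] || 0) + metrics[k];
    });
  });
  return acc;
}

var total = reduce(null, [{ visits: 3 }, { visits: 1, clicks: 2 }], false);
```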

I am using CouchDB 1.5.0 on Windows Server 2008 R2 with 2 CPUs and 8 GByte of memory.

The views are written in JavaScript using the couchjs query server.

My design documents usually consist of several views, including a '_lib' view that does not emit any data but contains an exhaustive library of functions used by the actual views.


Solution

It is a known issue, but just in case: if you have gigabytes of docs, you can forget about custom reduce functions. Only built-in ones will work fast enough.
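One way to apply this advice is to emit a single numeric value per row and let a built-in Erlang reduce such as `_sum` do the aggregation, one view per metric instead of one custom reduce over a metrics dictionary. A sketch of such a design document (the names `_design/analytics`, `visitsByDimension`, and the document fields are illustrative assumptions):

```javascript
// Hypothetical design document using the built-in "_sum" reduce:
// the map emits one number per row, so no custom reduce runs in couchjs.
var designDoc = {
  _id: "_design/analytics",
  views: {
    visitsByDimension: {
      map: "function (doc) { (doc.rows || []).forEach(function (r) { emit(r.dimensions, r.metrics.visits || 0); }); }",
      reduce: "_sum"
    }
  }
};
```

The trade-off is one view per metric rather than one view returning the whole metrics dictionary, but the built-in reduce runs natively in Erlang instead of round-tripping through the JavaScript query server.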

OTHER TIPS

It is possible to set the query server timeout (`os_process_timeout`, in milliseconds) to an extra-low value, 1 second for example. This way you can detect which documents take long to index and optimize your map function for performance.
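As a sketch, the setting lives in CouchDB's `local.ini` (the 1-second value here is just the diagnostic value suggested above, not a recommended production setting):

```ini
; local.ini — kill any view indexing call that spends more than
; 1000 ms on a single document, so the slow doc shows up in the log.
[couchdb]
os_process_timeout = 1000
```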

Licensed under: CC-BY-SA with attribution