How to use MapReduce when extracting a group of document id's by some criteria from CouchDB

https://stackoverflow.com/questions/22576092

19-06-2023

Question

I'm in my first week of CouchDB experimentation and trying to stop thinking in SQL. I have a collection of documents (5000 event files) that all have some ID value that will be common to groups of documents. So there might be 10 that all have TheID: 'foobar'.

(In case someone asks - TheID is not an auto-increment value from a relational database - it is a unique id assigned by a partner company of ours. I cannot redesign my source data to identify itself some other way, I have to use this TheID field to recognise groups of documents.)

I want to query my list of documents:

{ _id: 'document1', Message: { TheID: 'foobar' } }
{ _id: 'document2', Message: { TheID: 'xyz' } }
{ _id: 'document3', Message: { TheID: 'xyz' } }
{ _id: 'document4', Message: { TheID: 'foobar' } }
{ _id: 'document5', Message: { TheID: 'wibble' } }
{ _id: 'document6', Message: { TheID: 'foobar' } }

I want the results:

'foobar': [ 'document1', 'document4', 'document6' ]
'xyz': [ 'document2', 'document3' ]
'wibble': [ 'document5' ]

The aim is to represent groups of documents on our UI grouped by TheID, so the user can see all documents for a specific TheID together, and select that TheID to drill into the data querying just by that TheID value. Yes, the string id of each document is useful - in our case, the _id value of each document is the source event identifier, so it is a unique and useful value that the user is going to want to see in the list on screen.

In SQL one might order by or group by the TheID field and iterate the result set appropriately. I doubt this thinking is any use at all with a CouchDB query.

I know that I can use a map function to extract the TheID value for each document, for example:

function (doc) {
  emit(doc.Message.TheID, 1);
}

or perhaps

function (doc) {
  emit(doc._id, doc.Message.TheID);
}

I'm not sure exactly what I should emit as the key and value. Even if this is useful, I'm getting the feeling that I should not use a reduce function to try to 'reduce' the large map output (1 result row per document in the database) to what I want (3 results each with a list of document id's).

http://guide.couchdb.org/draft/views.html says "A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each."

I thought I might be able to use reduce to scan the results of the map and somehow collect all results that have a common TheID value into a single result object. What I see when reading the reduce documentation is that it will be given arrays of keys and values that contain fairly unpredictable collections, driven by the structure of the btree underlying the map results. It won't be given arrays guaranteed to contain all similar TheID values that I could scan for. This approach seems completely broken.

So, is a map/reduce pair the right thing to do here? Should I look at using a 'show' or 'list' instead? I'm intending to build a mustache based HTML template engine around the results, so 'list' seems the wrong way to go.

Thanks in advance for any guidance.

EDIT I have done some local dev and come up with what I think is a broken solution. Hopefully this will show you the direction I'm trying to go in. See a public cloud based CouchDB I created at https://neek.iriscouch.com/_utils/database.html?test/_design/test/_view/collectByTheID

This is public. If you would like to play, please copy it to a new view, don't pollute this one in case others come in and want to see the original.

map function:

function(doc) {
  emit(doc.Message.TheID, doc._id);
}

reduce function:

function(keys, values, rereduce) {
  if (!rereduce) {
    return values;
  } else {
    var ret = [];
    values.forEach(function (ar) {
      ret.concat(ar);
    });
    return ret;
  }
}

Results:

"foobar"   ["document6", "document4", "document1"]
"wibble"   ["document5"]
"xyz"      ["document3", "document2"]

The reduce function first leaves the array of values alone, and on the second pass concatenates them together. However, when I run this on my large 5000+ document database, some TheID values come back with empty document id arrays. I believe this suffers from the problem I mentioned before: the arrays of values passed to reduce are built depending on the btree structure of the map output they are extracted from, and are not guaranteed to contain a complete set of values for a given key.
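
A side note on the reduce function above: Array.prototype.concat returns a new array and leaves the array it is called on unchanged, so the rereduce branch as written always returns an empty array. That may account for the empty document id arrays, independently of how the btree partitions the values. A minimal sketch of the difference, using made-up input arrays:

```javascript
// Hypothetical partial results, shaped like what a rereduce pass receives.
var partials = [["document1", "document4"], ["document6"]];

// As written in the reduce above: the result of concat is discarded.
var discarded = [];
partials.forEach(function (ar) {
  discarded.concat(ar); // returns a new array; `discarded` is not modified
});

// Keeping the return value accumulates the ids.
var kept = [];
partials.forEach(function (ar) {
  kept = kept.concat(ar);
});

console.log(discarded.length); // 0
console.log(kept);             // [ 'document1', 'document4', 'document6' ]
```

Even with that change, a reduce that returns ever-growing arrays still runs against the guide's advice quoted earlier.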

Solution

Make use of the group_level feature:

Map:

emit([doc.Message.TheID, doc._id], null)

Reduce:

You must include a reduce function to use group_level. It can be a no-op as below, or something else, e.g. the built-in _count.

function(keys, values){
   return null;
}
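
For reference, the same view using the built-in _count reduce might look like the design document sketched below. The names d and v are taken from the query URLs in this answer. The guard that skips documents without a Message.TheID field is an addition of this sketch, not part of the original map. With _count, the value of each grouped row would be the number of rows in that group rather than null.

```javascript
// Sketch of a design document; "d" and "v" match the URLs used in this answer.
var designDoc = {
  _id: "_design/d",
  views: {
    v: {
      // Map functions are stored as source strings in a design document.
      // "Message" is capitalised, as in the sample documents.
      map: "function (doc) {\n" +
           "  if (doc.Message && doc.Message.TheID) {\n" +
           "    emit([doc.Message.TheID, doc._id], null);\n" +
           "  }\n" +
           "}",
      // Built-in reduce: counts the rows in each group.
      reduce: "_count"
    }
  }
};

console.log(JSON.stringify(designDoc, null, 2));
```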

A query with group_level=1 would return:

/_design/d/_view/v?group_level=1

[
 {key: ["foobar"], value: null}, 
 {key: ["wibble"], value: null}, 
 {key: ["xyz"], value: null}
]

You would use this query to populate the top level in your grouping UI. When the user expands a category, you would run another query with group_level=2 and start and end keys:

/_design/d/_view/v?group_level=2&startkey=["foobar"]&endkey=["foobar",{}]

[
  {key: ["foobar", "document1"], value: null}, 
  {key: ["foobar", "document4"], value: null}, 
  {key: ["foobar", "document6"], value: null}
]
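
When issuing these queries from code, the startkey and endkey values are JSON and should be URL-encoded. A minimal sketch of a helper that builds the drill-down URL follows; the function name drillDownUrl and its base parameter are illustrative, not part of CouchDB.

```javascript
// Builds the group_level=2 query for one TheID value (hypothetical helper).
// `base` is the view URL, e.g. "/_design/d/_view/v".
function drillDownUrl(base, theId) {
  // [theId] sorts before every [theId, <docId>] key, and {} sorts after
  // any string in CouchDB's collation, so this range covers the whole group.
  var startkey = JSON.stringify([theId]);
  var endkey = JSON.stringify([theId, {}]);
  return base +
    "?group_level=2" +
    "&startkey=" + encodeURIComponent(startkey) +
    "&endkey=" + encodeURIComponent(endkey);
}

console.log(drillDownUrl("/_design/d/_view/v", "foobar"));
```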

This doesn't produce the output exactly as you are requesting; however, I think you'll find it flexible enough.
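
If the exact shape from the question is still needed, the rows can be folded into it on the client. The sketch below assumes rows keyed as [TheID, _id], such as those from the group_level=2 query above or from the same view queried with reduce=false. groupByTheID is a hypothetical helper name, and the sample rows are hand-written from the question's six documents.

```javascript
// Folds view rows keyed by [TheID, _id] into { TheID: [ _id, ... ] }.
function groupByTheID(rows) {
  var groups = {};
  rows.forEach(function (row) {
    var theId = row.key[0];
    var docId = row.key[1];
    if (!Object.prototype.hasOwnProperty.call(groups, theId)) {
      groups[theId] = [];
    }
    groups[theId].push(docId);
  });
  return groups;
}

// Hand-written sample rows, in the key order a view would return them.
var rows = [
  { key: ["foobar", "document1"], value: null },
  { key: ["foobar", "document4"], value: null },
  { key: ["foobar", "document6"], value: null },
  { key: ["wibble", "document5"], value: null },
  { key: ["xyz", "document2"], value: null },
  { key: ["xyz", "document3"], value: null }
];

console.log(groupByTheID(rows));
```

This keeps the view itself free of array-building reduces and leaves the grouping to the code that renders the template.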

Licensed under: CC-BY-SA with attribution
