Question

I have tried reading the documents in different formats. I have an array of keys to use for reading, and each batch contains 1000 keys. I have 6 Riak nodes and do reads with r=1, connecting each time to the same Riak node. The documents are just profile fields, so nothing big. I've checked CPU and disk usage on the nodes, and I can observe only a slight increase, not even close to 20% overall CPU.

Method 1: multiGet (code section linked here):

$data = $riak->multiGet($ids); /* EXECUTION TIME: 80 seconds */

Method 2: key_filter (code section linked here)

$data = $riak->multiGetBetween($id1, $id2); /* GIVES RIAK INTERNAL TIMEOUT */

Method 3: one by one get

foreach ($ids as $key) {
    $riak->get($key);
    $data[$key] = $riak->document->data;
} /* EXECUTION TIME: 20 seconds */

As you can see, method 3 is the best, but the problem I have with all of them is that I cannot run more than 2 threads. If I try to run more, I get a socket connection timeout. I checked the Linux open file limit and it's 240k. I've run out of options for what to try here. Any ideas?


Solution

The recommended approach for retrieving multiple objects is to use multiple connections in order to parallelise the work, and to connect to all available nodes so the load is spread out across the cluster. This has the benefit that it returns all the object data as well as the metadata, and it results in a quorum read and read-repair being performed. It does, however, work best for clients that have good support for concurrency and/or threading.
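For example, assuming the PHP client talks to Riak over the HTTP interface, a sketch along the following lines could replace the sequential loop from method 3. It uses PHP's curl_multi functions to issue the GETs concurrently and spreads them across several nodes; the node list and bucket name are placeholders, and $ids is the same key array as in the question.

$nodes  = array('riak1:8098', 'riak2:8098', 'riak3:8098'); /* placeholder host:port list */
$bucket = 'profiles';                                       /* placeholder bucket name */

$mh      = curl_multi_init();
$handles = array();
$i       = 0;

foreach ($ids as $key) {
    /* round-robin the requests across the nodes */
    $node = $nodes[$i++ % count($nodes)];
    $ch   = curl_init("http://{$node}/buckets/{$bucket}/keys/" . rawurlencode($key));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$key] = $ch;
}

/* drive all transfers concurrently */
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status == CURLM_OK);

$data = array();
foreach ($handles as $key => $ch) {
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
        $data[$key] = json_decode(curl_multi_getcontent($ch), true);
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

With 1000 keys per batch, you may also want to add the handles in smaller chunks rather than all at once, so that a single batch does not open a thousand sockets simultaneously.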

For client libraries that do not, a common approach is to perform multi-GETs as a MapReduce job. This is a reasonably heavyweight way to query data, as it requires Riak to set up and execute the MapReduce job, so running large numbers of concurrent MapReduce jobs can put a lot of load on the system. It also does not result in a quorum read, and read-repair will not be triggered.

This is what you are doing in the method 2 example. If you know the keys you wish to retrieve, it would however be more efficient to specify them directly rather than use a key filter, as Riak then has to scan far fewer objects. If you are using the LevelDB backend, you could also base your query on a secondary index lookup. In your example I also noted that you are using a JavaScript map function. This is considerably slower than using Erlang functions, and it relies on a pool of JavaScript VMs whose size is specified in the app.config file. There is an Erlang function available that returns the object value, and I believe the map phase for this should be specified as map(array("riak_kv_mapreduce", "map_object_value")).
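As a rough illustration, and assuming the HTTP /mapred endpoint is used, the job from method 2 could be expressed along these lines, with the keys listed explicitly as inputs and the Erlang map_object_value function in place of the JavaScript map phase. The node address and bucket name are again placeholders.

$bucket = 'profiles';                          /* placeholder bucket name */
$inputs = array();
foreach ($ids as $key) {
    $inputs[] = array($bucket, $key);          /* explicit [bucket, key] pairs instead of a key filter */
}

$job = array(
    'inputs' => $inputs,
    'query'  => array(
        array('map' => array(
            'language' => 'erlang',
            'module'   => 'riak_kv_mapreduce',
            'function' => 'map_object_value',
        )),
    ),
);

$ch = curl_init('http://riak1:8098/mapred');   /* placeholder node address */
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($job));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = json_decode(curl_exec($ch), true);   /* array of object values */
curl_close($ch);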

Some time ago I experimented a little with creating MapReduce functions that return all the important data of Riak objects, e.g. indexes, metadata and the vector clock. The results are encoded as JSON, which means these functions are limited to data that is valid JSON. The functions, together with some simple examples and documentation, can be found in my GitHub repository. Please note that this has not been tested extensively, and I have so far not gotten around to turning the resulting output into Riak objects for the client libraries that could benefit from it.

Another way to get around the issue of having to retrieve large numbers of objects from Riak is to de-normalise the data model so that common queries can be served through a smaller number of requests. This is the approach I generally recommend, as it tends to scale well. If your data is read-heavy, it usually makes sense to do a bit more work when inserting or updating it in order to ensure it can be read efficiently. Exactly how to do this will, however, depend a lot on your data and your access patterns.
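As a purely hypothetical sketch of what this could look like for profile data: rather than fetching 1000 small objects per request, you could maintain one aggregated document per user that is rewritten whenever a profile field changes, so the read path becomes a single GET. The bucket name, key and put()/get() signatures below are invented for the example and do not correspond to a particular client library.

/* Write path: pay the extra work once, when the profile is updated. */
function saveProfilePage($riak, $userId, array $profileFields, array $recentActivity)
{
    $page = array(
        'profile'  => $profileFields,
        'activity' => $recentActivity,
        'updated'  => time(),
    );
    $riak->put('profile_pages', $userId, $page); /* hypothetical put(bucket, key, value) */
}

/* Read path: one request, regardless of how many fields the page shows. */
$page = $riak->get('profile_pages', $userId);    /* hypothetical get(bucket, key) */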

OTHER TIPS

If you're not using a load balancer between your PHP application and Riak, I would recommend using HAProxy to ensure your application does not connect to just one Riak node.

Riak works best when requests are spread out evenly among all nodes in the cluster.
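A minimal haproxy.cfg sketch, assuming the PHP client talks to Riak over HTTP on port 8098, might look like the following; the host names are placeholders. The application then connects to the local HAProxy endpoint, and requests are balanced round-robin across the nodes.

listen riak_http
    bind 127.0.0.1:8098
    mode http
    balance roundrobin
    option httpchk GET /ping
    server riak1 riak1.example.com:8098 check
    server riak2 riak2.example.com:8098 check
    server riak3 riak3.example.com:8098 check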
