Question

I have tried reading the documents in different formats. I have an array of keys to use for reading, and each batch contains 1000 keys. I have 6 Riak nodes and do reads with r=1, connecting each time to the same Riak node. The documents are just profile fields, so nothing big. I've checked CPU and disk usage on the nodes, and I can observe only a slight increase, not even close to 20% overall CPU.

Method 1: multiGet (code section linked here):

$data = $riak->multiGet($ids); /* EXECUTION TIME: 80 seconds */

Method 2: key_filter (code section linked here)

$data = $riak->multiGetBetween($id1, $id2); /* GIVES RIAK INTERNAL TIMEOUT */

Method 3: one by one get

foreach ($ids as $key) {
    $riak->get($key);
    $data[$key] = $riak->document->data;
} /* EXECUTION TIME: 20 seconds */

As you can see, method 3 is the best, but the problem I have with all of them is that I cannot run more than 2 threads. If I try to run more, I get a socket connection timeout. I checked the Linux open file limit and it's 240k. I've run out of options for what to try here. Any ideas?


Solution

The recommended approach for retrieving multiple objects is to use multiple connections in order to parallelise the work, and to connect to all available nodes so the load is spread out across the cluster. This has the benefit that it returns all the object data as well as the metadata, and it results in a quorum read and read-repair being performed. It does, however, work best for clients that have good support for concurrency and/or threading.
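For example, assuming the PHP client talks to Riak over the HTTP interface, a sketch along the following lines could replace the sequential loop from method 3. It uses PHP's curl_multi functions to issue the GETs concurrently and spreads them across several nodes; the node list and bucket name are placeholders, and $ids is the same key array as in the question.

$nodes  = array('riak1:8098', 'riak2:8098', 'riak3:8098'); /* placeholder host:port list */
$bucket = 'profiles';                                       /* placeholder bucket name */

$mh      = curl_multi_init();
$handles = array();
$i       = 0;

foreach ($ids as $key) {
    /* round-robin the requests across the nodes */
    $node = $nodes[$i++ % count($nodes)];
    $ch   = curl_init("http://{$node}/buckets/{$bucket}/keys/" . rawurlencode($key));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$key] = $ch;
}

/* drive all transfers concurrently */
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status == CURLM_OK);

$data = array();
foreach ($handles as $key => $ch) {
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
        $data[$key] = json_decode(curl_multi_getcontent($ch), true);
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

With 1000 keys per batch, you may also want to add the handles in smaller chunks rather than all at once, so that a single batch does not open a thousand sockets simultaneously.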

For client libraries that do not, a common approach is to perform multi-GETs as a MapReduce job. This is a reasonably heavyweight way to query data, as it requires Riak to set up and execute the MapReduce job, so running large numbers of concurrent MapReduce jobs can put a lot of load on the system. It also does not result in a quorum read, and read-repair will not be triggered.

This is what you are doing in the method 2 example. If you know the keys you wish to retrieve, it would however be more efficient to specify them directly rather than use a key filter, as Riak then has to scan far fewer objects. If you are using the LevelDB backend, you could also base your query on a secondary index lookup. In your example I also noted that you are using a JavaScript map function. This is considerably slower than using Erlang functions, and it relies on a pool of JavaScript VMs whose size is specified in the app.config file. There is an Erlang function available that returns the object value, and I believe the map phase for this should be specified as map(array("riak_kv_mapreduce", "map_object_value")).
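As a rough illustration, and assuming the HTTP /mapred endpoint is used, the job from method 2 could be expressed along these lines, with the keys listed explicitly as inputs and the Erlang map_object_value function in place of the JavaScript map phase. The node address and bucket name are again placeholders.

$bucket = 'profiles';                          /* placeholder bucket name */
$inputs = array();
foreach ($ids as $key) {
    $inputs[] = array($bucket, $key);          /* explicit [bucket, key] pairs instead of a key filter */
}

$job = array(
    'inputs' => $inputs,
    'query'  => array(
        array('map' => array(
            'language' => 'erlang',
            'module'   => 'riak_kv_mapreduce',
            'function' => 'map_object_value',
        )),
    ),
);

$ch = curl_init('http://riak1:8098/mapred');   /* placeholder node address */
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($job));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = json_decode(curl_exec($ch), true);   /* array of object values */
curl_close($ch);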

Some time ago I experimented a little with creating MapReduce functions that return all the important data of Riak objects, e.g. indexes, metadata and the vector clock. The results are encoded as JSON, which means these functions are limited to data that is valid JSON. The functions, together with some simple examples and documentation, can be found in my GitHub repository. Please note that this has not been tested extensively, and I have so far not gotten around to turning the resulting output into Riak objects for the client libraries that could benefit from it.

Another way to get around the issue of having to retrieve large numbers of objects from Riak is to de-normalise the data model so that common queries can be served through a smaller number of requests. This is the approach I generally recommend, as it tends to scale well. If your data is read-heavy, it usually makes sense to do a bit more work when inserting or updating it in order to ensure it can be read efficiently. Exactly how to do this will, however, depend a lot on your data and your access patterns.
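As a purely hypothetical sketch of what this could look like for profile data: rather than fetching 1000 small objects per request, you could maintain one aggregated document per user that is rewritten whenever a profile field changes, so the read path becomes a single GET. The bucket name, key and put()/get() signatures below are invented for the example and do not correspond to a particular client library.

/* Write path: pay the extra work once, when the profile is updated. */
function saveProfilePage($riak, $userId, array $profileFields, array $recentActivity)
{
    $page = array(
        'profile'  => $profileFields,
        'activity' => $recentActivity,
        'updated'  => time(),
    );
    $riak->put('profile_pages', $userId, $page); /* hypothetical put(bucket, key, value) */
}

/* Read path: one request, regardless of how many fields the page shows. */
$page = $riak->get('profile_pages', $userId);    /* hypothetical get(bucket, key) */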

OTHER TIPS

If you're not using a load balancer between your PHP application and Riak, I would recommend using HAProxy to ensure your application does not connect to just one Riak node.

Riak works best when requests are spread out evenly among all nodes in the cluster.
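A minimal haproxy.cfg sketch, assuming the PHP client talks to Riak over HTTP on port 8098, might look like the following; the host names are placeholders. The application then connects to the local HAProxy endpoint, and requests are balanced round-robin across the nodes.

listen riak_http
    bind 127.0.0.1:8098
    mode http
    balance roundrobin
    option httpchk GET /ping
    server riak1 riak1.example.com:8098 check
    server riak2 riak2.example.com:8098 check
    server riak3 riak3.example.com:8098 check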
