Improving Rails.cache.write by setting key-value pairs asynchronously

Question 1

There are two factors that would contribute to overall latency under normal circumstances: client side marshalling/compression and network bandwidth.

Dalli marshals and optionally compresses the data, which can be quite expensive. Here are two benchmarks: marshalling a list of random single-character strings, and marshalling plus compressing a list of integers (artificial stand-ins for something like a list of user IDs). In both cases the resulting value is on the order of 200KB. Both benchmarks were run on a Heroku dyno; performance will obviously depend on the CPU and load of the machine:

irb> val = (1..50000).to_a.map! {rand(255).chr}; nil
# a list of 50000 single character strings
irb> Marshal.dump(val).size
275832
# OK, so roughly 200K. How long does it take to perform this operation
# before even starting to talk to MemCachier?
irb> Benchmark.measure { Marshal.dump(val) }
=>   0.040000   0.000000   0.040000 (  0.044568)
# so about 45ms, and this scales roughly linearly with the length of the list.


irb> val = (1..100000).to_a; nil # a list of 100000 integers
irb> Zlib::Deflate.deflate(Marshal.dump(val)).size
177535
# OK, so roughly 200K. How long does it take to perform this operation
irb>  Benchmark.measure { Zlib::Deflate.deflate(Marshal.dump(val)) }
=>   0.140000   0.000000   0.140000 (  0.145672)

So we're basically seeing anywhere from a 40ms to 150ms performance hit just for marshalling and/or zipping data. Marshalling a String will be much cheaper, while marshalling something like a complex object will be more expensive. The cost of zipping depends on the size of the data, but also on its redundancy. For example, zipping a 1MB string of all "a" characters takes only about 10ms.
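To see how much redundancy matters, here is a minimal sketch that deflates two 1MB payloads, one made of a single repeated character and one made of random bytes. The helper name `deflated_size` is made up for this illustration; timings are left out because they depend on the machine.

```ruby
require "zlib"
require "securerandom"

# Hypothetical helper: size in bytes of the zlib-compressed form of `data`.
def deflated_size(data)
  Zlib::Deflate.deflate(data).bytesize
end

one_mb    = 1024 * 1024
redundant = "a" * one_mb                        # maximally redundant input
random    = SecureRandom.random_bytes(one_mb)   # effectively incompressible input

puts "redundant: #{one_mb} -> #{deflated_size(redundant)} bytes"
puts "random:    #{one_mb} -> #{deflated_size(random)} bytes"
```

The redundant string shrinks to a tiny fraction of its size, while the random bytes do not shrink at all, so compressing them is pure overhead.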

Network bandwidth will play a role here, but not a very significant one. MemCachier has a 1MB limit on values, and a value of that size takes approximately 20ms to transfer to/from MemCachier:

irb(main):036:0> Benchmark.measure { 1000.times { c.set("h", val, 0, :raw => true) } }
=>   0.250000  11.620000  11.870000 ( 21.284664)

This amounts to about 400Mbps (1MB * 8Mb/MB * (1000ms/s / 20ms)), which makes sense. However, for a relatively large but still smaller value of 200KB, we'd expect a 5x speedup:

irb(main):039:0> val = "a" * (1024 * 200); val.size
=> 204800
irb(main):040:0> Benchmark.measure { 1000.times { c.set("h", val, 0, :raw => true) } }
=>   0.160000   2.890000   3.050000 (  5.954258)
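The bandwidth arithmetic above can be reproduced with a few lines of Ruby. This is only a sketch: the helper names are made up, and it assumes 1MB means 1,000,000 bytes, as the 400Mbps figure does. The per-call times come from the two benchmarks above (1000 calls each, so seconds of wall time read as milliseconds per call).

```ruby
# Throughput in megabits per second for `bytes` moved in `millis` milliseconds.
def throughput_mbps(bytes, millis)
  megabits = bytes * 8 / 1_000_000.0
  megabits / (millis / 1000.0)
end

# Time in milliseconds to move `bytes` at `mbps` megabits per second.
def transfer_ms(bytes, mbps)
  bytes * 8 / (mbps * 1_000_000.0) * 1000.0
end

rate = throughput_mbps(1_000_000, 20)   # 1MB in about 20ms
puts "1MB in 20ms is #{rate.round} Mbps"
puts "200KB at that rate would take #{transfer_ms(204_800, rate).round(2)} ms"

# Ratio of the two measured per-call times from the benchmarks above.
puts "measured speedup: #{(21.284664 / 5.954258).round(2)}x"
```

At the same rate a 200KB value would need about 4ms, while the measured time is about 6ms per set, so the observed speedup is somewhat short of the ideal 5x.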

So, there are several things you might be able to do to get some speedup:

Use a faster marshalling mechanism. For example, using Array#pack("L*") to encode a list of 50,000 32-bit unsigned integers (the same length as the list in the very first benchmark) into a string of length 200,000 (4 bytes for each integer) takes only 2ms rather than 40ms. Adding compression on top of the same marshalling scheme is also very fast (about 2ms as well), but compression no longer does anything useful on random data (Ruby's Marshal produces a fairly redundant String even for a list of random integers).
Use smaller values. This would probably require deep application changes, but if you don't really need the whole list, you shouldn't be setting it. For example, the memcache protocol has append and prepend operations. If you are only ever adding new things to a long list, you could use those operations instead.
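As a rough sketch of the first suggestion, this is what packing a list of IDs looks like. The helper names and the random sample data are made up for illustration, and it assumes every ID fits in an unsigned 32-bit integer. Note that "L*" uses the machine's native byte order, so readers and writers on different architectures would want an explicit order such as "N*".

```ruby
require "benchmark"

# Hypothetical helpers: encode/decode a list of unsigned 32-bit integers.
def pack_ids(ids)
  ids.pack("L*")        # 4 bytes per integer, native byte order
end

def unpack_ids(bytes)
  bytes.unpack("L*")
end

ids    = Array.new(50_000) { rand(2**32) }   # made-up sample data
packed = pack_ids(ids)

puts "packed size:   #{packed.bytesize} bytes"
puts "marshal size:  #{Marshal.dump(ids).bytesize} bytes"
puts "pack time:     #{Benchmark.realtime { pack_ids(ids) }} s"
puts "marshal time:  #{Benchmark.realtime { Marshal.dump(ids) }} s"
puts "round trip ok: #{unpack_ids(packed) == ids}"
```

Since the packed value is already a String, it would be stored with the same :raw => true option used in the benchmarks above so that Dalli does not marshal it again.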

Finally, as suggested, removing the sets/gets from the critical path would prevent any delays from affecting HTTP request latency. You still have to get the data to the worker, so if you're using something like a work queue, the message you send to the worker should contain only instructions on which data to construct rather than the data itself (or you're in the same hole again, just with a different system). A very lightweight approach (in terms of coding effort) would be to simply fork a process:

mylist = Student.where(...).all.map!(&:id)
# ...I need to update memcache with the new list of students...
pid = fork do
  # The child has to create its own Dalli client
  client = Dalli::Client.new
  client.set("mylistkey", mylist)
  # this will block for the same time as before, but is running in a separate process
end
# Reap the child in the background so it doesn't linger as a zombie
Process.detach(pid)

I haven't benchmarked a full example, but since you're not exec'ing, and Linux fork is copy-on-write, the overhead of the fork call itself should be minimal. On my machine, it's about 500us (that's microseconds, not milliseconds).
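If you want to check the fork overhead on your own machine, a minimal sketch along these lines should do. The method name and the sample list are made up, and a short sleep stands in for the blocking Dalli call, so this measures only how long the parent is held up, not a real cache write.

```ruby
require "benchmark"

# Hypothetical helper: seconds the parent spends inside fork before it can carry on.
def fork_overhead_seconds(list)
  pid = nil
  elapsed = Benchmark.realtime do
    pid = fork do
      Marshal.dump(list)   # stand-in for the client-side marshalling
      sleep 0.2            # stand-in for the blocking network call
      exit!(0)             # leave without running at_exit handlers inherited from the parent
    end
  end
  Process.detach(pid)      # reap the child in the background
  elapsed
end

mylist  = Array.new(50_000) { |i| i }   # made-up sample data
elapsed = fork_overhead_seconds(mylist)
puts "fork returned to the parent after #{(elapsed * 1000).round(3)} ms"
```

The parent gets control back as soon as fork returns, well before the child finishes its 200ms of simulated work. The number will grow with the size of the parent process, so it is worth measuring inside the real application.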

Question 2

Using Rails.cache.write to prefetch and store data in the cache with workers (e.g. Sidekiq) is what I've seen at high volumes. Of course, there is a trade-off between speed and the money you want to spend. Think about:

the most used paths in your app (is active_students accessed often?);
what to store (just IDs or the entire objects or further down the chain);
if you can optimize that query (n+1?).
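To illustrate the worker approach without pulling in Sidekiq, here is a small sketch that uses an in-process Queue and Thread purely as stand-ins for a real work queue. Everything named here is hypothetical: CACHE plays the role of Rails.cache, load_active_student_ids plays the role of the database query, and the key format is made up. The point is that the enqueued message carries only the instruction (which school), never the list itself.

```ruby
# Stand-ins, all hypothetical: CACHE for Rails.cache, and a fake query.
CACHE = {}

def load_active_student_ids(school_id)
  (1..5).map { |n| school_id * 100 + n }
end

jobs = Queue.new

worker = Thread.new do
  # The worker receives a small instruction and rebuilds the data itself.
  while (job = jobs.pop) != :stop
    key = "active_students/#{job[:school_id]}"
    CACHE[key] = load_active_student_ids(job[:school_id])
  end
end

# The request path enqueues a tiny message instead of the full list.
jobs << { school_id: 7 }
jobs << :stop
worker.join

p CACHE["active_students/7"]
```

With a real queue the same shape applies: pass the ID, and let the worker run the query and call Rails.cache.write.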

Also, if you really need speed, consider using a dedicated memcache service, instead of a Heroku add-on.