How do I use Redis in Ruby on Rails to take the dot product of two hashes efficiently?
29-01-2021
Question
The features table in my database has a column called token_vector that holds a data structure like this (a hash):
Feature.find(1).token_vector = { "a" => 0.1, "b" => 0.2, "c" => 0.3 }
There are 25 of these features. First, I entered the data into Redis with this in script/console:
REDIS.set( "feature1",
"#{ TokenVector.to_json Feature.find(1).token_vector }"
)
# ...
REDIS.set( "feature25",
"#{ TokenVector.to_json Feature.find(25).token_vector }"
)
TokenVector.to_json converts the hash into JSON format first. The 25 JSON hashes stored in Redis take up about 8 MB.
I have a method called Analysis#locate. This method takes the dot product of two token_vectors. The dot product for hashes works like this:
hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }
For each key that appears in both hashes (a, b, and c in this case, but not d), the two values are multiplied together, and the products are then added up.
The value for a in hash1 is 1, and the value for a in hash2 is 4. Multiply these to get 1*4 = 4.
The value for b in hash1 is 2, and the value for b in hash2 is 5. Multiply these to get 2*5 = 10.
The value for c in hash1 is 3, and the value for c in hash2 is 6. Multiply these to get 3*6 = 18.
The value for d in hash1 is nonexistent, and the value for d in hash2 is 7. In this case, set d = 0 for the first hash. Multiply these to get 0*7 = 0.
Now add up the multiplied values: 4 + 10 + 18 + 0 = 32. This is the dot product of hash1 and hash2.
Analysis.locate( hash1, hash2 ) # => 32
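The arithmetic described above can be sketched in plain Ruby. This is only an illustration of the calculation, not the actual Analysis.locate; the method name hash_dot is made up for this example.

```ruby
# Hypothetical helper (not from the original post): dot product of two
# hashes, treating a key that is missing from either hash as 0.
def hash_dot(hash1, hash2)
  hash1.sum { |key, value| value * hash2.fetch(key, 0) }
end

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

puts hash_dot(hash1, hash2) # 1*4 + 2*5 + 3*6 = 32
```

Keys that exist only in hash2 (d here) are never visited, which has the same effect as multiplying them by 0.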
I have a frequently used method, Analysis#topicize. It takes one parameter, token_vector, which is just a hash like the ones above. Analysis#topicize takes the dot product of token_vector with each of the 25 features' token_vectors and collects those 25 dot products into a new vector called feature_vector. A feature_vector is just an array. Here is what the code looks like:
def self.topicize token_vector
  feature_vector = FeatureVector.new
  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature1" ) )
  )
  # ...
  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature25" ) )
  )
  feature_vector
end
As you can see, it takes the dot product of token_vector and each feature's token_vector that I entered into Redis above, and pushes the value into an array.
My problem is that this takes about 18 seconds each time I invoke the method. Am I misusing Redis? I think the problem could be that I shouldn't load Redis data into Ruby. Am I supposed to send Redis the data (token_vector) and write a Redis function that computes the dot product, rather than writing it in Ruby code?
Solution
You would have to profile it to be sure, but I suspect you're losing a lot of time in serializing/deserializing JSON objects. Instead of turning token_vector into a JSON string, why not put it directly into Redis, since Redis has its own hash type?
REDIS.hmset "feature1", *Feature.find(1).token_vector.flatten
# ...
REDIS.hmset "feature25", *Feature.find(25).token_vector.flatten
Hash#flatten turns a hash like { 'a' => 1, 'b' => 2 } into an array like [ 'a', 1, 'b', 2 ], and then we use splat (*) to send each element of the array as a separate argument to Redis#hmset (the "m" in "hmset" is for "multiple," as in "set multiple hash values at once").
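To see what flatten and splat do without touching Redis, here is a small sketch. fake_hmset is a made-up stand-in for Redis#hmset that only echoes back what it was given.

```ruby
# Stand-in for Redis#hmset (hypothetical, for illustration only): it just
# returns the arguments it received so we can inspect them.
def fake_hmset(key, *field_value_pairs)
  [key, field_value_pairs]
end

token_vector = { "a" => 0.1, "b" => 0.2 }

flat = token_vector.flatten
p flat # ["a", 0.1, "b", 0.2]

# The splat passes each element of the array as a separate argument.
key, pairs = fake_hmset("feature1", *flat)
p pairs.each_slice(2).to_h == token_vector # true
```

redis-rb also provides Redis#mapped_hmset, which accepts the hash directly, so it may be worth checking whether that reads better in your code.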
Then, when you want to get it back out, use Redis#hgetall, which returns a Ruby Hash (Redis only returns strings, so the values come back as strings rather than floats):
def self.topicize token_vector
  feature_vector = FeatureVector.new
  feature_vector.push locate( token_vector, REDIS.hgetall("feature1") )
  # ...
  feature_vector.push locate( token_vector, REDIS.hgetall("feature25") )
  feature_vector
end
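However, if you only care about the values from the hash and not the keys, you can streamline things a little more by using Redis#hvals instead of hgetall, since it returns just an array of the values. Note that this only gives the same result as a key-based dot product when the values of the two vectors line up position by position.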
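To check the suspicion about JSON overhead before changing anything, a rough timing sketch with Ruby's standard Benchmark module might look like this. The hash size is an arbitrary assumption, not a figure from the question, so substitute one of your real token_vectors for a meaningful number.

```ruby
require 'benchmark'
require 'json'

# Synthetic stand-in for one feature's token_vector; the size is an
# arbitrary assumption chosen for illustration.
token_vector = {}
50_000.times { |i| token_vector["token#{i}"] = i * 0.001 }
json = JSON.generate(token_vector)

# Time 25 parses, mirroring the 25 REDIS.get + from_json calls in topicize.
elapsed = Benchmark.realtime do
  25.times { JSON.parse(json) }
end

puts "25 JSON parses of #{json.bytesize} bytes took #{elapsed.round(3)} s"
```

If the parsing time is a small fraction of the 18 seconds, the bottleneck is more likely in the dot product itself or in the network round trips.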
The second place you might be spending a lot of cycles is in locate, whose source you haven't provided. There are a lot of ways to write a dot product method in Ruby, and some of them are more performant than others. This ruby-talk thread covers some valuable ground. One of the posters points to NArray, a library that implements numeric arrays and vectors in C.
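As an illustration of how much room there is in pure Ruby, here are two ways to write a dot product over equal-length arrays. Both method names are made up for this sketch, and which one is faster depends on the Ruby version and data size, so it is worth measuring rather than assuming.

```ruby
# Index-based version using inject.
def dot_inject(xs, ys)
  (0...xs.length).inject(0) { |acc, i| acc + xs[i] * ys[i] }
end

# Pairwise version using zip and sum.
def dot_zip(xs, ys)
  xs.zip(ys).sum { |x, y| x * y }
end

xs = [1, 2, 3]
ys = [4, 5, 6]

puts dot_inject(xs, ys) # 4 + 10 + 18 = 32
puts dot_zip(xs, ys)    # 32
```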
If I understand your code correctly, it could be reimplemented something like this (prereq: gem install narray):
require 'narray'

def self.topicize token_vector
  # Make sure token_vector is an NVector
  token_vector = NVector.to_na token_vector unless token_vector.is_a? NVector
  num_feats = 25

  # Use Redis#multi to bundle every operation into one call.
  # It will return an array of all 25 features' token_vectors.
  # The commands are queued on the block argument, which recent
  # redis-rb versions require.
  feat_token_vecs = REDIS.multi do |tx|
    num_feats.times do |feat_idx|
      tx.hvals "feature#{feat_idx + 1}"
    end
  end

  pad_to_len = token_vector.length

  # Get the dot product of each of those arrays with token_vector
  feat_token_vecs.map do |feat_vec|
    # Make sure the array is long enough by padding it out with zeroes (using
    # pad_arr, defined below). (Since Redis only returns strings we have to
    # convert each value with String#to_f first.)
    feat_vec = pad_arr feat_vec.map(&:to_f), pad_to_len

    # Then convert it to an NVector and do the dot product
    token_vector * NVector.to_na(feat_vec)

    # If we need to get a Ruby Array out instead of an NVector use #to_a, e.g.:
    # ( token_vector * NVector.to_na(feat_vec) ).to_a
  end
end

# Utility to pad out array with zeroes to desired size
def pad_arr arr, size
  arr.length < size ? arr + Array.new(size - arr.length, 0) : arr
end
Hope that's helpful!
OTHER TIPS
This isn't really an answer, just a follow-up to my previous comment that is too long to fit in a comment. It looks like the Hash/TokenVector issue might not have been the only problem. I do:
token_vector = Feature.find(1).token_vector
Analysis.locate( token_vector, TokenVector[ REDIS.hgetall( "feature1" ) ] )
and get this error:
TypeError: String can't be coerced into Float
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `*'
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `block in dot'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `each'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `inject'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `dot'
from /Users/RedApple/S/lib/analysis/analysis.rb:223:in `locate'
from (irb):6
from /Users/RedApple/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in `<main>'
Analysis#locate looks like this:
def self.locate vector1, vector2
  vector1.dot vector2
end
Here is the relevant part of analysis/vectors.rb lines 23-28, the TokenVector#dot method:
def dot vector
  inject 0 do |product,item|
    axis, value = item
    product + value * ( vector[axis] || 0 )
  end
end
I am not sure where the problem is.
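One likely cause, going by the note in the answer above that Redis only returns strings: the values in the hash from REDIS.hgetall are strings such as "0.1", so value * vector[axis] ends up multiplying a Float by a String. Here is a minimal sketch without Redis, using a hand-written hash in place of the hgetall result (an assumption for illustration):

```ruby
# Simulated result of REDIS.hgetall("feature1"): every value is a String.
from_redis = { "a" => "0.1", "b" => "0.2", "c" => "0.3" }

begin
  0.5 * from_redis["a"]
rescue TypeError => e
  puts e.message # String can't be coerced into Float
end

# Converting the values to floats first avoids the error. Hash[...] with
# map is used here because it also works on older Rubies such as 1.9.
as_floats = Hash[from_redis.map { |key, value| [key, value.to_f] }]
p as_floats
puts 0.5 * as_floats["a"]
```

If this is the cause, converting the values before building the TokenVector (or inside TokenVector#dot) should make the error go away.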