Question

The features table in my database has a column called token_vector, which stores a hash like this:

Feature.find(1).token_vector = { "a" => 0.1, "b" => 0.2, "c" => 0.3 }

There are 25 of these features. First, I entered the data into Redis with this in script/console:

REDIS.set(  "feature1",
            "#{ TokenVector.to_json Feature.find(1).token_vector }"
)
# ...
REDIS.set(  "feature25",
            "#{ TokenVector.to_json Feature.find(25).token_vector }"
)

TokenVector.to_json converts the hash into JSON format first. The 25 JSON hashes stored in Redis take up about 8 MB.

I have a method, called Analysis#locate. This method takes the dot product between two token_vectors. The dot product for hashes works like this:

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

For each key that appears in both hashes (a, b, and c in this case, but not d), the two values are multiplied together, and the products are then added up.

The value for a in hash1 is 1, the value for a in hash2 is 4. Multiply these to get 1*4 = 4.

The value for b in hash1 is 2, the value for b in hash2 is 5. Multiply these to get 2*5 = 10.

The value for c in hash1 is 3, the value for c in hash2 is 6. Multiply these to get 3*6 = 18.

The value for d in hash1 is nonexistent, the value for d in hash2 is 7. In this case, set d = 0 for the first hash. Multiply these to get 0*7 = 0.

Now add up the multiplied values. 4 + 10 + 18 + 0 = 32. This is the dot product of hash1 and hash2.

Analysis.locate( hash1, hash2 ) # => 32
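The worked example above can be sketched in plain Ruby as follows. This is only an illustration of the arithmetic: hash_dot is a hypothetical helper name, not the actual implementation behind Analysis.locate.

```ruby
# Hypothetical helper (not the real Analysis.locate): dot product of two
# hashes, treating a key that is missing from the second hash as 0.
# Keys that exist only in hash2 are never visited, which is equivalent
# to multiplying them by 0.
def hash_dot(hash1, hash2)
  hash1.inject(0) do |sum, (key, value)|
    sum + value * hash2.fetch(key, 0)
  end
end

hash1 = { "a" => 1, "b" => 2, "c" => 3 }
hash2 = { "a" => 4, "b" => 5, "c" => 6, "d" => 7 }

puts hash_dot(hash1, hash2) # 1*4 + 2*5 + 3*6 = 32
```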

I have a method that is often used, Analysis#topicize. This method takes in a parameter, token_vector, which is just a hash, similar to above. Analysis#topicize takes the dot product of token_vector and each of the 25 features' token_vectors, and creates a new vector of those 25 dot products, called feature_vector. A feature_vector is just an array. Here is what the code looks like:

def self.topicize token_vector

  feature_vector = FeatureVector.new

  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature1" ) )
  )
  # ...
  feature_vector.push(
    locate( token_vector, TokenVector.from_json( REDIS.get "feature25" ) )
  )

  feature_vector

end

As you can see, it takes the dot product of token_vector and each feature's token_vector that I entered into Redis above, and pushes the value into an array.

My problem is, this takes about 18 seconds each time I invoke the method. Am I misusing Redis? I think the problem could be that I shouldn't load Redis data into Ruby. Am I supposed to send Redis the data (token_vector) and write a Redis function to have it do the dot_product function, rather than writing it with Ruby code?


Solution

You would have to profile it to be sure, but I suspect you're losing a lot of time in serializing/deserializing JSON objects. Instead of turning token_vector into a JSON string, why not put it directly into Redis, since Redis has its own hash type?

REDIS.hmset "feature1",   *Feature.find(1).token_vector.flatten
# ...
REDIS.hmset "feature25",  *Feature.find(25).token_vector.flatten

Hash#flatten turns a hash like { 'a' => 1, 'b' => 2 } into an array like [ 'a', 1, 'b', 2 ], and the splat (*) then sends each element of that array as a separate argument to Redis#hmset (the "m" in "hmset" is for "multiple," as in "set multiple hash values at once").
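To see the flatten-and-splat step without a Redis connection, here is a small sketch. show_args is a hypothetical stand-in for any method that accepts a key followed by a variable number of field/value arguments; it is not part of the Redis client.

```ruby
token_vector = { "a" => 0.1, "b" => 0.2 }

# Hash#flatten interleaves keys and values into one flat array
flat = token_vector.flatten
p flat # => ["a", 0.1, "b", 0.2]

# Hypothetical stand-in for a variadic method such as hmset:
# it just returns what it received so the splat is visible.
def show_args(key, *field_value_pairs)
  [key, field_value_pairs]
end

p show_args("feature1", *flat) # => ["feature1", ["a", 0.1, "b", 0.2]]
```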

Then, when you want to get it back out, use Redis#hgetall, which automatically returns a Ruby Hash:

def self.topicize token_vector
  feature_vector = FeatureVector.new

  feature_vector.push locate( token_vector, REDIS.hgetall "feature1" )
  # ...
  feature_vector.push locate( token_vector, REDIS.hgetall "feature25" )

  feature_vector
end

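However, if every vector stores the same keys in the same order, you can streamline things a little more by using Redis#hvals instead of hgetall, since it returns just an array of the values. Be careful, though: the dot product described in the question matches values by key, so dropping the keys is only safe when the positions in the two arrays line up.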

The second place you might be spending a lot of cycles is in locate. You haven't provided its source, but there are many ways to write a dot product method in Ruby, and some of them are more performant than others. This ruby-talk thread covers some valuable ground. One of the posters points to NArray, a library that implements numeric arrays and vectors in C.
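Before rewriting anything, it may be worth measuring which of the two suspects (JSON deserialization or the dot product itself) actually dominates. The following is a rough sketch using Ruby's standard Benchmark and JSON libraries. The vector sizes and token names are made-up assumptions, and it parses a local string rather than reading from Redis, so treat the numbers as indicative only.

```ruby
require 'benchmark'
require 'json'

# Synthetic stand-in for one feature's token_vector (size is an assumption)
feature = Hash[(1..50_000).map { |i| ["token#{i}", i * 0.001] }]
json    = JSON.generate(feature)

# Synthetic stand-in for the token_vector passed to topicize
query = Hash[(1..1_000).map { |i| ["token#{i}", 1.0] }]

# Time the deserialization step on its own
parse_time = Benchmark.realtime { JSON.parse(json) }

# Time a plain-Ruby hash dot product on its own
dot_time = Benchmark.realtime do
  query.inject(0) { |sum, (key, value)| sum + value * feature.fetch(key, 0) }
end

puts format("JSON.parse: %.4fs, dot product: %.4fs", parse_time, dot_time)
```

If the parse time is much larger than the dot product time, the serialization format is the first thing to change; if it is the other way around, the dot product implementation deserves attention first.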

If I understand your code correctly, it could be reimplemented something like this (prerequisite: gem install narray):

require 'narray'

def self.topicize token_vector
  # Make sure token_vector is an NVector
  token_vector  = NVector.to_na token_vector unless token_vector.is_a? NVector
  num_feats     = 25

  # Use Redis#multi to bundle every operation into one call.
  # It will return an array of all 25 features' token_vectors.
  feat_token_vecs = REDIS.multi do
    num_feats.times do |feat_idx|
      REDIS.hvals "feature#{feat_idx + 1}"
    end
  end

  pad_to_len = token_vector.length

  # Get the dot product of each of those arrays with token_vector
  feat_token_vecs.map do |feat_vec|
    # Make sure the array is long enough by padding it out with zeroes (using
    # pad_arr, defined below). (Since Redis only returns strings we have to
    # convert each value with String#to_f first.)
    feat_vec = pad_arr feat_vec.map(&:to_f), pad_to_len

    # Then convert it to an NVector and do the dot product
    token_vector * NVector.to_na(feat_vec)

    # If we need to get a Ruby Array out instead of an NVector use #to_a, e.g.:
    # ( token_vector * NVector.to_na(feat_vec) ).to_a
  end
end

# Utility to pad out array with zeroes to desired size
def pad_arr arr, size
  arr.length < size ?
    arr + Array.new(size - arr.length, 0) : arr
end

Hope that's helpful!

OTHER TIPS

This isn't really an answer, just a follow-up to my previous comment, since it probably won't fit into a comment. It looks like the Hash/TokenVector issue might not have been the only problem. I run:

token_vector = Feature.find(1).token_vector
Analysis.locate( token_vector, TokenVector[ REDIS.hgetall( "feature1" ) ] )

and get this error:

TypeError: String can't be coerced into Float
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `*'
from /Users/RedApple/S/lib/analysis/vectors.rb:26:in `block in dot'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `each'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `inject'
from /Users/RedApple/S/lib/analysis/vectors.rb:24:in `dot'
from /Users/RedApple/S/lib/analysis/analysis.rb:223:in `locate'
from (irb):6
from /Users/RedApple/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in `<main>'

Analysis#locate looks like this:

def self.locate vector1, vector2
  vector1.dot vector2
end

Here is the relevant part of analysis/vectors.rb lines 23-28, the TokenVector#dot method:

def dot vector
  inject 0 do |product,item|
    axis, value = item
    product + value * ( vector[axis] || 0 )
  end
end

I am not sure where the problem is.
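One likely cause, going by the answer's remark that Redis only returns strings: the hash that comes back from hgetall has String values, so value * vector[axis] ends up multiplying a Float by a String. The sketch below reproduces that situation without Redis, using a hand-written hash as an assumed stand-in for the hgetall result, and converts the values with String#to_f before any arithmetic.

```ruby
# Assumed shape of a hash read back from Redis: every value is a String
from_redis = { "a" => "0.1", "b" => "0.2", "c" => "0.3" }

# Multiplying a Float by a String raises a TypeError
begin
  0.5 * from_redis["a"]
rescue TypeError => e
  puts "TypeError: #{e.message}"
end

# Convert the values back to Floats before doing the dot product
numeric = Hash[from_redis.map { |key, value| [key, value.to_f] }]

puts 0.5 * numeric["a"]
```

With the real data, the same conversion would be applied to the result of REDIS.hgetall before wrapping it in TokenVector[...].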

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow