How to return a Ruby array intersection with duplicate elements? (problem with bigrams in Dice Coefficient)

StackOverflow https://stackoverflow.com/questions/1600168

Question

I'm trying to script Dice's Coefficient, but I'm having a bit of a problem with the array intersection.

def bigram(string)
  string.downcase!
  bgarray=[]
  bgstring="%"+string+"#"
  bgslength = bgstring.length
  0.upto(bgslength-2) do |i|
    bgarray << bgstring[i,2]
   end
   return bgarray
 end

def approx_string_match(teststring, refstring)
  test_bigram = bigram(teststring) #.uniq
  ref_bigram = bigram(refstring)   #.uniq

  bigram_overlay = test_bigram & ref_bigram

  result = (2*bigram_overlay.length.to_f)/(test_bigram.length.to_f+ref_bigram.length.to_f)*100

  return result
end

The problem is, as & removes duplicates, I get stuff like this:

string1="Almirante Almeida Almada"
string2="Almirante Almeida Almada"

puts approx_string_match(string1, string2) => 76.0%

It should return 100.

The uniq method nails it, but there is information loss, which may bring unwanted matches in the particular dataset I'm working.

How can I get an intersection with all duplicates included?

Was it helpful?

Solution

As Yuval F said you should use multiset. However, there is nomultiset in Ruby standard library , Take at look at here and here.

If performance is not that critical for your application, you still can do it usingArray with a little bit code.

def intersect  a , b  
    a.inject([]) do |intersect, s|
      index = b.index(s)
      unless index.nil?
         intersect << s
         b.delete_at(index)
      end
      intersect        
    end
end

a=  ["al","al","lc" ,"lc","ld"]
b = ["al","al" ,"lc" ,"ef"]
puts intersect(a ,b).inspect   #["al", "al", "lc"]

OTHER TIPS

From this link I believe you should not use Ruby's sets but rather multisets, so that every bigram gets counted the number of times it appears. Maybe you can use this gem for multisets. This should give a correct behavior for recurring bigrams.

I toyed with this, based on the answer from @pierr, for a while and ended up with this.

a = ["al","al","lc","lc","lc","lc","ld"]
b = ["al","al","al","al","al","lc","ef"]
result=[]
h1,h2=Hash.new(0),Hash.new(0)
a.each{|x| h1[x]+=1}
b.each{|x| h2[x]+=1}
h1.each_pair{|key,val| result<<[key]*[val,h2[key]].min if h2[key]!=0}
result.flatten

=> ["al", "al", "lc"]

This could be a kind of multiset intersect of a & b but don't take my word for it because I haven't tested it enough to be sure.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top