Euclidean distance between posts based on tags

https://stackoverflow.com/questions/1877725

18-09-2019

Question

I am playing with the Euclidean distance example from the book Programming Collective Intelligence:


# Returns a distance-based similarity score for person1 and person2 
def sim_distance(prefs,person1,person2): 
  # Get the list of shared_items 
  si={} 
  for item in prefs[person1]: 
    if item in prefs[person2]: 
       si[item]=1 
  # if they have no ratings in common, return 0 
  if len(si)==0: return 0 
  # Add up the squares of all the differences 
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]])

This is the original code for ranking movie critics. I am trying to modify it to find similar posts based on tags. I built a map such as:

url1 - > tag1 tag2
url2 - > tag1 tag3

But if I apply this to the function,

pow(prefs[person1][item]-prefs[person2][item],2)

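this becomes 0, because the tags have no weights: identical tags both have rank 1. I modified the code to create the difference manually for testing,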

pow(prefs[1,2)

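Then I got a lot of 0.5 similarities, but the similarity of a post compared with itself dropped to 0.3. I can't think of a way to apply Euclidean distance to my situation. Is there one?

Solution

OK, first off, your code looks incomplete: I see only one return in your function. I think you mean something like this: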

def sim_distance(prefs, person1, person2): 
  # Get the list of shared_items
  p1, p2 = prefs[person1], prefs[person2]
  si = set(p1).intersection(set(p2))

  # Add up the squares of all the differences 
  matches = (p1[item] - p2[item] for item in si)
  return sum(a * a for a in matches)

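Next, your post needs some editing for clarity. I don't know what this means: "this becomes 0 cause tags do not have weight same tags has ranking 1"

Finally, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could say what you are getting and what you expect to get.

Edit: based on my comments below, I would use code like this: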

def sim_distance(prefs, person1, person2):
    p1, p2 = prefs[person1], prefs[person2]
    s, t = set(p1), set(p2)
    # Shared tags divided by all distinct tags; float() avoids integer division on Python 2
    return len(s.intersection(t)) / float(len(s.union(t)))
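
To make the idea concrete, here is a minimal sketch of the same intersection-over-union ratio (often called the Jaccard index) run on data shaped like the map in the question. The name tag_similarity, the sample prefs dict, and the choice to return 0.0 when neither post has tags are my assumptions, not part of the original code.

```python
def tag_similarity(prefs, post1, post2):
    """Intersection-over-union of the tag sets of two posts (0.0 to 1.0)."""
    s, t = set(prefs[post1]), set(prefs[post2])
    union = s | t
    if not union:
        # Assumption: two posts without any tags are treated as dissimilar.
        return 0.0
    return len(s & t) / float(len(union))


# Hypothetical data shaped like the url -> tags map in the question.
prefs = {
    "url1": ["tag1", "tag2"],
    "url2": ["tag1", "tag3"],
}

print(tag_similarity(prefs, "url1", "url2"))  # 1 shared tag out of 3 distinct tags
print(tag_similarity(prefs, "url1", "url1"))  # identical tag sets
```

With this measure a post compared with itself always scores 1.0, which addresses the drop in self-similarity mentioned in the question.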

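Other tips

Basically, tags do not have weights and cannot be represented by numeric values. So you cannot define a distance between two tags.

If you want to find the similarity between two posts using their tags, I suggest you use the proportion of similar tags. For example, if you have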
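
A small sketch of why the book's approach breaks down here, assuming each tag is stored with a weight of 1 as the question describes. The helper name sum_of_squares and the sample data are hypothetical.

```python
def sum_of_squares(prefs, a, b):
    """Sum of squared weight differences over the items that a and b share."""
    shared = set(prefs[a]) & set(prefs[b])
    return sum((prefs[a][item] - prefs[b][item]) ** 2 for item in shared)


# Hypothetical data: every tag carries the same weight, 1.
prefs = {
    "url1": {"tag1": 1, "tag2": 1},
    "url2": {"tag1": 1, "tag3": 1},
    "url3": {"tag1": 1, "tag2": 1},
}

# Both pairs give 0, although url3 matches url1 exactly and url2 does not.
print(sum_of_squares(prefs, "url1", "url2"))
print(sum_of_squares(prefs, "url1", "url3"))
```

Because only shared tags are compared and every shared tag has the same weight, each difference is 0, so the score cannot tell a partial match from a full one.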

url1 -> tag1 tag2 tag3 tag4
url2 -> tag1 tag4 tag5 tag6

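then you have 2 similar tags, which gives 2 (similar tags) / 4 (total tags) = 0.5. I think this would be a good measure of similarity, as long as you have more than 2 tags per post.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow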
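
The arithmetic above can be sketched as follows. I am assuming that "total tags" means the number of tags on the post with more tags (4 in this example); the function name shared_tag_ratio is hypothetical. Dividing by the number of distinct tags across both posts instead would give 2 / 6 for the same data.

```python
def shared_tag_ratio(tags1, tags2):
    """Shared tags divided by the tag count of the larger post (assumed denominator)."""
    s, t = set(tags1), set(tags2)
    largest = max(len(s), len(t))
    if largest == 0:
        # Assumption: posts without tags have no measurable similarity.
        return 0.0
    return len(s & t) / float(largest)


url1 = ["tag1", "tag2", "tag3", "tag4"]
url2 = ["tag1", "tag4", "tag5", "tag6"]

print(shared_tag_ratio(url1, url2))  # 2 shared tags / 4 tags per post = 0.5
```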