Question

The structure of my tree is simple: its depth is two, every child node is a direct child of the root, and every node except the root has a weight. Is there a good way to measure the similarity of two such trees?
Here is the original question:
Suppose you have a data list about the books that you have read. The list contains keys and values like a hashtable: the keys are the book categories, and the values are the number of books you have read in each category. Every person has such a data list, and I want to compare the similarity of two users based on it. I know collaborative filtering can do this, but I am trying this approach and comparing it with CF.
So I treat the data list as a weighted tree. The categories are the child nodes, and the weight of each child node is the number of times that category appears among the user's books.
The similarity is like the similarity of two users in collaborative filtering: it's a single number.


Solution

This can be done using set operations.

I once implemented such a similarity metric in the Meta-CVS software, years ago. This was used to identify renamed files when importing snapshots to a branch. Of course, files can be renamed and edited between baselines, which means you cannot perform exact comparisons. But I digress.

Jaccard Index

First of all, two users can have completely different interests in books. Or they can have completely the same interest.

What you can do is compute the size of their combined set of interests, and represent the part they have in common as a fraction of the overall size.

Suppose that the interest sets were not weighted, but simply sets of categories without an associated weight. Similarity could then be expressed as the number of categories that the two users have in common, divided by the total number of categories. That is to say, the cardinality of the set intersection, divided by the cardinality of the set union.
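As a concrete illustration, here is a minimal Python sketch of the unweighted case (the category names are made up):

    # Unweighted Jaccard index: |A ∩ B| / |A ∪ B|.
    def jaccard(categories_a, categories_b):
        union = categories_a | categories_b
        if not union:                      # both sets empty: define similarity as 0
            return 0.0
        return len(categories_a & categories_b) / len(union)

    user_a = {"sci-fi", "history", "fantasy"}
    user_b = {"history", "fantasy", "cooking"}
    print(jaccard(user_a, user_b))         # 2 shared / 4 total = 0.5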

If weights are involved, you have to work them in somehow. Perhaps divide the total weight of the set intersection by the total weight of the union (watching out for division by zero).

As you can see, this metric yields 0.0 if the users have no categories in common, and 1.0 if they are interested in matching categories (regardless of the weight), so it is viable.
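Here is a sketch of one way to work the weights in: treat each user's list as a dict of category to book count, take the smaller of the two counts as a shared category's contribution to the intersection, and the larger as its contribution to the union. That min/max choice is an assumption on my part, not the only reasonable one.

    # Weighted Jaccard: total weight of intersection / total weight of union,
    # using min() for the shared weight and max() for the combined weight.
    def weighted_jaccard(a, b):
        categories = set(a) | set(b)
        inter = sum(min(a.get(c, 0), b.get(c, 0)) for c in categories)
        union = sum(max(a.get(c, 0), b.get(c, 0)) for c in categories)
        return inter / union if union else 0.0   # guard against division by zero

    user_a = {"sci-fi": 5, "history": 2}
    user_b = {"history": 4, "cooking": 1}
    print(weighted_jaccard(user_a, user_b))      # 2 / 10 = 0.2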

Cosine Similarity

Another way to define the similarity would be to treat this as a vector dot product (correlation). First, determine all of the categories that appear across the two users. Then form a vector for each user, containing the weight of every category (as a zero if the user doesn't have that category in their interest set).

Then you can calculate similarity by taking the dot product of these vectors, and dividing it by the product of their lengths: (A . B) / |A||B|

The length of a vector is the square root of the sum of the squares of the weights. (Again, watching out for division by zero.)

Under this similarity metric, if the users have no categories in common, the two vectors are perpendicular, and their dot product is zero. This metric is basically the cosine of the angle between the two vectors. The closer the cosine is to one, the more parallel the vectors are (the users' interests are "aligned in the same direction").
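Here is a minimal sketch of the cosine metric over the same kind of dict (missing categories count as zero, as described above):

    import math

    # Cosine similarity: (A . B) / (|A| |B|) over the two weight vectors.
    def cosine_similarity(a, b):
        categories = set(a) | set(b)
        dot = sum(a.get(c, 0) * b.get(c, 0) for c in categories)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:           # guard against division by zero
            return 0.0
        return dot / (norm_a * norm_b)

    user_a = {"sci-fi": 5, "history": 2}
    user_b = {"history": 4, "cooking": 1}
    print(cosine_similarity(user_a, user_b))     # ≈ 0.36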

Hope this gives you some ideas; but as you can see, this is open-ended.
