Question

I am using Mahout's recommenditembased algorithm. What are the differences between the available --similarity classes, and how do I know which is the best choice for my application? These are my choices:

SIMILARITY_COOCCURRENCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION       
SIMILARITY_EUCLIDEAN_DISTANCE

What does each one mean?

Solution

I'm not familiar with all of them, but I can help with some.

Co-occurrence is how often two items occur with the same user, i.e. the number of users who have interacted with both items. http://en.wikipedia.org/wiki/Co-occurrence
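
To make the counting concrete, here is a minimal sketch in plain Python. It is not Mahout code; the `user_items` data and the `cooccurrence_counts` helper are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: user -> set of items the user interacted with.
user_items = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B"},
    "carol": {"B", "C"},
}

def cooccurrence_counts(user_items):
    """Count, for every pair of items, how many users have both."""
    counts = Counter()
    for items in user_items.values():
        # sorted() gives each pair a canonical order, e.g. ("A", "B").
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts

counts = cooccurrence_counts(user_items)
```

Here items A and B share two users (alice and bob), while A and C share only one (alice).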

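Log-likelihood is, in general, the log of the probability of the observed data under some model. As an item similarity it is normally used as a log-likelihood ratio: a measure of how unlikely it is that two items occur together as often as they do purely by chance. http://en.wikipedia.org/wiki/Log-likelihood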
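
As a rough illustration, the log-likelihood ratio for a pair of items can be computed from a 2x2 table of user counts. The sketch below uses the textbook G-test form of the statistic; the function name and argument layout are my own, and Mahout's implementation may differ in its details.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of user counts.

    k11: users with both items     k12: users with item A only
    k21: users with item B only    k22: users with neither item
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for observed, i, j in cells:
        if observed > 0:  # a zero cell contributes nothing (0 * log 0 = 0)
            # Count expected in this cell if the two items were independent.
            expected = rows[i] * cols[j] / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

A table whose counts match what independence predicts, such as (10, 10, 10, 10), scores 0. The more the items co-occur beyond chance, the larger the score.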

I'm not sure about Tanimoto.

City block is the distance between two instances if you assume you can only move along a grid, as in a checkerboard-style city. http://en.wikipedia.org/wiki/Taxicab_geometry

Cosine similarity is the cosine of the angle between the two feature vectors. http://en.wikipedia.org/wiki/Cosine_similarity

Pearson correlation is the covariance of two features normalized by the product of their standard deviations. http://en.wikipedia.org/wiki/Pearson_correlation_coefficient

Euclidean distance is the standard straight-line distance between two points. http://en.wikipedia.org/wiki/Euclidean_distance
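
The four vector-based measures (city block, cosine, Pearson, Euclidean) can be sketched in a few lines of plain Python. This is only an illustration of the textbook definitions, with function names of my own choosing. Note that city block and Euclidean are distances (smaller means more alike), so a library has to convert them into a similarity in some way that is not shown here.

```python
import math

def city_block(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

# Hypothetical rating vectors for two items over the same three users.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
```

For these vectors `y` is exactly twice `x`, so cosine and Pearson both report a perfect match even though the two distances are non-zero. Which of these behaviours you want depends on whether the scale of the ratings matters to you.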

To determine which is best for your application, you most likely need some intuition about your data and what it means. If your data has continuous-valued features, then something like Euclidean distance or Pearson correlation makes sense. If you have more discrete values, then something along the lines of city block or cosine similarity may make more sense.

Another option is to set up a cross-validation experiment where you see how well each similarity metric works to predict the desired output values and select the metric that works the best from the cross-validation results.
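
Such an experiment could look roughly like the sketch below. It uses leave-one-out, a simple form of cross-validation: hide one rating at a time, predict it from the user's other ratings weighted by item similarity, and compare the mean absolute error of each metric. The toy `ratings` data, the helper names, and the 1 / (1 + distance) conversion for Euclidean distance are all assumptions made for illustration, not Mahout's evaluation code.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}.
ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0, "C": 2.0},
    "u3": {"A": 1.0, "B": 2.0, "C": 5.0},
    "u4": {"A": 2.0, "B": 1.0, "C": 4.0},
}

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def euclidean_sim(x, y):
    # Assumed conversion from a distance to a similarity in (0, 1].
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))

def item_vectors(train, item_a, item_b):
    """Ratings of both items from the users who rated both."""
    users = [u for u, r in train.items() if item_a in r and item_b in r]
    return [train[u][item_a] for u in users], [train[u][item_b] for u in users]

def predict(train, user, item, sim):
    """Similarity-weighted average of the user's other ratings."""
    num = den = 0.0
    for other_item, rating in train[user].items():
        x, y = item_vectors(train, item, other_item)
        if x:
            weight = sim(x, y)
            num += weight * rating
            den += abs(weight)
    return num / den if den else None

def leave_one_out_mae(ratings, sim):
    """Mean absolute error when each rating is hidden and predicted in turn."""
    errors = []
    for user, user_ratings in ratings.items():
        for item, actual in user_ratings.items():
            # Copy the data with this single rating removed.
            train = {u: dict(r) for u, r in ratings.items()}
            del train[user][item]
            predicted = predict(train, user, item, sim)
            if predicted is not None:
                errors.append(abs(predicted - actual))
    return sum(errors) / len(errors)

results = {
    "cosine": leave_one_out_mae(ratings, cosine_sim),
    "euclidean": leave_one_out_mae(ratings, euclidean_sim),
}
best = min(results, key=results.get)
```

On real data you would use proper train/test splits and more data than this, but the shape of the comparison is the same: one error score per similarity metric, and you keep the metric with the lowest score.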

OTHER TIPS

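Tanimoto and Jaccard are essentially the same measure: a statistic used for comparing the similarity and diversity of sample sets. For two sets, it is the size of their intersection divided by the size of their union.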

https://en.wikipedia.org/wiki/Jaccard_index
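
For item similarity, the two sets are usually the sets of users who interacted with each item. A minimal sketch in plain Python (the function name and the example users are made up):

```python
def tanimoto(users_a, users_b):
    """Jaccard/Tanimoto coefficient: |intersection| / |union| of two sets."""
    union = len(users_a | users_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty sets
    return len(users_a & users_b) / union

# Hypothetical sets of users who interacted with each item.
item_a_users = {"alice", "bob"}
item_b_users = {"bob", "carol"}
score = tanimoto(item_a_users, item_b_users)  # 1 shared user out of 3 distinct users
```

Unlike a raw co-occurrence count, the result is always between 0 and 1, so very popular items do not automatically look similar to everything.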

Licensed under: CC-BY-SA with attribution