Similarity Threshold Standards

https://datascience.stackexchange.com/questions/86560

17-12-2020

Question

When using similarity measures (e.g. Resnik Information Content, Cosine Similarity) for any type of data, are there any standard similarity thresholds that are used, or does it all depend on the situation? A similarity threshold would be the value X in [0,1] such that all pairs with a similarity score greater than X are "connected" while those with a similarity score below X are not.

Also, are low similarity thresholds (~0.15) acceptable when higher thresholds simply do not produce enough "connected" pairs and a low threshold still works well in practice?

Solution

I don't think there's any standard, but there might be some exceptions in very specific cases where the distribution of the scores is known precisely.

There's no standard because, in general, the optimal value of the threshold strongly depends on the task and the data. That's why thresholds are usually determined empirically based on the desired outcome. In other words, a threshold can be seen as a hyperparameter: its optimal value can be found by maximizing the performance of the target task on a training set (or validation set).
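
To illustrate treating the threshold as a hyperparameter, here is a minimal sketch in plain Python. It assumes you have a small validation set of pairs with a similarity score and a 0/1 label saying whether the pair should be "connected", and it assumes F1 is the performance measure you care about; both are illustrative choices, not something prescribed by the question. The function names (`f1_at_threshold`, `pick_threshold`) and the data are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule 'connected if score > threshold' against 0/1 labels."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y == 1)
    fp = sum(1 for s, y in pairs if s > threshold and y == 0)
    fn = sum(1 for s, y in pairs if s <= threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(100)]  # 0.00, 0.01, ..., 0.99
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Made-up validation data (assumption): similarity scores and gold labels.
val_scores = [0.05, 0.10, 0.18, 0.22, 0.30, 0.45, 0.60]
val_labels = [0, 0, 1, 0, 1, 1, 1]

best = pick_threshold(val_scores, val_labels)
print(best, f1_at_threshold(val_scores, val_labels, best))
```

On this made-up data the selected threshold lands between 0.10 and 0.18, which shows that a low threshold can be a perfectly reasonable choice when the scores of the relevant pairs are themselves low. To avoid an optimistic estimate, the chosen value should then be checked on data that was not used to select it.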

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange