Question

I am trying to solve a problem that involves comparing a large number of word sequences, each containing an ordered selection of words drawn from a vocabulary of around 600+ words (very high dimensionality!), for similarity and then clustering them into distinct groupings. The solution needs to be as unsupervised as possible.

The data looks like

[Apple, Banana, Orange...]
[Apple, Banana, Grape...]
[Jelly, Anise, Orange...]
[Strawberry, Banana, Orange...]
...etc

The order of the words in each set matters ([Apple, Banana, Orange] is distinct from [Apple, Orange, Banana]).

The approach I have been using so far is to compute Levenshtein distance (limited by a distance threshold) in a Python script, with each word treated as a unique token, generate a similarity matrix from the distances, and feed that matrix into k-medoids in KNIME for the groupings.
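
A minimal sketch of that step (token-level Levenshtein between word lists, assembled into a pairwise distance matrix; the function names and tiny example below are illustrative, not my actual script):

```python
def levenshtein(a, b):
    """Edit distance between two sequences of words (each word is one symbol)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

sequences = [
    ["Apple", "Banana", "Orange"],
    ["Apple", "Banana", "Grape"],
    ["Jelly", "Anise", "Orange"],
    ["Strawberry", "Banana", "Orange"],
]

# Square distance matrix to hand off to k-medoids.
n = len(sequences)
dist = [[levenshtein(sequences[i], sequences[j]) for j in range(n)] for i in range(n)]
```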

My questions are:

  • Is Levenshtein the most appropriate distance metric to use for this problem?
  • Is mean/medoid prototype clustering the best way to go about the groupings?
  • I haven't yet put much thought into validating the choice for 'k' in the clustering. Would evaluating an SSE curve of the clustering be the best way to go about this?
  • Are there any flaws in my methodology?
  • As an extension to the solution in the future, given training data, would anyone happen to have any ideas for assigning probabilities to cluster assignments? For example, set 1 has an 80% chance of being in cluster 1, etc.

I hope my questions don't seem too silly or the answers painfully obvious; I'm relatively new to data mining.

Thanks!


Solution

Yes, Levenshtein is a very suitable way to do this. But if the sequences vary much in length, you might be better off normalising these distances by dividing by the sum of the sequence lengths -- otherwise you will find that observed distances tend to increase for pairs of long sequences whose "average distance" (in the sense of the average distance between corresponding k-length substrings, for some small k) is constant.

Example: The pair ([Apple, Banana], [Carrot, Banana]) could be said to have the same "average" distance as ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]) since every 2nd item matches in both, but the latter pair's raw Levenshtein distance will be twice as great.
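
To make that concrete, here is a small sketch of the normalisation, reusing the token-level levenshtein() function sketched in the question above (the function name and the division by combined length are illustrative choices, not the only way to normalise):

```python
def normalised_levenshtein(a, b):
    """Raw token-level Levenshtein divided by the combined length of the two sequences."""
    total = len(a) + len(b)
    return levenshtein(a, b) / total if total else 0.0

# The two pairs from the example: raw distances are 1 and 2,
# but both normalise to 0.25.
print(normalised_levenshtein(["Apple", "Banana"], ["Carrot", "Banana"]))
print(normalised_levenshtein(["Apple", "Banana", "Widget", "Xylophone"],
                             ["Carrot", "Banana", "Yam", "Xylophone"]))
```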

Also bear in mind that Levenshtein does not make special allowances for "block moves": if you take a string and move one of its substrings sufficiently far away, then the resulting pair (of original and modified strings) will have the same Levenshtein score as if the 2nd string had completely different elements at the position where the substring was moved to. If you want to take this into account, consider using a compression-based distance instead. (Although I say there that it's useful for computing distances without respect to order, it does of course favour ordered similarity over disordered similarity.)
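
One concrete option (my assumption about which compression-based distance you might pick) is the normalised compression distance (NCD); a rough sketch using zlib as the compressor:

```python
import zlib

def compressed_size(words):
    # "\x00" is an arbitrary separator assumed not to appear inside any word.
    return len(zlib.compress("\x00".join(words).encode("utf-8")))

def ncd(a, b):
    """Normalised compression distance between two word sequences."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

print(ncd(["Apple", "Banana", "Orange"], ["Apple", "Banana", "Grape"]))
```

Because the compressor rewards repeated substrings wherever they occur in the concatenation, a moved block still contributes to similarity, unlike with plain Levenshtein.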

OTHER TIPS

Check out SimMetrics on SourceForge, a library supporting a variety of similarity metrics that you can use to evaluate which works best for your task.

For a commercially supported version, check out K-Similarity from K-Now.co.uk.
