Question

I have some strings and I want a measure of their similarity that differs from, say, string edit distance: it should be based on structural similarity rather than on letter-by-letter similarity.

For example, 312164 and 48479 should get a very high score, since both consist only of digits and have nearly the same length. The same should hold for Bla blubb and bla bloob blo, because both contain only letters with gaps in between. Pairs like apple and app3 f should score lower: they share some letters but have a different structure.

Something like that... Does anybody have a clue? In Java, if possible.

Thank you!

Solution

Define the kinds of similarity you care about and assign a score to each.

Example strings:

Banana

Orange

Orange 123

Banana 234

Length = x points, where x is the length

Same character = 1 point (case-sensitive: A != a)

Same position for a shared character = 2 points

Deduct 1 point for each character that is unique to one string

E.g., compare Banana with Orange:

Length = 6 points (Both are 6 in length)

For 'a' = 1 point (both have an a). If both had two a's, we would give 2 points. We would give another 2 points if 'a' were in the same position in both strings.

For 'n' = 1 point

Total plus points: 8 (6 + 1 + 1)

Minus points:

1 for B since Orange doesn't have a B

2 for 'a' since Banana has three a's and Orange has only one

1 for 'n' since Banana has two n's and Orange has only one

1 for O

1 for r

1 for g

1 for e

Total minus points: 8

Total plus points - total minus points = 8 - 8 = 0
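The scoring rules above could be sketched in Java roughly as follows. This is only one reading of the rules, not a definitive implementation: the class name RoughSimilarity and the method score are made up for illustration, and the sketch assumes that length points are only awarded when both strings have the same length (the answer does not say what happens otherwise).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RoughSimilarity {

    // Counts how often each character occurs (case-sensitive, so 'A' != 'a').
    private static Map<Character, Integer> counts(String s) {
        Map<Character, Integer> m = new HashMap<>();
        for (char c : s.toCharArray()) {
            m.merge(c, 1, Integer::sum);
        }
        return m;
    }

    public static int score(String a, String b) {
        int plus = 0;
        int minus = 0;

        // Assumption: length points are only awarded when both lengths match.
        if (a.length() == b.length()) {
            plus += a.length();
        }

        Map<Character, Integer> ca = counts(a);
        Map<Character, Integer> cb = counts(b);
        Set<Character> all = new HashSet<>(ca.keySet());
        all.addAll(cb.keySet());
        for (char c : all) {
            int na = ca.getOrDefault(c, 0);
            int nb = cb.getOrDefault(c, 0);
            plus += Math.min(na, nb);   // 1 point per shared occurrence
            minus += Math.abs(na - nb); // 1 point off per unmatched occurrence
        }

        // 2 extra points for every index where both strings hold the same character.
        int common = Math.min(a.length(), b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                plus += 2;
            }
        }
        return plus - minus;
    }

    public static void main(String[] args) {
        // Banana vs. Orange: 8 plus points, 8 minus points
        System.out.println(score("Banana", "Orange")); // prints 0
    }
}
```

With this reading, two identical strings get the highest score for their length (for example, Banana against itself gives 6 + 6 + 12 = 24), and the score is not normalized, so longer strings can reach larger values in either direction.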

This is just rough logic, but you can derive something from it.
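One possible derivation, closer to the structural comparison the question asks about, is to compare character classes instead of the characters themselves. The sketch below is an assumption, not part of the original answer: the names StructureSimilarity, CharClass and similarity are hypothetical, and it uses the Jaccard similarity of the sets of character classes (digit, letter, whitespace, other) that occur in each string. It deliberately ignores length and ordering, which you may want to add as further scoring terms.

```java
import java.util.EnumSet;

public class StructureSimilarity {

    // Hypothetical character classes; extend as needed.
    public enum CharClass { DIGIT, LETTER, SPACE, OTHER }

    public static CharClass classify(char c) {
        if (Character.isDigit(c)) return CharClass.DIGIT;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isWhitespace(c)) return CharClass.SPACE;
        return CharClass.OTHER;
    }

    // Collects the set of character classes that occur in the string.
    public static EnumSet<CharClass> classesOf(String s) {
        EnumSet<CharClass> set = EnumSet.noneOf(CharClass.class);
        for (char c : s.toCharArray()) {
            set.add(classify(c));
        }
        return set;
    }

    // Jaccard similarity of the two class sets: |A and B| / |A or B|, in [0, 1].
    public static double similarity(String a, String b) {
        EnumSet<CharClass> union = classesOf(a);
        union.addAll(classesOf(b));
        EnumSet<CharClass> intersection = classesOf(a);
        intersection.retainAll(classesOf(b));
        if (union.isEmpty()) {
            return 1.0; // both strings are empty
        }
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(similarity("312164", "48479"));
        System.out.println(similarity("Bla blubb", "bla bloob blo"));
        System.out.println(similarity("apple", "app3 f"));
    }
}
```

For the examples from the question, the first two pairs share exactly the same classes and score 1.0, while apple and app3 f share only one of three classes and score about 0.33.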

Licensed under: CC-BY-SA with attribution