Is there an efficient implementation for quantifying the similarity between two Strings? [closed]

StackOverflow https://stackoverflow.com/questions/23381939

Question

Let's say I have several very long Strings consisting of completely random characters. I want to express how similar each one is to a designated master String as a single number.

For example: 12345 is very similar to 23456, but not so similar to 12abcdef.

Assuming Java, is there already an efficient implementation of such an algorithm? For example, I think this would probably do what I want: https://en.wikipedia.org/wiki/Levenshtein_distance, but I need something very efficient for super-long Strings.


Solution 2

I am not sure whether there is a ready-made Java library for it, but you can find a Java implementation of the algorithm here:

http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java

good luck :)
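In case the link goes stale, the dynamic-programming idea behind Levenshtein distance can be sketched roughly as follows. This is only a minimal sketch, not the code from the linked page; the class name LevenshteinSketch and both method names are made up for illustration. It keeps just two rows of the table, so memory grows with the shorter string, while running time is still proportional to the product of the two lengths.

```java
// Illustrative sketch; LevenshteinSketch is a hypothetical name, not a library class.
final class LevenshteinSketch {

    /** Edit distance with two rolling rows: O(n*m) time, O(min(n, m)) extra memory. */
    static int distance(CharSequence a, CharSequence b) {
        // Make b the shorter input so the rows are as small as possible.
        if (a.length() < b.length()) {
            CharSequence tmp = a;
            a = b;
            b = tmp;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Normalised similarity in [0, 1]; 1.0 means the inputs are identical. */
    static double similarity(CharSequence a, CharSequence b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }
}
```

With the strings from the question, distance("12345", "23456") is 2 and distance("12345", "12abcdef") is 6, so the first pair gets the higher normalised similarity.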

OTHER TIPS

The standard way is Levenshtein distance.

There is an implementation in Apache commons-lang: StringUtils.getLevenshteinDistance()
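For super-long Strings the full computation is quadratic, so it can pay to cap the distance you care about. As far as I know, commons-lang3 also offers an overload of getLevenshteinDistance that takes a threshold and returns -1 once the distance exceeds it. The idea can be sketched without the library as below; BoundedLevenshteinSketch and distanceAtMost are hypothetical names, and this is a sketch of the banded technique, not the library's code. Only cells within k of the diagonal are filled in, so the work is roughly k times the string length.

```java
// Illustrative sketch; BoundedLevenshteinSketch is a hypothetical name, not a library class.
final class BoundedLevenshteinSketch {

    /** Returns the edit distance if it is at most k, otherwise -1. */
    static int distanceAtMost(CharSequence a, CharSequence b, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be >= 0");
        }
        int n = a.length();
        int m = b.length();
        if (Math.abs(n - m) > k) {
            return -1; // the length gap alone already exceeds the bound
        }
        k = Math.min(k, Math.max(n, m)); // the distance never exceeds the longer length

        final int INF = Integer.MAX_VALUE / 2; // "outside the band", safe to add 1 to
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = (j <= k) ? j : INF;
        }
        for (int i = 1; i <= n; i++) {
            // Only columns with |i - j| <= k can hold a value <= k.
            int lo = Math.max(1, i - k);
            int hi = Math.min(m, i + k);
            curr[lo - 1] = (lo == 1) ? i : INF;
            int rowMin = curr[lo - 1];
            for (int j = lo; j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(prev[j - 1] + cost,
                                    Math.min(prev[j], curr[j - 1]) + 1);
                curr[j] = best;
                rowMin = Math.min(rowMin, best);
            }
            if (hi < m) {
                curr[hi + 1] = INF; // overwrite the stale cell the next row will read
            }
            if (rowMin > k) {
                return -1; // every path through this row is already too expensive
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[m] <= k ? prev[m] : -1;
    }
}
```

A small k makes this close to linear in the String length, at the price of only learning "more than k apart" for dissimilar inputs.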

"Efficient" is unfortunately imprecise. Efficient in terms of what? Time? Memory? And with regard to what "quality" of similarity measure?

Ask yourself first what kind of similarity you want, for what purpose, and which kinds of permutation or replacement are allowed. Then you will be able to search for a "*-efficient" algorithm that computes the metric suited to your needs.

You can start with this paper or this post to see the differences, or search for "string similarity metrics" on Google.
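To make the trade-off concrete, here is one example of a different metric: a Sørensen-Dice coefficient over character bigrams. It is my own pick for illustration, not something taken from the paper or post mentioned above, and BigramSimilaritySketch and dice are hypothetical names. It runs in roughly linear time, but it only counts shared two-character pieces and largely ignores where they occur, so it answers a different question than edit distance does.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; BigramSimilaritySketch is a hypothetical name, not a library class.
final class BigramSimilaritySketch {

    /** Dice coefficient over character bigrams, counted as a multiset; result in [0, 1]. */
    static double dice(CharSequence a, CharSequence b) {
        // Inputs shorter than two characters have no bigrams; fall back to equality.
        if (a.length() < 2 || b.length() < 2) {
            return a.toString().contentEquals(b) ? 1.0 : 0.0;
        }
        // Count every bigram of a.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < a.length(); i++) {
            counts.merge(a.subSequence(i, i + 2).toString(), 1, Integer::sum);
        }
        // Consume matching bigrams of b so repeats are not counted twice.
        int shared = 0;
        for (int i = 0; i + 1 < b.length(); i++) {
            String gram = b.subSequence(i, i + 2).toString();
            Integer left = counts.get(gram);
            if (left != null && left > 0) {
                counts.put(gram, left - 1);
                shared++;
            }
        }
        int totalBigrams = (a.length() - 1) + (b.length() - 1);
        return 2.0 * shared / totalBigrams;
    }
}
```

For the question's example, "12345" and "23456" share three of their four bigrams each and score 0.75, while "12345" and "12abcdef" share only "12". Whether that ranking is the one you want depends on the purpose, which is the point made above.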

Googling seems to come up with lots of potential solutions for you. For example, you can try this one:

https://github.com/joewandy/BioinfoApp/blob/master/src/com/joewandy/bioinfoapp/model/stringDistance/LevenshteinDistance.java

Licensed under: CC-BY-SA with attribution