Is there an efficient implementation for quantifying the similarity between two Strings? [closed]

StackOverflow https://stackoverflow.com/questions/23381939

Let's say I have several very long Strings consisting of completely random characters. I want to express their similarity to one designated master String as a single number.

For example: 12345 is very similar to 23456, but not so similar to 12abcdef.

Assuming Java, is there already an efficient implementation of such an algorithm? For example, I think this would probably do what I want: https://en.wikipedia.org/wiki/Levenshtein_distance but I need something very efficient for super-long Strings.


Solution 2

I am not sure if there is a ready-made Java implementation for it, but you can find an implementation of the algorithm here:

http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java

Good luck :)

Other tips

The standard way is Levenshtein distance.

There is an implementation in Apache commons-lang: StringUtils.getLevenshteinDistance()
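
For illustration only, here is a minimal sketch of how the distance can be computed while keeping just two rows of the dynamic-programming table, so memory grows with the shorter String rather than with the product of both lengths. The class name LevenshteinSketch and the similarity() normalization (1 minus distance divided by the longer length) are my own assumptions, not part of commons-lang. Time is still proportional to the product of the two lengths, which may be too slow for truly huge inputs.

```java
// Hypothetical helper class; the name and method signatures are illustrative only.
final class LevenshteinSketch {

    // Edit distance with two rolling rows:
    // O(|a| * |b|) time, O(min(|a|, |b|)) extra memory.
    static int distance(CharSequence a, CharSequence b) {
        // Keep the shorter sequence in the inner dimension so the rows stay small.
        if (a.length() < b.length()) {
            CharSequence tmp = a;
            a = b;
            b = tmp;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b[0..j) from an empty prefix takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing a[0..i) to an empty prefix takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            // Reuse the two arrays instead of allocating a new row each time.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Assumed normalization: maps the distance to [0, 1], where 1.0 means identical.
    static double similarity(CharSequence master, CharSequence other) {
        int longest = Math.max(master.length(), other.length());
        if (longest == 0) {
            return 1.0; // two empty Strings are treated as identical
        }
        return 1.0 - (double) distance(master, other) / longest;
    }
}
```

With the examples from the question, LevenshteinSketch.similarity("12345", "23456") gives 0.6 (distance 2), while LevenshteinSketch.similarity("12345", "12abcdef") gives 0.25 (distance 6).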

"Efficient" is unfortunately imprecise. Efficient in terms of what? Time? Memory? And with regard to what "quality" of similarity measure?

Ask yourself first what kind of similarity you want, for what purpose, and which kinds of permutations/replacements are allowed. Then you will be able to search for a "*-efficient" algorithm that computes the metric suited to your needs.

You can start with this paper or this post to see the differences, or search for "string similarity metrics" on Google.
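
To make the trade-off concrete, here is a rough sketch of one alternative metric: Jaccard similarity over the sets of q-grams (substrings of length q). I picked it only as an example; it is not what the linked resources necessarily recommend. The class name QGramSketch and the parameter q are hypothetical. It runs in roughly linear expected time in the String lengths, but it ignores where the q-grams occur and how often, so it answers a different question than Levenshtein distance does.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; names and the choice of metric are illustrative only.
final class QGramSketch {

    // Collects the distinct substrings of length q that occur in s.
    static Set<String> qGrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + q <= s.length(); i++) {
            grams.add(s.substring(i, i + q));
        }
        return grams;
    }

    // Jaccard similarity: |intersection| / |union| of the two q-gram sets, in [0, 1].
    static double jaccard(String a, String b, int q) {
        Set<String> ga = qGrams(a, q);
        Set<String> gb = qGrams(b, q);
        if (ga.isEmpty() && gb.isEmpty()) {
            return 1.0; // assumption: two inputs without any q-grams count as identical
        }
        Set<String> intersection = new HashSet<>(ga);
        intersection.retainAll(gb);
        int unionSize = ga.size() + gb.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

With q = 2, QGramSketch.jaccard("12345", "23456", 2) gives 0.6 and QGramSketch.jaccard("12345", "12abcdef", 2) gives 0.1, so the ranking matches the question's examples. For completely random characters a larger q may separate the Strings better; that is something to test on your own data.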

Googling seems to come up with lots of potential solutions for you. For example, you can try this one:

https://github.com/joewandy/BioinfoApp/blob/master/src/com/joewandy/bioinfoapp/model/stringDistance/LevenshteinDistance.java

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow