Is there an efficient implementation for quantifying the similarity between two Strings? [closed]

https://stackoverflow.com/questions/23381939

12-07-2023

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 9 years ago.


Let's say I have several very long Strings consisting of completely random characters. I aim to represent their similarity to one designated master String in a number.

For example: 12345 is very similar to 23456, but not so similar to 12abcdef.

Assuming Java, is there already an efficient implementation of such an algorithm? For example, I think this would probably do what I want: https://en.wikipedia.org/wiki/Levenshtein_distance but I need something very efficient for super-long Strings.

Solution 2

I am not sure whether there is a ready-made Java library for it, but you can find a Java implementation of the algorithm here:

http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java

Good luck :)

Other tips

The standard way is Levenshtein distance.

There is an implementation in Apache commons-lang: StringUtils.getLevenshteinDistance()
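For readers who would rather not add a dependency, here is a minimal sketch of the same metric in plain Java. It is not the commons-lang code; the class and method names (`LevenshteinSketch`, `distance`, `similarity`) are made up for illustration. It keeps only two rows of the dynamic-programming table, so memory stays proportional to the shorter String, although the running time is still proportional to the product of the two lengths. The `similarity` helper is one assumed way to turn the distance into a single number between 0 and 1.

```java
// Hypothetical helper class for illustration; not part of any library.
final class LevenshteinSketch {

    // Levenshtein distance with two rolling rows:
    // O(n * m) time, O(min(n, m)) extra memory.
    static int distance(CharSequence a, CharSequence b) {
        // Keep the shorter sequence on the row axis so the rows stay small.
        if (a.length() < b.length()) {
            CharSequence tmp = a;
            a = b;
            b = tmp;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap the rows instead of allocating new ones.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(CharSequence a, CharSequence b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

With the Strings from the question, `LevenshteinSketch.distance("12345", "23456")` is 2 and `LevenshteinSketch.distance("12345", "12abcdef")` is 6, which matches the intuition that the first pair is much closer. For very long Strings the quadratic running time remains the bottleneck, whichever implementation is used.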

"Efficient" is unfortunately imprecise. Efficient in terms of what? Time? Memory? And with regard to what "quality" of similarity measure?

Ask yourself first what kind of similarity you want, for what purpose, and which kinds of permutations or replacements are allowed. Then you will be able to search for a "*-efficient" algorithm that computes the metric best suited to your needs.
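To illustrate how much the choice of metric matters, here is a rough sketch of a different measure: Jaccard similarity over character bigrams. This is not Levenshtein distance and is only one of many options. It ignores where in the String a bigram occurs and how often, but it runs in roughly linear time, which may matter for very long Strings. The class and method names (`BigramSimilaritySketch`, `bigrams`, `jaccard`) are invented for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration; not part of any library.
final class BigramSimilaritySketch {

    // Collects the distinct two-character substrings of s.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard similarity: |intersection| / |union| of the two bigram sets.
    // Returns a value in [0, 1]; position and repetition are ignored.
    static double jaccard(String a, String b) {
        Set<String> gramsA = bigrams(a);
        Set<String> gramsB = bigrams(b);
        if (gramsA.isEmpty() && gramsB.isEmpty()) {
            return 1.0; // assumed convention: nothing to compare counts as identical
        }
        Set<String> intersection = new HashSet<>(gramsA);
        intersection.retainAll(gramsB);
        int unionSize = gramsA.size() + gramsB.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

For the Strings in the question, `jaccard("12345", "23456")` shares 3 of 5 distinct bigrams (0.6), while `jaccard("12345", "12abcdef")` shares 1 of 10 (0.1). Whether such an order-insensitive measure is acceptable depends on the purpose, which is exactly the question to settle first.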

You can start with this paper or this post to see the differences, or search for "string similarity metrics" on Google.

Googling seems to come up with lots of potential solutions for you. For example, you can try this one:

https://github.com/joewandy/BioinfoApp/blob/master/src/com/joewandy/bioinfoapp/model/stringDistance/LevenshteinDistance.java

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow