Question

I have various length strings which are full of Base64 chars. Actualy they are audio recognition datas differs by song-to-song.

For easily comparing parts of those strings i divide them into 16-char sub-strings. (which is about 1 second of a song) But in some cases, i just can't compare these ones head to head.. i should be measuring them.

For example comparison with 'hellohellohelloo' and 'hallohellohelloo' should get a closer value then 'hellohellohelloo' and 'herehellohelloo' comparison.

Is there any algorithm or theorical


Edit: Sorry, i am new here :) And i couldn't make myself clear. Here are some comments that will make me clear and proposes an idea.

Comment 1:

Actually i know about Levenshtein distance, but the problem is every time i compare two strings i have to build comparison matrix and that makes searching process slow. If i can convert for example hello to 4444 and hallo to 4443 i can determine how close records i have for 'hello' by just indexing numerical values.

Comment 2:

Maybe i should determine a base constant-length string(s) and store distance values from them as the index values for string. It's just an idea?!

Was it helpful?

Solution

Levenshtein's distance will probably help you : http://en.wikipedia.org/wiki/Levenshtein_distance

It's usually pretty fast, and there are implementations in most modern languages too.

OTHER TIPS

The Levenshtein distance might work for you. Also see the Wikipedia overview of edit distance.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top