Dynamic time warping to compare two audio recordings

https://stackoverflow.com/questions/2168027

24-09-2019

Question

I'd like to use Dynamic Time Warping (DTW) to compare the feature vectors of two audio recordings (I am of course doing all the necessary preprocessing first). My program should output the similarity between the two recordings as a percentage: 100% means the recordings are completely identical, and the more they differ, the lower the number. How do I go about this? DTW only gives me the length of the path or the cost of the transitions, and I don't know how to convert either of these numbers to a percentage.

Solution

I'm not aware of any distance metric between signals that is expressed as a percentage. If 100% has a meaning, then 0% must have one too. So first you need to ask yourself: what does 0% mean?

For DTW, I'm pretty sure there is no established conversion from the minimum distance to a "percent match". If you must have one, you need to define a heuristic quantity that is a function of the minimum DTW distance.
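
As a rough illustration of such a heuristic, here is a minimal sketch that maps a non-negative DTW distance to a percentage with an exponential decay. The function name `percent_from_distance` and the `scale` parameter are hypothetical, and the exponential form is only one of many possible choices, not an established convention. You would have to tune `scale` on your own data to decide what counts as "half similar".

```python
import math

def percent_from_distance(dtw_distance, scale):
    """Map a non-negative DTW distance to a score in (0, 100].

    `scale` is a hypothetical tuning constant: a distance equal to
    `scale` yields roughly 36.8% (100 / e).
    """
    if dtw_distance < 0 or scale <= 0:
        raise ValueError("distance must be >= 0 and scale must be > 0")
    return 100.0 * math.exp(-dtw_distance / scale)

# A distance of zero maps to a perfect score; larger distances score lower.
print(percent_from_distance(0.0, scale=10.0))
print(percent_from_distance(5.0, scale=10.0))
```

Note that this mapping never reaches exactly 0%, which reflects the problem above: without a defined worst case, 0% has no natural meaning.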

EDIT: Actually, for two finite-length recordings you could define a sort of longest distance. That would be the cost of a path that (looking at the cost matrix) goes all the way right and then down, or all the way down and then right. The best path, i.e. a perfect match, runs down the main diagonal.
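
One way to read this idea, sketched below under several assumptions: compute the DTW cost with the steps (0,1), (1,0), (1,1), compute the cost of the two edge paths, and scale the DTW cost against the larger edge cost. The function names and the Euclidean local cost are my own choices for illustration. Because an edge path is itself a valid warping path, the optimal DTW cost can never exceed it, so the result stays within 0-100%.

```python
import numpy as np

def local_cost(x, y):
    """Pairwise Euclidean distance between frames of x and y (assumed local cost)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x.reshape(len(x), -1)
    y = y.reshape(len(y), -1)
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

def dtw_cost(cost):
    """Minimum accumulated cost using the steps (0,1), (1,0) and (1,1)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    return acc[n, m]

def edge_path_cost(cost):
    """Cost of the more expensive of the two paths that hug the matrix edges."""
    right_then_down = cost[0, :].sum() + cost[1:, -1].sum()
    down_then_right = cost[:, 0].sum() + cost[-1, 1:].sum()
    return max(right_then_down, down_then_right)

def percent_match(x, y):
    """Heuristic score: 100% when the DTW cost is 0, 0% when it equals the edge cost."""
    cost = local_cost(x, y)
    worst = edge_path_cost(cost)
    if worst == 0:  # every frame pair is identical
        return 100.0
    return 100.0 * (1.0 - dtw_cost(cost) / worst)

print(percent_match([1, 2, 3], [1, 2, 3]))
print(percent_match([1, 2, 3], [3, 2, 1]))
```

The edge cost is an upper bound on the optimal cost rather than the true maximum over all paths, so treat the resulting percentage as a relative score, not as a calibrated measure.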

One simple idea: if you use (0,1), (1,0) and (1,1) as step candidates, you could count the (0,1) and (1,0) steps taken along the optimal path as a measure of badness. This count certainly has a maximum and a minimum, so it can be mapped to some desirable range like 0-100%.
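
The step-counting idea could be sketched as follows. This is an assumption-laden illustration, not a standard method: the function names are hypothetical, the local cost is Euclidean, and ties during backtracking are broken in favour of the diagonal step. For sequences of length n and m, any path needs at least |n - m| non-diagonal steps and can use at most (n - 1) + (m - 1), which gives the two ends of the scale.

```python
import numpy as np

def count_non_diagonal_steps(x, y):
    """Count (0,1) and (1,0) steps on one optimal DTW path (ties prefer the diagonal)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x.reshape(len(x), -1)
    y = y.reshape(len(y), -1)
    n, m = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # Accumulated cost with an infinite border so the path stays inside the matrix.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the end cell to the start cell, counting non-diagonal moves.
    i, j = n, m
    steps = 0
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1, 0),  # diagonal step (1,1)
            (acc[i - 1, j], i - 1, j, 1),          # step along the first sequence
            (acc[i, j - 1], i, j - 1, 1),          # step along the second sequence
        ]
        _, i, j, non_diagonal = min(candidates, key=lambda c: c[0])
        steps += non_diagonal
    return steps

def step_based_percent(x, y):
    """Map the non-diagonal step count linearly onto 0-100%."""
    n, m = len(x), len(y)
    k_min = abs(n - m)
    k_max = (n - 1) + (m - 1)
    if k_max == k_min:  # one sequence has a single frame, so only one path exists
        return 100.0
    k = count_non_diagonal_steps(x, y)
    return 100.0 * (1.0 - (k - k_min) / (k_max - k_min))

print(step_based_percent([1, 2, 3, 4], [1, 2, 3, 4]))
print(step_based_percent([0, 0, 0, 1], [0, 1, 1, 1]))
```

Keep in mind that this score only reflects how much warping was needed and ignores the remaining cost along the path, so two recordings that differ in content but not in timing can still score highly. The optimal path is also not always unique, so the count depends on the tie-breaking rule.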

Licensed under: CC-BY-SA with attribution
