Question

My objective is to classify non-speech signals, for which I am using MFCC and DTW in Java. However, I am stuck in the middle, and I would appreciate any help. I have computed 13 MFCC values for each frame, but some values are negative, and I am unsure whether the process I am following is right or wrong. Currently I am using the code provided by JAudio. I have also tried other code, and it gives me negative values as well.

Secondly, I get 13 coefficients for each frame. With 157 frames for a sample of a given length, I get 157 sets of 13 MFCCs. I am having a hard time figuring out how to use all of these coefficients in DTW, because DTW only gives the closest distance between two time signals. I do have DTW code that compares two time signals, but I am not sure how to use all the MFCC values of the signal as features.

Is there some crucial step of classification I am missing? Please help me.


Solution

Say you have N1 sets of 13 MFCCs for the first signal and N2 sets of 13 MFCCs for the second. Compute the distance between each set from the first signal and each set from the second (you can use the Euclidean distance between two 13-element arrays).

This leaves you with an N1xN2 two-dimensional array of frame-to-frame distances, to which you then apply DTW.
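A minimal NumPy sketch of this step, assuming each signal is stored as a (frames x 13) array. The function name `local_cost_matrix` and the random stand-in data are hypothetical, for illustration only. Broadcasting computes the Euclidean distance between every frame of the first signal and every frame of the second, giving the N1xN2 matrix in one shot:

```python
import numpy as np

def local_cost_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distance between every row of a (N1 x 13) and every row of b (N2 x 13)."""
    # Broadcast to shape (N1, N2, 13), then take the norm over the coefficient axis
    diff = a[:, np.newaxis, :] - b[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Random stand-in data (not real MFCCs), just to show the shapes
rng = np.random.default_rng(0)
a = rng.normal(size=(157, 13))  # first signal: 157 frames of 13 coefficients
b = rng.normal(size=(120, 13))  # second signal: 120 frames of 13 coefficients
c = local_cost_matrix(a, b)
print(c.shape)  # (157, 120)
```

Entry `c[i, j]` is the distance between frame `i` of the first signal and frame `j` of the second; DTW then searches this matrix for the cheapest monotonic path.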

Other tips

Check out http://code.google.com/p/aquila/, specifically http://code.google.com/p/aquila/source/browse/trunk/examples/dtw_distance/main.cpp, which has example code for a DTW distance calculation.

In your case, DTW is used to compare two audio sequences. The sequence to be verified is an M1xN matrix and the query is an M2xN matrix, where N is the number of MFCCs per frame. This means your cost matrix will be M1xM2.

To construct the cost matrix, apply a distance/cost measure between the frames of the two sequences: cost(i,j) = your_chosen_multidimension_metric(M1[i,:], M2[j,:]).

The resulting cost matrix is 2D, so you can easily apply DTW to it.
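For reference, the standard DTW accumulation (the same recurrence the Python code later in this answer implements in place) builds an accumulated matrix D from the local cost matrix c, with 0-based indices, where each cell adds the cheapest of its three predecessors:

$$
D(0,0) = c(0,0), \qquad D(i,0) = c(i,0) + D(i-1,0), \qquad D(0,j) = c(0,j) + D(0,j-1)
$$

$$
D(i,j) = c(i,j) + \min\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\}, \qquad i, j \ge 1
$$

The DTW distance between the two sequences is then D(M1-1, M2-1), the bottom-right cell.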

I wrote similar code for DTW based on MFCCs. Below is a Python implementation that returns the DTW score; x and y are the MFCC matrices of the voice sequences, with dimensions M1xN and M2xN:

import numpy as np
from scipy.spatial.distance import cdist

def my_dtw(x, y):
    # Local cost: standardized Euclidean distance between every frame of x (M1 x N)
    # and every frame of y (M2 x N), giving an M1 x M2 matrix
    cost_matrix = cdist(x, y, metric='seuclidean')
    m, n = np.shape(cost_matrix)
    # Accumulate in place: each cell adds the cheapest allowed predecessor
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue  # the start cell keeps its local cost
            elif i == 0:
                cost_matrix[i, j] += cost_matrix[i, j - 1]
            elif j == 0:
                cost_matrix[i, j] += cost_matrix[i - 1, j]
            else:
                cost_matrix[i, j] += min(cost_matrix[i - 1, j],
                                         cost_matrix[i, j - 1],
                                         cost_matrix[i - 1, j - 1])
    # Total cost of the optimal warping path
    return cost_matrix[m - 1, n - 1]
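To turn a DTW score into a classifier, one common approach (an assumption here, not something stated in the answers above) is nearest-neighbour template matching: keep labelled MFCC matrices as templates, compute the DTW distance from the unknown signal to each one, and return the label of the closest. The sketch below is self-contained and uses plain Euclidean local cost instead of `'seuclidean'` so that it needs only NumPy; `dtw_distance`, `classify`, the template labels, and the synthetic data are all hypothetical:

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """DTW distance between MFCC matrices x (M1 x N) and y (M2 x N)."""
    # Euclidean local cost between every pair of frames: shape (M1, M2)
    cost = np.linalg.norm(x[:, np.newaxis, :] - y[np.newaxis, :, :], axis=-1)
    m, n = cost.shape
    # Accumulated cost, padded with an inf border so the first row/column need no special case
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[m, n])

def classify(query: np.ndarray, templates) -> str:
    """Return the label of the template with the smallest DTW distance to query.

    templates: list of (label, mfcc_matrix) pairs (hypothetical structure).
    """
    best_label, best_dist = None, np.inf
    for label, mfcc in templates:
        d = dtw_distance(query, mfcc)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Synthetic stand-in "MFCC" matrices (not real audio features)
rng = np.random.default_rng(1)
door = rng.normal(0.0, 1.0, size=(40, 13))
glass = rng.normal(5.0, 1.0, size=(55, 13))
templates = [("door", door), ("glass", glass)]
query = door[::2] + 0.1  # time-compressed, slightly shifted copy of "door"
print(classify(query, templates))  # door
```

The query has only 20 frames against the template's 40, which is exactly the length mismatch DTW absorbs. With several recordings per class, you could store them all as templates or use k nearest neighbours instead of one.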

Licensed under: CC-BY-SA with attribution