Question

I'm trying to get the pitch class from recorded voice (44.1 kHz) using autocorrelation. What I'm doing is basically described here: http://cnx.org/content/m11714/latest/ and also implemented here: http://code.google.com/p/yaalp/source/browse/trunk/csaudio/WaveAudio/WaveAudio/PitchDetection.cs (the part using PitchDetectAlgorithm.Amdf)

So, in order to detect the pitch class, I build an array with the normalized correlation for each of the frequencies from C2 to B3 (two octaves) and select the one with the highest value (applying a "1 - correlation" transformation first, so I search for a maximum rather than a minimum).

I tested it with generated audio (a simple sine wave):

data[i] = (short)(Math.Sin(2 * Math.PI * i/fs * freq) * short.MaxValue);

But it only works for input frequencies lower than B4. Investigating the generated array, I found that starting from G3 another peak evolves that eventually gets bigger than the correct one, and my B4 is detected as an E. Changing the number of analysed frequencies did not help at all.

My buffer size is 4000 samples and the frequency of B4 is ~493 Hz, so I cannot think of a reason why this is failing. Are there any further constraints on the frequencies or buffer sizes? What is going wrong here?

I'm aware that I could use an FFT as Performous does, but this method looked simple and also gives weighted frequencies that can be used for visualisations. I don't want to throw it away that easily, and I'd at least like to understand why it fails.

Update: Core function used:

private double _GetAmdf(int tone)
{
    int samplesPerPeriod = _SamplesPerPeriodPerTone[tone]; // samples in one period
    int accumDist = 0;   // accumulated distances
    int sampleIndex = 0; // index of the sample to analyze
    // Start value = index of the sample one period ahead
    for (int correlatingSampleIndex = sampleIndex + samplesPerPeriod; correlatingSampleIndex < _AnalysisBufLen; correlatingSampleIndex++, sampleIndex++)
    {
        // Distance (correlation: 1 - dist/Int16.MaxValue) to the corresponding sample
        // in the next period (0 = equal .. 2*Int16.MaxValue = totally different)
        int dist = Math.Abs(_AnalysisBuffer[sampleIndex] - _AnalysisBuffer[correlatingSampleIndex]);
        accumDist += dist;
    }

    return 1.0 - (double)accumDist / Int16.MaxValue / sampleIndex;
}

With that function, the pitch/tone is (pseudocode)

tone = ArgMax(_GetAmdf(tone))  // over tone = C2..B3
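In Java (the question's code is C#, but the logic is identical), the selection step might look like the following sketch; `ToneSelect` and `detectTone` are hypothetical names, not from the original code:

```java
public class ToneSelect {
    // Pick the candidate tone whose AMDF-based "correlation" is highest.
    // correlationPerTone[t] holds the 1 - normalizedDistance value for tone t,
    // as returned by a _GetAmdf-style function; index 0 would be C2.
    static int detectTone(double[] correlationPerTone) {
        int bestTone = 0;
        for (int tone = 1; tone < correlationPerTone.length; tone++) {
            if (correlationPerTone[tone] > correlationPerTone[bestTone]) {
                bestTone = tone;
            }
        }
        return bestTone;
    }

    public static void main(String[] args) {
        double[] scores = {0.42, 0.91, 0.63};
        System.out.println("best tone index: " + detectTone(scores));
    }
}
```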

I also tried using actual autocorrelation with:

double accumDist=0;
//...
double dist = _AnalysisBuffer[sampleIndex] * _AnalysisBuffer[correlatingSampleIndex];
//...
const double scaleValue = (double)Int16.MaxValue * (double)Int16.MaxValue;
return accumDist / (scaleValue * sampleIndex);

but that fails too, detecting an A3 as a D in addition to the B4 as an E.

Note: I do not divide by the buffer length but by the number of samples actually compared. Not sure if this is right, but it seems logical.


Solution

This is the common octave problem with autocorrelation and similar lag-based pitch estimators (AMDF, ASDF, etc.).

A frequency that is one octave (or any other integer submultiple) lower will also give just as good a match in shifted-waveform similarity: a sine wave shifted by 2pi looks the same as one shifted by 4pi, and the longer lag represents a pitch one octave lower. Depending on noise, and on how close the continuous peak lies to the sampled peak, one or the other estimation peak may come out slightly higher, even though the pitch has not changed.

So some other test is needed to reject lower-octave (or other submultiple-frequency) peaks in the waveform correlation or lag matching (e.g. does a peak look close enough to one or more other peaks one or more octaves, or other frequency multiples, up?).
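As a sketch of one such test (hypothetical names and an illustrative threshold, not a tuned implementation): when the lag at half the best period, i.e. one octave up, scores nearly as well, prefer the shorter lag:

```java
public class OctaveCheck {
    // If the lag at half the best period (one octave up) scores almost as
    // well, prefer it: the longer lag is likely a sub-octave artifact.
    // corrByLag[lag] is the normalized correlation at that lag; the
    // threshold (e.g. 0.9) is illustrative and needs tuning on real audio.
    static int preferUpperOctave(double[] corrByLag, int bestLag, double threshold) {
        int half = bestLag / 2;
        if (half > 0 && corrByLag[half] >= threshold * corrByLag[bestLag]) {
            return half;
        }
        return bestLag;
    }

    public static void main(String[] args) {
        double[] corr = {0.0, 0.2, 0.95, 0.3, 1.0};
        System.out.println("chosen lag: " + preferUpperOctave(corr, 4, 0.9));
    }
}
```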

OTHER TIPS

I don't know C#, but if the tiny amount of code you've supplied is representative and C# behaves like most other C-like languages, it could be introducing a huge amount of harmonic distortion.

In most C-like languages (and most other languages I know, like Java), the output of something like Math.sin() is in the range [-1, 1]. A cast to an int, short or long truncates toward zero, so nearly every sample becomes 0 (with an occasional ±1 at the extremes). Essentially, you will have changed your sine wave into a grossly distorted waveform with many overtones, which may be what these libraries are picking up.

Try this:

data[i] = (short)(32767 * Math.Sin(2 * Math.PI * i / fs * freq));
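A small Java sketch of the same pitfall (Java and C# both truncate toward zero on this cast; the class and method names here are made up for illustration):

```java
public class CastPitfall {
    // Broken: casting the [-1, 1] sine sample truncates toward zero,
    // so nearly every generated sample ends up as 0.
    static short unscaled(double phase) {
        return (short) Math.sin(phase);
    }

    // Correct: scale to the short range first, then cast.
    static short scaled(double phase) {
        return (short) (Math.sin(phase) * 32767);
    }

    public static void main(String[] args) {
        System.out.println(unscaled(Math.PI / 4) + " vs " + scaled(Math.PI / 4));
    }
}
```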

Besides everything said by @Bjorn and @hotpaw2, I have run into the problems described by @hotpaw2 myself in the past.

It was not clear from your code whether you are computing with a shift of one sample at a time (as I have always seen in the equations for AMDF)!

I did it in Java; you can find the full source code in Tarsos!

Here are the equivalent steps from your post, in Java:

    int maxShift = audioBuffer.length;
    double[] amd = new double[maxShift];

    for (int i = 0; i < maxShift; i++) {
        int frameLength = maxShift - i;
        double[] frames1 = new double[frameLength];
        double[] frames2 = new double[frameLength];
        // frames1: samples from the start of the buffer;
        // frames2: the same number of samples, shifted by the lag i
        for (int t = 0; t < frameLength; t++) {
            frames1[t] = audioBuffer[t];
            frames2[t] = audioBuffer[t + i];
        }

        // accumulate the absolute differences: the AMDF value at lag i
        double summation = 0;
        for (int l = 0; l < frameLength; l++) {
            summation += Math.abs(frames1[l] - frames2[l]);
        }
        amd[i] = summation;
    }
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow