Recognizing the pitch of a sound with MATLAB

Question 1

You are on the right track, but this is not a simple problem. I would suggest looking into something called a chromagram. It takes the information you gathered from the spectrogram and "bins" it into piano note frequencies, which gives an approximation of a song's harmonic content. It may not be entirely accurate, because a note's harmonics put energy into the bins of other notes, but it is a start.

Do realize that transcription, which is what you are doing, is a very difficult task that has yet to be fully solved. People are still researching it to this day. I have code to generate chroma, but I will have to dig for it.
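To make the "binning" idea concrete, here is a minimal sketch of how a frequency maps onto a piano key and a pitch class, assuming equal temperament with A0 = 27.5 Hz. The names `noteNames`, `key` and `pitchClass` are made up for illustration, and the example frequencies are just test values.

```matlab
% Sketch: map a frequency in Hz to the nearest of the 88 piano keys.
% Assumes equal temperament; variable names are illustrative only.
A0 = 27.5;                                   % lowest piano note, in Hz
noteNames = {'A','A#','B','C','C#','D','D#','E','F','F#','G','G#'};

f = [27.5 261.63 440 4186.01];               % example frequencies in Hz
key = round(12*log2(f/A0));                  % semitones above A0 (0..87)
key = min(max(key,0),87);                    % clamp to the piano range
pitchClass = mod(key,12) + 1;                % 1 = A, 2 = A#, ..., 12 = G#

for k = 1:numel(f)
    fprintf('%8.2f Hz -> key %2d (%s)\n', f(k), key(k)+1, noteNames{pitchClass(k)});
end
```

A chromagram does the same mapping for every spectrogram bin at once, and then sums the energy that lands on each pitch class.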

EDIT

Here is some code to compute chroma:

clc; close all; clear all;
% didn't have a wav file, so random noise stands in; replace it with the following
% [audio,fs] = audioread('audioFile.wav'); % wavread in older MATLAB releases
audio = rand(1,10000);
fs = 44100; % temp sampling frequency, will depend on audio input
NFFT = 1024; % feel free to change FFT size
hamWin = hamming(NFFT); % window your audio signal to avoid fft edge effects

% get spectral content
S = spectrogram(audio,hamWin,NFFT/2,NFFT,fs);

% Start at the lowest piano note
A0 = 27.5;
% all 88 keys
keys = 0:87;
center = A0*2.^((keys)/12); % set filter center frequencies
left = A0*2.^((keys-1)/12); % define left frequency
left = (left+center)/2.0;
right = A0*2.^((keys+1)/12); % define right frequency
right = (right+center)/2;

% Construct a bank of triangular filters, one per piano key
% (named filterBank so it does not shadow the built-in filter function)
filterBank = zeros(numel(center),NFFT/2+1); % placeholder
freqs = linspace(0,fs/2,NFFT/2+1); % array of frequencies in spectrogram
for i = 1:numel(center)
    xTemp = [0,left(i),center(i),right(i),fs/2]; % create points for filter bounds
    yTemp = [0,0,1,0,0]; % set magnitudes at each filter point
    filterBank(i,:) = interp1(xTemp,yTemp,freqs); % interpolate to get the filter value at each spectrogram frequency
end

% multiply filter bank by spectrogram magnitude to get one row per piano key
chroma = filterBank*abs(S);

%Put into 12 bin chroma
chroma12 = zeros(12,size(chroma,2));
for i = 1:size(chroma,1)
    bin = mod(i-1,12)+1; % pitch class index: 1 = A, 2 = A#, ..., 12 = G#
    chroma12(bin,:) = chroma12(bin,:) + chroma(i,:); % add octaves together
end

That should do the trick. It may not be the fastest solution, but it should get the job done.

Surely it can be optimized.
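One thing worth checking before tuning for speed is frequency resolution. The sketch below compares the FFT bin spacing for the `fs` and `NFFT` values used above with the gap between the two lowest piano notes, assuming equal temperament. The names `binSpacing`, `semitoneGap` and `minNFFT` are made up for this illustration.

```matlab
% Sketch: is the FFT fine enough to separate adjacent piano notes?
fs = 44100;                          % sampling rate used in the code above
NFFT = 1024;                         % FFT size used in the code above
A0 = 27.5;                           % lowest piano note, in Hz

binSpacing = fs/NFFT;                % distance between FFT bins, in Hz
semitoneGap = A0*(2^(1/12) - 1);     % A0 to A#0, the narrowest gap on the piano

% Smallest power-of-two FFT size whose bin spacing fits inside that gap
minNFFT = 2^nextpow2(fs/semitoneGap);

fprintf('bin spacing %.2f Hz, narrowest semitone gap %.2f Hz\n', binSpacing, semitoneGap);
fprintf('suggested NFFT >= %d\n', minNFFT);
```

With these values the bins are about 43 Hz apart while the lowest semitone gap is under 2 Hz, so the filters for the low notes can fall between bins and pick up little or nothing. A larger `NFFT` helps the low end, at the cost of coarser time resolution.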

Question 2

As MZimmerman6 said, this is a very complex problem. Peak-to-peak measuring may be successful here, but it will certainly not be if the music gets any more complicated. I have tackled this problem before and seen other people try it as well, and the most successful projects I have seen among my peers involve the following:

1) Constrain the time. It may actually be difficult for a program to determine when a note is even changing! This is especially true if you are trying to separate vocals from instrumentals, or when two chords play sequentially but share a note that stays the same between them. Constraining the time means finding out when each chunk of music happens; in your case, divide the track into four segments, one for each note. You may be able to use the attack of each note to your advantage and automatically detect it as the beginning of a new segment to test.

2) Constrain the frequencies. You have to use what you know; otherwise you will need to make eigenmode comparisons. Singular value decomposition has been effective in this arena. But if you have recordings of the piano playing each note individually, as well as recordings of the piano playing the song, you can take a fast Fourier transform of each segment (see the time constraints above), cut out the noise, and compare them. Then you employ a subtractive method or another metric to determine the best "fit" for each note.
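Point 1 can be sketched roughly as follows, assuming notes with a sharp attack and a decaying tail. The signal is synthetic, and the frame length and the threshold of 2 are arbitrary choices for this illustration; real recordings will need tuning and probably overlapping frames.

```matlab
% Sketch: find note onsets from jumps in short-time energy.
fs = 8000;                                   % sampling rate of the synthetic signal
noteFreqs = [261.63 329.63 392.00 523.25];   % four made-up notes, in Hz
noteDur = 0.5;                               % seconds per note
t = (0:round(noteDur*fs)-1)/fs;
audio = [];
for f = noteFreqs
    audio = [audio, exp(-6*t).*sin(2*pi*f*t)]; % sharp attack, exponential decay
end

frameLen = 200;                              % 25 ms frames, no overlap
nFrames = floor(numel(audio)/frameLen);
frames = reshape(audio(1:nFrames*frameLen), frameLen, nFrames);
energy = sum(frames.^2, 1);                  % short-time energy per frame

% An onset is a frame whose energy jumps well above the previous frame.
ratio = energy ./ [eps, energy(1:end-1)];    % first frame always counts as an onset
onsetFrames = find(ratio > 2);               % threshold is an arbitrary choice
onsetTimes = (onsetFrames-1)*frameLen/fs;    % start of each segment, in seconds
disp(onsetTimes);
```

Each pair of consecutive onset times then bounds one segment to analyze on its own.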

This is a rough explanation of the concerns, but trust me, the more constraints you can put on this sort of analysis the better.
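As an illustration of point 2, here is a minimal sketch of comparing one segment against reference spectra with a subtractive metric. Everything in it is an assumption made for the example: the "recordings" are synthetic tones with two harmonics, `makeNote`, `templates` and `refNames` are made-up names, and the 5% noise gate is an arbitrary threshold.

```matlab
% Sketch: match a segment's spectrum against per-note reference spectra.
fs = 8000;                                   % sampling rate of the synthetic signals
N = 2048;                                    % samples per segment and FFT size
t = (0:N-1)/fs;
w = 0.5 - 0.5*cos(2*pi*(0:N-1)/(N-1));       % Hann window, written out by hand
refFreqs = [261.63 329.63 392.00 523.25];    % hypothetical reference notes, in Hz
refNames = {'C4','E4','G4','C5'};

% Stand-in for a recorded note: a fundamental plus two weaker harmonics.
makeNote = @(f) sin(2*pi*f*t) + 0.5*sin(2*pi*2*f*t) + 0.25*sin(2*pi*3*f*t);

% One normalized magnitude spectrum per reference note.
templates = zeros(numel(refFreqs), N/2+1);
for k = 1:numel(refFreqs)
    X = abs(fft(w.*makeNote(refFreqs(k))));
    X = X(1:N/2+1);                          % keep non-negative frequencies
    templates(k,:) = X/norm(X);
end

% Segment to identify: the third reference note plus some noise.
segment = makeNote(392.00) + 0.1*randn(1,N);
S = abs(fft(w.*segment));
S = S(1:N/2+1);
S(S < 0.05*max(S)) = 0;                      % crude noise gate: drop weak bins
S = S/norm(S);

% Subtractive metric: total absolute difference to each template.
err = sum(abs(templates - repmat(S, numel(refFreqs), 1)), 2);
[~, best] = min(err);
fprintf('best match: %s\n', refNames{best});
```

With real recordings the templates would come from the individually played notes, and chords would need a method that can explain a segment as a mix of several templates rather than a single best fit.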