How to convert a pitch track from a melody extraction algorithm to a humming like audio signal

Question 1

Though I don't have access to your synth() function, based on the parameters it takes I'd say your problem is because you're not handling the phase.

That is - it is not enough to concatenate waveform snippets together, you must ensure that they have continuous phase. Otherwise, you're creating a discontinuity in the waveform every time you concatenate two waveform snippets. If this is the case, my guess is that you're hearing the same frequency all the time and that it sounds more like a sawtooth than a sinusoid - am I right?

The solution is to set the starting phase of snippet n to the end phase of snippet n-1. Here's an example of how you would concatenate two waveforms with different frequencies without creating a phase discontinuity:

fs = 44100; % sampling frequency

% synthesize a cosine waveform with frequency f1 and starting additional phase p1
p1 = 0;
dur1 = 1;
t1 = 0:1/fs:dur1; 

x1(1:length(t1)) = 0.5*cos(2*pi*f1*t1 + p1);

% Compute the phase at the end of the waveform
p2 = mod(2*pi*f1*dur1 + p1,2*pi);

dur2 = 1;
t2 = 0:1/fs:dur2; 
x2(1:length(t2)) = 0.5*cos(2*pi*f2*t2 + p2); % use p2 so that the phase is continuous!

x3 = [x1 x2]; % this should give you a waveform without any discontinuities

Note that whilst this gives you a continuous waveform, the frequency transition is instantaneous. If you want the frequency to gradually change from time_n to time_n+1 then you would have to use something more complex like McAulay-Quatieri interpolation. But in any case, if your snippets are short enough this should sound good enough.

Regarding other comments, if I understand correctly your goal is just to be able to hear the frequency sequence, not for it to sound like the original source. In this case, the amplitude is not that important and you can keep it fixed.

If you wanted to make it sound like the original source that's a whole different story and probably beyond the scope of this discussion.

Hope this answers your question!

Question 2

If I understand correctly, you seem to already have an accurate representation of the pitch but your problem is that what you generate just doesn't "sound good enough".

Starting with your second approach: filtering out anything but the pitch isn't going to lead to anything good. By removing everything but a few frequency bins corresponding to your local pitch estimates, you will lose the texture of the input signal, what makes it sound good. In fact, if you took that to an extreme and removed everything but the one sample corresponding to the pitch and took an ifft, you would get exactly a sinusoid, which is what you are doing currently. If you wanted to do this anyway, I recommend you perform all of this by just applying a filter to your temporal signal rather than going in and out of the frequency domain, which is more expensive and cumbersome. The filter would have a small cutoff around the frequency you want to keep and that would allow for a sound with better texture as well.

However, if you already have pitch and duration estimates that you are happy with but you want to improve on the sound rendering, I suggest that you just replace your sine waves--which will always sound like silly beep-beep no matter how much you massage them--with some actual humming (or violin or flute or whatever you like) samples for each frequency in the scale. If memory is a concern or if the songs you represent do not fall into a well tempered scale (thinking middle-east song for example), instead of having a humming sample for each note of the scale, you could only have humming samples for a few frequencies. You would then derive the humming sounds at any frequency by doing a sample rate conversion from one of these humming samples. Having a few samples to pick from for doing the sample conversion would allow you to pick the one that leans to "best" ratio with the frequency you need to produce, since the complexity of sampling conversion depends on that ratio. Obviously adding a sample rate conversion would be more work and computationally demanding compared to just having a bank of samples to pick from.

Using a bank of real samples will make a big difference in the quality of what you render. It will also allow you to have realistic attacks for each new note you play.

Then yes, like you suggest, you may want to also play with the amplitude by following the instantaneous amplitude of the input signal to produce a more nuanced rendering of the song.

Last, I would also play with the duration estimates you have so that you have smoother transitions from one sound to the next. Guessing from your performance of your audio file that I enjoyed very much (beep beep beeeeep beep beeep beeeeeeeep) and the graph that you display, it looks like you have many interruptions inserted in the rendering of your song. You could avoid this by extending the duration estimates to get rid of any silence that is shorter than, say .1 second. That way you would preserver the real silences from the original song but avoid cutting off each note of your song.

Question 3

You have at least 2 problems.

First, as you surmised, your analysis has thrown away all the amplitude information of the melody portion of the original spectrum. You will need an algorithm that captures that information (and not just the amplitude of the entire signal for polyphonic input, or that of just the FFT pitch bin for any natural musical sounds). This is a non-trivial problem, somewhere between melodic pitch extraction and blind source separation.

Second, sound has timbre, including overtones and envelopes, even at a constant frequency. Your synthesis method is only creating a single sinewave, while humming probably creates a bunch of more interesting overtones, including a lot of higher frequencies than just the pitch. For a slightly more natural sound, you could try analyzing the spectrum of yourself humming a single pitch and try to recreate all those dozens of overtone sine waves, instead of just one, each at the appropriate relative amplitude, when synthesizing each frequency time-stamp in your analysis. You could also look at the amplitude envelope over time of yourself humming one short note, and use that envelope to modulate the amplitude of your synthesizer.

Question 4

use libfmp.c8 to sonify the values

import IPython.display as ipd
import libfmp.b
import libfmp.c8
data = vamp.collect(audio, samplerate, "mtg-melodia:melodia", parameters=params)
hop, melody = data['vector']
timestamps=np.arange(0,len(melody)) * float(hop)
melody_pos = melody[:]
melody_pos[melody<=0] = 0   #get rid off - vals
d = {'time': ts, 'frequency':pd.Series(melody_pos) }
df=pd.DataFrame(d)
traj = df.values
x_traj_mono = libfmp.c8.sonify_trajectory_with_sinusoid(traj, len(audio), sr, smooth_len=50, amplitude=0.8)
ipd.display(ipd.Audio(x_traj_mono+y, rate=sr))```