Programmatic mix analysis of stereo audio files - is bass panned to one channel?

https://stackoverflow.com/questions/21761741

11-10-2022

Question

I want to analyze my music collection, which is all CD audio data (stereo 16-bit PCM, 44.1kHz). What I want to do is programmatically determine if the bass is mixed (panned) only to one channel. Ideally, I'd like to be able to run a program like this

mono-bass-checker music.wav

And have it output something like "bass is not panned" or "bass is mixed primarily to channel 0".

I have a rudimentary start on this, which in pseudocode looks like this:

binsize = 2^N  # FFT length (window size) in samples per channel, a power of 2
while not end of audio file:
    read binsize stereo frames from audio file
    de-interleave channels into two separate arrays
    chan0_fft_result = fft on channel 0 array
    chan1_fft_result = fft on channel 1 array
    for each index i in 0 .. binsize/2:
        frequency_bin = i * 44100 / binsize
        # define bass as 30 Hz to 150 Hz (I can't hear anything below 30 Hz)
        if frequency_bin > 150 or frequency_bin < 30: ignore
        magnitude = sqrt(chanX_fft_result[i].real^2 + chanX_fft_result[i].imag^2)

I'm not really sure where to go from here. Some concepts I've read about but are still too nebulous to me:

Window function. I'm currently not using one, just naively reading consecutive blocks from the audio file: samples 0 to 1023, 1024 to 2047, etc. (for example with binsize=1024). Is this something that would be useful to me? And if so, how does it get integrated into the program?
Normalizing and/or scaling of the magnitude. Lots of people do this for the purpose of making pretty spectrograms, but do I need to do that in my case? I understand human hearing roughly works on a log scale, so perhaps I need to massage the magnitude result in some way to filter out what I wouldn't be able to hear anyway? Is something like A-weighting relevant here?
binsize. I understand that a bigger binsize gets me more frequency bins... but I can't decide if that helps or hurts in this case.
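To make the binsize trade-off concrete, here is a minimal sketch in C that computes the width of one FFT bin and counts how many bins fall inside the 30 Hz to 150 Hz band from the pseudocode. The function names bin_width_hz and bins_in_band are made up for illustration and are not part of libsndfile or fftw3.

```c
#include <assert.h>
#include <stddef.h>

/* Width of one FFT bin in Hz for a given sample rate and FFT length. */
double bin_width_hz(double sample_rate, size_t fft_len)
{
    assert(fft_len > 0);
    return sample_rate / (double)fft_len;
}

/* Count the bins whose center frequency lies in [lo_hz, hi_hz],
 * looking only at bins 0 .. fft_len/2 (DC up to Nyquist). */
size_t bins_in_band(double sample_rate, size_t fft_len,
                    double lo_hz, double hi_hz)
{
    size_t count = 0;
    for (size_t i = 0; i <= fft_len / 2; i++) {
        double f = (double)i * sample_rate / (double)fft_len;
        if (f >= lo_hz && f <= hi_hz)
            count++;
    }
    return count;
}
```

At 44100 Hz, binsize=1024 gives bins about 43 Hz wide, so only 3 bins land in the bass band; binsize=4096 gives bins about 10.8 Hz wide and 11 bins in the band. The cost is time resolution: 4096 samples span roughly 93 ms instead of 23 ms, which probably matters little if the per-block results are averaged over a whole song.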

I can generate a "mono bass song" using sox like this:

sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine40hz_mono.wav synth 5.0 sine 40.0
sox -t null /dev/null --encoding signed-integer --bits 16 --rate 44100 --channels 1 sine329hz_mono.wav synth 5.0 sine 329.6
sox -M sine40hz_mono.wav sine329hz_mono.wav sine_merged.wav

In the resulting "sine_merged.wav" file, one channel is pure bass (40Hz) and one is non-bass (329 Hz). When I compute the magnitude of bass frequencies for each channel of that file, I do see a significant difference. But what's curious is that the 329Hz channel has non-zero sub-150Hz magnitude. I would expect it to be zero.

Even then, with this trivial sox-generated file, I don't really know how to interpret the data I'm generating. And obviously, I don't know how I'd generalize to my actual music collection.

FWIW, I'm trying to do this with libsndfile and fftw3 in C, based on help from these other posts:

Solution

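Not using a window function is the same as using a rectangular window. Any component that is not exactly periodic in your FFT length, including the high frequency content, gets smeared into all the other bins of the FFT result, including the low frequency bins. (This is usually called spectral "leakage".) A 329.6 Hz sine sampled at 44.1 kHz never fits a whole number of cycles into a power-of-2 block, which is why that channel shows non-zero magnitude below 150 Hz.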

To reduce this, multiply each block of samples by a window function (von Hann, etc.) before the FFT, and expect to have to use some threshold level, instead of expecting zero content in any bins.

Also note that the bass notes from many musical instruments can generate some very powerful high frequency overtones or harmonics that will show up in the upper bins of an FFT, so you can't rule out a strong bass mix just because a channel has a lot of high frequency content.

Licensed under: CC-BY-SA with attribution
