The ultimate guide for sound features and their applications
In this article we go deeper into sound features, describing the Mel-Spectrogram and MFCC features and their applications. It follows the earlier guide on Spectrogram features.
In this article we’ll cover the following topics:
- Why do we need features other than the Spectrogram features?
- Mel-Spectrogram (Filter Bank) features
- Mel-frequency cepstral coefficients (MFCCs) features
- Conclusion
Why do we need features other than the Spectrogram features?
The most important advantage of Spectrogram features is that they capture the properties of the sound well enough for us to recreate it.
Their main disadvantage is:
- They treat all frequency ranges the same, which means they don’t mimic the behavior of our ears. Our ears are more sensitive to differences between low-frequency sounds than to differences between high-frequency sounds.
Because of this disadvantage, some tasks need other features.
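The uneven sensitivity described above is commonly modeled with the mel scale. The sketch below uses one widely used conversion formula (the 2595 and 700 constants are an assumption here, since other variants exist) and compares a 100 Hz gap at low and high frequencies. The function names are illustrative only.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert Hz to mels using one common formula (other variants exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# The same 100 Hz step, once at low and once at high frequency.
low_gap = hz_to_mel(200.0) - hz_to_mel(100.0)
high_gap = hz_to_mel(5100.0) - hz_to_mel(5000.0)
print(f"100 -> 200 Hz spans {low_gap:.1f} mels")
print(f"5000 -> 5100 Hz spans {high_gap:.1f} mels")
```

On this scale the low-frequency step covers several times more mels than the high-frequency step, which is the behavior a linear-frequency spectrogram does not capture.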
Mel-Spectrogram (Filter Bank) features
This is where Mel-Spectrogram (Filter bank) features shine. The point here is to mimic how our ears perceive sound: they can distinguish low frequencies better than high frequencies. So, we apply a bank of filters to the Spectrogram, where the width of the filters increases as frequency increases.
The most common choice is a bank of overlapping triangular filters spaced evenly on the mel scale. Other filter shapes, such as the Hanning window used in the STFT, can also be used.
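A minimal sketch of a triangular filter bank, assuming the common 2595/700 mel formula, 26 filters, a 512-point FFT, and a 16 kHz sample rate. All of these values and the function names are illustrative choices, and the input spectrogram is random stand-in data rather than real audio.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_filters=26, n_fft=512, sample_rate=16000):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_filters + 2 edge points, evenly spaced in mels, converted back to Hz.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, center, right = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

bank = mel_filter_bank()
# Stand-in power spectrogram (frames x bins); a real one would come from an STFT.
power_spec = np.abs(np.random.default_rng(0).normal(size=(100, 257))) ** 2
mel_spec = power_spec @ bank.T  # shape: (frames, n_filters)
```

Because the edges are evenly spaced in mels, the filters come out narrow at low frequencies and wide at high frequencies.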
The main advantage of these features is:
- They mimic our ears’ behavior, so they are great for recognition tasks.
But their main disadvantage is:
- They are heavily correlated, so systems that are susceptible to correlated inputs won’t be able to use them well.
Mel-frequency cepstral coefficients (MFCCs) features
Here we start from the previously computed Mel-Spectrogram and take its logarithm.
Then the Discrete Cosine Transform (DCT) is applied to this log Mel-spectrogram. The DCT simply represents a signal as coefficients of cosine waves.
The main advantage of the DCT over the DFT, and the reason it is used for compressing data, is that it concentrates the most important information in its first few coefficients.
As we go further along the DCT coefficients, their importance gradually decreases.
So, we can keep just enough of the first DCT coefficients to represent our signal.
The most common choices are 13 and 40 coefficients.
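The log and DCT steps can be sketched with NumPy alone. This is a sketch under stated assumptions rather than a reference implementation: the function names, the orthonormal DCT-II scaling, the natural log, the small `eps` guard, and the random stand-in Mel-spectrogram are all illustrative choices, and libraries differ on scaling and log base.

```python
import numpy as np

def dct_ii_matrix(n_out, n_in):
    """Rows are the first n_out orthonormal DCT-II basis vectors of length n_in."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] *= np.sqrt(0.5)  # the constant row needs a smaller scale factor
    return basis

def mfcc_from_mel(mel_spec, n_mfcc=13, eps=1e-10):
    """mel_spec: (frames, n_mels) non-negative energies. Returns (frames, n_mfcc)."""
    log_mel = np.log(mel_spec + eps)  # eps avoids log(0)
    # Project each frame onto the first n_mfcc cosine basis vectors.
    return log_mel @ dct_ii_matrix(n_mfcc, mel_spec.shape[1]).T

# Stand-in Mel-spectrogram: 100 frames, 40 mel bands.
rng = np.random.default_rng(0)
mel_spec = rng.uniform(0.1, 1.0, size=(100, 40))
mfcc = mfcc_from_mel(mel_spec)  # keeps only the first 13 coefficients
print(mfcc.shape)
```

Keeping 13 of the 40 possible coefficients is the truncation step: the discarded higher coefficients carry the fine detail of the log spectrum.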
Deltas and accelerations (delta-deltas) can also be appended to the features. A delta is the change of each coefficient over time, estimated from neighboring frames. An acceleration is the delta of the deltas.
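One common way to estimate deltas is a regression over a few neighboring frames. The sketch below assumes a window of 2 frames on each side and edge padding. The function name, the window size, and the random stand-in MFCCs are illustrative choices.

```python
import numpy as np

def delta(features, width=2):
    """Time derivative of features with shape (n_frames, n_coeffs).

    Uses d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), n = 1..width.
    """
    n_frames = features.shape[0]
    # Repeat the first and last frames so edge frames have neighbors.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]
        behind = padded[width - n : width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom

# Stand-in MFCCs: 100 frames of 13 coefficients.
mfcc = np.random.default_rng(0).normal(size=(100, 13))
d1 = delta(mfcc)   # deltas
d2 = delta(d1)     # accelerations (deltas of the deltas)
stacked = np.hstack([mfcc, d1, d2])  # 13 + 13 + 13 = 39 features per frame
```

Stacking 13 coefficients with their deltas and accelerations gives 39 values per frame.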
MFCCs are less correlated than Mel-spectrogram features. So, we can choose MFCCs over the Mel-spectrogram if our system is susceptible to correlated inputs.
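Pre-emphasis is a simple high-pass filter that boosts the high frequencies of the signal. Because it also suppresses the constant (DC) component, its effect is roughly similar to mean normalization of the signal.
It is applied to the raw signal before extracting any features, using the filter below, where α is a coefficient close to 1 (commonly 0.95 or 0.97):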
y(t)=x(t)-αx(t-1)
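The filter above translates directly into NumPy. In this sketch the function name and the default α = 0.97 are assumptions, and the first sample is passed through unchanged because it has no predecessor.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Apply y[t] = x[t] - alpha * x[t-1]; the first sample is kept as-is."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# A constant (pure DC) signal is almost entirely removed after the first sample.
dc = np.ones(5)
print(pre_emphasis(dc))
```

After the first sample, each output value of the constant signal is 1 - α, which shows why the filter behaves much like removing the mean.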
Now that we have covered the MFCC features, their main advantage is:
- They are a compressed, less correlated version of the Mel-Spectrogram features.
Their main disadvantage is:
- The same property seen from the other side: being a compressed version means losing some information.
Conclusion
The most common frequency-domain features are:
- Spectrogram: used for tasks that retrieve the properties or nature of the sound.
- Mel-spectrogram (Fbanks): used for sound recognition tasks with systems that are not susceptible to correlated inputs, such as neural networks.
- MFCCs: used for sound recognition tasks with systems that are susceptible to correlated inputs, such as HMMs.