Audio Effects with Cepstral Processing

Nathan Ho

2023-11-23

Much like the previously discussed wavelet transforms, the cepstrum is a frequency-domain method that I see talked about a lot in the scientific research literature, but only occasionally applied to the creative arts. The cepstrum is sometimes described as “the FFT of the FFT” (although this is an oversimplification, since there are nonlinear operations sandwiched in between those two transforms, and the second is really the Discrete Cosine Transform). In contrast to wavelets, the cepstrum is very popular in audio processing, most notably in the ubiquitous mel-frequency cepstral coefficients (MFCCs). Some would not consider the MFCCs a true “cepstrum”; others would say the term “cepstrum” is broad enough to encompass them. I have no strong opinion.

In almost all applications of the cepstrum, it is used solely for analysis and generally isn’t invertible. This is the case for MFCCs, where the magnitude spectrum is downsampled in the conversion to the mel scale, resulting in a loss of information. Resynthesizing audio from the cepstral descriptors commonly used in the literature is an underdetermined problem, usually tackled with machine learning or other complex optimization methods.

However, it is actually possible to implement audio effects in the MFCC domain with perfect reconstruction. You just have to keep around all the information that gets discarded, resulting in this signal chain:

1. Take the STFT. The following steps apply for each frame.
2. Compute the power spectrum (square of magnitude spectrum) and the phases.
3. Apply a bank of bandpass filters to the power spectrum, equally spaced on the mel-frequency scale. This is the mel spectrum, and it downsamples the power spectrum, losing information.
4. Upsample the mel spectrum back up to a full-resolution spectral envelope. Divide the power spectrum by the envelope to produce the residual spectrum. (You have to add a little epsilon to the envelope to prevent division by zero.)
5. Compute the logarithm and then the Discrete Cosine Transform of the mel spectrum to produce the MFCCs.
6. Perform any processing desired.
7. Invert step 5: take the inverse DCT and then the exponential to produce the mel spectrum.
8. Invert step 4: upsample the mel spectrum to the spectral envelope, and multiply it by the residual spectrum to produce the power spectrum.
9. Take the square root of the power spectrum and recombine it with the phases to produce the complex spectrum.
10. Inverse FFT, then overlap-add to resynthesize the signal.
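The per-frame part of this chain can be sketched in Python with NumPy and SciPy. This is an illustration under several assumptions of mine, not the exact implementation used for the audio examples: the filters are triangular, the mel formula is the common 2595 log10(1 + f/700) variant, the "upsampling" is done with the transposed filterbank, and the names `mel_filterbank`, `analyze`, `synthesize`, and `EPS` are hypothetical.

```python
import numpy as np
from scipy.fft import dct, idct

EPS = 1e-12  # small constant to avoid log(0) and division by zero (assumed value)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr, f_lo, f_hi):
    """Triangular filters equally spaced in mel; shape (n_bands, n_fft // 2 + 1)."""
    bin_hz = np.fft.rfftfreq(n_fft, 1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_bands + 2))
    fb = np.zeros((n_bands, bin_hz.size))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        fb[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def analyze(spectrum, fb):
    """One STFT frame -> (MFCCs, residual spectrum, phases)."""
    power = np.abs(spectrum) ** 2
    phase = np.angle(spectrum)
    mel = fb @ power + EPS                   # mel spectrum (lossy downsampling)
    envelope = fb.T @ mel + EPS              # upsample back to full resolution
    residual = power / envelope              # keeps what the mel spectrum discards
    mfcc = dct(np.log(mel), type=2, norm="ortho")
    return mfcc, residual, phase

def synthesize(mfcc, residual, phase, fb):
    """Inverse of analyze(); exact when the MFCCs are left unmodified."""
    mel = np.exp(idct(mfcc, type=2, norm="ortho"))
    envelope = fb.T @ mel + EPS
    power = envelope * residual
    return np.sqrt(power) * np.exp(1j * phase)

# Round trip on one random frame: with no processing, the spectrum comes back.
rng = np.random.default_rng(0)
spectrum = np.fft.rfft(rng.standard_normal(2048))
fb = mel_filterbank(30, 2048, 48_000, 20.0, 20_000.0)
mfcc, residual, phase = analyze(spectrum, fb)
rebuilt = synthesize(mfcc, residual, phase, fb)
```

Any effect then goes between `analyze` and `synthesize` and only touches `mfcc`; the residual and phases pass through untouched.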

It’s a lot of steps, but as an extension of the basic MFCC algorithm, it’s not that much of a leap. I would not be surprised if someone has done this before, storing all residuals when computing the MFCCs so the process can be inverted, but I had difficulty finding prior work on this for the particular application of musical effects. Something similar is done in MFCC-based vocoders, where the “residual spectrum” is instead replaced with speech parameters such as pitch, but I haven’t seen this done on general, non-speech signals.

I will be testing on the following mono snippet of Ed Sheeran’s “Perfect.” (If you plan on doing many listening tests on a musical signal, never use a sample of music you enjoy.)

As for the parameters: mono, 48 kHz sample rate, 2048-sample FFT buffer with Hann window and 50% overlap, 30-band mel spectrum from 20 Hz to 20 kHz.
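For reference, an STFT and overlap-add resynthesis with these parameters might look like the sketch below. It assumes the Hann window is applied only at analysis time (a periodic Hann window at 50% overlap sums to one, so plain overlap-add then reconstructs the interior of the signal); the post does not say whether a synthesis window is also used, and the function names `stft` and `istft` are my own.

```python
import numpy as np
from scipy.signal import get_window

SR = 48_000        # sample rate in Hz
N_FFT = 2048       # FFT buffer size in samples
HOP = N_FFT // 2   # 50% overlap

def stft(x, n_fft=N_FFT, hop=HOP):
    """Windowed frames -> complex spectra, shape (n_frames, n_fft // 2 + 1)."""
    window = get_window("hann", n_fft)  # periodic Hann by default
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=N_FFT, hop=HOP):
    """Inverse FFT of each frame, then overlap-add."""
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
    return out

# One second of noise: everything except the first and last half-frame
# (which are covered by only one window) is reconstructed.
rng = np.random.default_rng(0)
x = rng.standard_normal(SR)
y = istft(stft(x))
```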

Cepstral EQ

Because of the nonlinearities involved in the signal chain, merely multiplying the MFCCs by a constant can do some pretty strange things. Zeroing out all MFCCs has the effect of removing the spectral envelope and whitening the signal. The effect on vocal signals is pronounced, turning Ed into a bumblebee.

Multiplying all MFCCs by 2 gives the signal a subtle, hollower quality, acting as an expander for the spectral envelope.

MFCCs are signed and can also be multiplied by negative values, which inverts the phase of the corresponding cosine component of the log-mel spectrum. The effect on the signal is hard to describe:

We can apply any MFCC envelope desired. Here’s a sine wave:
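The constant-gain cases have a simple closed form. Because the DCT is linear, multiplying every MFCC by a constant g multiplies the log-mel spectrum by g, which raises the mel spectrum itself to the power g. So g = 0 flattens the envelope to all ones (whitening), g = 2 squares it (expansion), and g = -1 replaces it with its reciprocal. A small sketch on a toy four-band mel spectrum, with a hypothetical helper `cepstral_eq`:

```python
import numpy as np
from scipy.fft import dct, idct

def cepstral_eq(mel, gains):
    """Scale the MFCCs of one mel-spectrum frame and return the new mel spectrum.

    `gains` may be a scalar or an array with one gain per coefficient.
    """
    mfcc = dct(np.log(mel), type=2, norm="ortho")
    return np.exp(idct(mfcc * gains, type=2, norm="ortho"))

mel = np.array([4.0, 1.0, 0.25, 2.0])  # toy mel spectrum, strictly positive

flat = cepstral_eq(mel, 0.0)       # equals np.ones(4): envelope removed
expanded = cepstral_eq(mel, 2.0)   # equals mel ** 2: peaks and dips exaggerated
inverted = cepstral_eq(mel, -1.0)  # equals 1 / mel: peaks become dips
```

Passing an array as `gains` gives a per-coefficient gain curve, which is where the behavior stops having such a tidy description.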

Cepstral frequency shifting

Technically this would be “quefrency shifting.” This cyclically rotates the MFCCs to brighten the signal:

And here’s the downward equivalent:
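In code, a cyclic rotation of the coefficients is a one-liner with `np.roll`. The sketch below is only an illustration; the function name `shift_mfcc` and the sign convention for `steps` are my own assumptions.

```python
import numpy as np

def shift_mfcc(mfcc, steps):
    """Cyclically rotate MFCCs along the last axis.

    Positive `steps` moves each coefficient to a higher index; coefficients
    that fall off the end wrap around to the start.
    """
    return np.roll(mfcc, steps, axis=-1)

mfcc = np.array([10.0, 3.0, -2.0, 0.5])
up = shift_mfcc(mfcc, 1)     # [0.5, 10.0, 3.0, -2.0]
down = shift_mfcc(mfcc, -1)  # [3.0, -2.0, 0.5, 10.0]
```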

Cepstral frequency scaling

Resampling the MFCCs sounds reminiscent of formant shifting. This is related to the time-scaling property of the Fourier transform: if you resample the spectrum, you’re also resampling the signal. Here’s upward scaling:

Here’s downward scaling:
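One way to resample a frame of MFCCs is linear interpolation along the coefficient index. This is a sketch under my own assumptions: the interpolation method, the zero fill past the last coefficient, and the name `scale_mfcc` are not taken from the post.

```python
import numpy as np

def scale_mfcc(mfcc, factor):
    """Resample a 1-D MFCC vector along the quefrency axis.

    factor > 1 stretches the coefficients toward higher indices,
    factor < 1 compresses them toward index 0 and zero-fills the rest.
    """
    n = mfcc.shape[-1]
    index = np.arange(n)
    source = index / factor  # position each output coefficient reads from
    return np.interp(source, index, mfcc, right=0.0)

mfcc = np.array([0.0, 2.0, 4.0, 6.0])
stretched = scale_mfcc(mfcc, 2.0)   # [0.0, 1.0, 2.0, 3.0]
squeezed = scale_mfcc(mfcc, 0.5)    # [0.0, 4.0, 0.0, 0.0]
```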

Cepstral time-based effects

Here’s what happens when we freeze the MFCCs every few frames:
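Freezing amounts to a sample-and-hold over the frame axis. A minimal sketch, assuming the MFCCs are stored as an array of shape (frames, coefficients) and using a hypothetical `freeze_mfcc` helper with a hold length in frames:

```python
import numpy as np

def freeze_mfcc(mfccs, hold):
    """Sample-and-hold: every block of `hold` frames reuses its first frame."""
    held = (np.arange(len(mfccs)) // hold) * hold
    return mfccs[held]

# Six frames of two coefficients each; frame t holds the values [t, 10 * t].
mfccs = np.stack([np.arange(6.0), 10.0 * np.arange(6.0)], axis=1)
frozen = freeze_mfcc(mfccs, 3)  # frames 0-2 copy frame 0, frames 3-5 copy frame 3
```

Only the MFCCs are held; the residual spectrum and phases of each frame would still update normally.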

Lowpass filtering the MFCCs over time tends to slur speech:
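The post does not say which lowpass filter is used. As an assumption for illustration, the sketch below runs a one-pole smoother over the frame axis, independently for each coefficient; `smooth_mfcc` and `coeff` are hypothetical names.

```python
import numpy as np

def smooth_mfcc(mfccs, coeff):
    """One-pole lowpass over time: y[t] = coeff * y[t-1] + (1 - coeff) * x[t].

    `mfccs` has shape (n_frames, n_coeffs); `coeff` in [0, 1) sets how slowly
    the MFCCs are allowed to move (closer to 1 means more smoothing).
    """
    out = np.empty_like(mfccs, dtype=float)
    state = mfccs[0].astype(float)  # start from the first frame, not from zero
    for t, frame in enumerate(mfccs):
        state = coeff * state + (1.0 - coeff) * frame
        out[t] = state
    return out

# A single coefficient that jumps from 0 to 1 at frame 2.
step = np.array([[0.0], [0.0], [1.0], [1.0]])
smoothed = smooth_mfcc(step, 0.5)  # [[0.0], [0.0], [0.5], [0.75]]
```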

Stray thoughts

I have barely scratched the surface of cepstral effects here, opting only to explore the most mathematically straightforward operations. That the MFCCs produce some very weird and very musical effects, even with such simple transformations, is encouraging.

In addition to playing with additional types of effects, it is also worthwhile to adjust the transforms being used. The DCT as the space for the spectral envelope could be improved on. One (strange) possibility that came to mind is messing with the Multiresolution Analysis of the mel spectrum; I have no idea if that would sound interesting or not, but it’s worth a shot.

It’s possible to bypass the MFCCs and just do the DCT of the log-spectrogram. I experimented with this and found that I couldn’t get it to sound as musical as the mel-based equivalent. I believe this is because the resolution of the FFT isn’t very perceptually salient. The mel scale is in fact doing a lot of heavy lifting here.