
Resource: "The Tube Screamer's Secret"

A few years ago I bookmarked Boğaç Topaktaş’ 2005 article titled “The Tube Screamer’s Secret,” but today I was dismayed to discover that the domain had expired. This means the page is now nearly impossible to find unless you already know the URL. I don’t normally make posts that are just a link to a third party, but this valuable resource might otherwise be forgotten. Here’s the page in the Wayback Machine:

https://web.archive.org/web/20180127031808/http://bteaudio.com/articles/TSS/TSS.html

Pulsar Synthesis

Curtis Roads’ book Microsound is a must-read for any nerdy music technologist. Despite its age, it contains many exciting techniques under the granular synthesis umbrella that still sound fresh and experimental. I recently started exploring one of these methods, called pulsar synthesis, and thought it’d be fun to write my own review of it and demonstrate how to accomplish it in SuperCollider or any similar environment.

Pulsar synthesis produces periodic oscillations by alternating between a short, arbitrary waveform called the pulsaret (after “wavelet”) and a span of silence. The reciprocal of the pulsaret’s length is the formant frequency, which is manipulated independently of the fundamental frequency by changing the speed of the pulsaret’s playback and adjusting the silence duration to maintain the fundamental. Roads calls this “pulsaret-width modulation” or PulWM. If the formant frequency dips below the fundamental frequency, the silence disappears and the pulsarets can either be truncated (regular PulWM) or overlapped (overlapped PulWM or OPulWM). Roads describes OPulWM as “a more subtle effect” due to phase cancellation.

The pulsaret signal can be anything. The book suggests the following, among others: a single sine wave cycle, multiple sine wave periods concatenated, and a bandlimited impulse (sinc function). Another option is to take the waveform from live input, transferring the timbre of the input over to a quasiperiodic signal and making the technique behave slightly more like an effect than a synth.
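To make the mechanics concrete, here is a rough NumPy sketch of basic (non-overlapped, truncating) PulWM – my own illustration rather than code from the book, with an arbitrary Hann-windowed sine as the pulsaret:

import numpy as np

SR = 48000

def pulsar_train(fundamental, formant, duration, sr=SR):
    # One pulsaret per fundamental period, followed by silence. The pulsaret here
    # is a single Hann-windowed sine cycle whose length is 1/formant; if the
    # formant drops below the fundamental, the pulsaret is truncated (basic PulWM).
    period = int(round(sr / fundamental))
    pulsaret_len = int(round(sr / formant))
    t = np.arange(pulsaret_len) / pulsaret_len
    pulsaret = np.sin(2 * np.pi * t) * np.hanning(pulsaret_len)
    out = np.zeros(int(duration * fundamental) * period)
    for start in range(0, len(out), period):
        chunk = pulsaret[:period]  # truncation only happens when formant < fundamental
        out[start:start + len(chunk)] = chunk
    return out

# 100 Hz fundamental; sweep the formant from 300 Hz up to 1500 Hz in 40 steps.
sig = np.concatenate([pulsar_train(100, f, 0.05) for f in np.linspace(300, 1500, 40)])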

Read more…

"Switching" in Procedural Generation

A friend recently told me of an idea he had: what if a video game employing procedural generation could legitimately surprise its own developer? This applies to music as well – I wish I had such moments more often in my algorithmic composition work.

Pondering this question made me think back to a recent video by producer Ned Rush: Melodic Phrase Generator in Ableton Live 11. In it, the artist starts with four C notes in the piano roll and uses a variety of MIDI filters to transform them into a shimmering pandiatonic texture with lots of variation. It’s worth a watch, but one moment that especially made me think is where he pulls out an arpeggiator and maps a random control signal to its “Style” setting (timestamp). Ableton Live allows mapping control signals to discrete popup menus as well as continuous knobs, so this causes the arpeggiator to switch randomly between 18 arpeggiation styles, including Up, Down, UpDown, Converge, Diverge, Pinky Up, Chord Trigger, Random, and more.

This was the most important part of Ned Rush’s patch: the styles individually sound good and distinct from one another, so switching between them randomly sounds both good and diverse. From this, we can gather a valuable technique for procedural generation: switching or collaging among many individual algorithms, each of which has considerable care put into it. Imagine a “synthesizer” (or level generator, or whatever) with many different switched modes, coated in layers of “effects,” each with its own switched modes. Even with simple, haphazard randomness controlling the modes, the system will explore numerous combinatorial varieties. If this wears itself out quickly, the modes can be grouped into categories, with switching within a category happening on a smaller time scale (or space scale) and switching between categories happening on a larger one.
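As a toy illustration of the time-scale idea (nothing from Ned Rush’s actual patch – the category and mode names here are made up), here’s a Python sketch that switches modes every few steps and categories less often:

import random

# Hypothetical mode hierarchy - the names are made up for illustration.
CATEGORIES = {
    "arps": ["up", "down", "updown", "converge"],
    "chords": ["triad", "cluster", "open voicing"],
    "noise": ["burst", "granular", "crackle"],
}

def mode_stream(n_steps, category_every=16, mode_every=4):
    # Categories switch on a larger time scale, modes within the current
    # category on a smaller one.
    category, mode = None, None
    for step in range(n_steps):
        if step % category_every == 0:
            category = random.choice(list(CATEGORIES))
        if step % mode_every == 0:
            mode = random.choice(CATEGORIES[category])
        yield category, mode

for step, (category, mode) in enumerate(mode_stream(32)):
    print(step, category, mode)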

Switching is a handy way to aid in achieving those elusive moments where a generative system does something truly unexpected. I recommend experimenting with it next time you play with algorithmic creativity.

Integer Ring Modulation

When I think of ring modulation – or multiplication of two bipolar audio signals – I usually think of a complex, polyphonic signal being ring modulated by an unrelated sine wave, producing an inharmonic effect. Indeed, this is what “ring modulator” means in many synthesizers’ effect racks. I associate it with early electronic music and frankly find it a little cheesy, so I don’t use it often.

But if both signals are periodic and their frequencies are small integer multiples of a common fundamental, the resulting sound is harmonic. Mathematically this is no surprise, but the timbres you can get out of this are pretty compelling.

I tend to get the best results from pulse waves, in which case ring modulation is identical to an XOR gate (plus an additional inversion). Here’s a 100 Hz square wave multiplied by a second square wave that steps from 100 Hz, 200 Hz, etc. to 2000 Hz and back.

As usual, here is SuperCollider code:

(
{
    var freq, snd;
    freq = 100;
    // Multiply a 100 Hz square wave by a second square wave whose frequency
    // steps through integer multiples of 100 Hz (1x up to 20x and back),
    // driven by a slow triangle LFO.
    snd = Pulse.ar(freq) * Pulse.ar(freq * LFTri.ar(0.3, 3).linlin(-1, 1, 1, 20).round);
    snd ! 2;
}.play(fadeTime: 0);
)

Try pulse-width modulation, slightly detuning oscillators for a beating effect, multiplying three or more oscillators, and filtering the oscillators prior to multiplication. There are applications here to synthesizing 1-bit music.

Credit goes to Sahy Uhns for showing me this one some years ago.

EDIT 2023-01-12: I have learned that Dave Rossum used this technique in Trident, calling it “zing modulation.” See this YouTube video.

Feedback Integrator Networks

Giorgio Sancristoforo’s recent software noise synth Bentō is a marvel of both creative DSP and graphic design. Its “Generators” particularly caught my eye. The manual states:

Bentō has two identical sound generators, these are not traditional oscillators, but rather these generators are models of analog computing patches that solve a differential equation. The Generators are designed to be very unstable and therefore, very alive!

There’s a little block diagram embedded in the synth’s graphics that gives us a good hint as to what’s going on.

A screenshot of one portion of Bento, showing a number of knobs and a block diagram with some integrators, adders, and amplifiers. The details of the block diagram are not important for the post, it's just inspiration.

We’re looking at a variant of a state variable filter (SVF), which consists of two integrators in series in a feedback loop with gain stages in between. Sancristoforo adds an additional feedback loop around the second integrator, and what appears to be an additional parallel path (not entirely sure what it’s doing). It’s not possible for chaotic behavior to happen in an SVF without a nonlinearity, so presumably there’s a clipper (or something else?) in the feedback loops.

While thinking about possible generalized forms of this structure, I realized that the signal flow around the integrators can be viewed as a 2 x 2 feedback matrix. A form of cross-modulation across multiple oscillators can be accomplished with a larger feedback matrix. I thought, why not make it like a feedback delay network in artificial reverberation, with integrators in place of delays? And so a new type of synthesis is born: the “feedback integrator network.”

An N x N feedback integrator network consists of N parallel signals that are passed through leaky integrators, then an N x N mixing matrix. First-order highpass filters are added to block DC, preventing the network from blowing up and getting stuck at -1 or +1. The highpass filters are followed by clippers. Finally, the clipper outputs are added back to the inputs with single-sample delays in the feedback path. I experimented with a few different orders in the signal chain, and the order presented here is the product of some trial and error. Here’s a block diagram of the 3 x 3 case:

A block diagram of a Feedback Integrator Network as described in the text.

A rich variety of sounds ranging from traditional oscillations to noisy, chaotic behavior results. Here are a few sound snippets of an 8 x 8 feedback integrator network, along with SuperCollider code. I’m using a completely randomized mixing matrix with values ranging from -1000 to 1000. Due to the nonlinearities, the scaling of the matrix is important to the sound. I’m driving the entire network with a single impulse on initialization, and I’ve panned the 8 parallel outputs across the stereo field.

// Run before booting the server.
Server.default.options.blockSize = 1;

(
{
    var snd, n;
    n = 8;
    // Excite the network with a single impulse at initialization.
    snd = Impulse.ar(0);
    // Single-sample feedback from the clipper outputs (hence blockSize = 1).
    snd = snd + LocalIn.ar(n);
    // Leaky integrators.
    snd = Integrator.ar(snd, 0.99);
    // Random n x n mixing matrix with entries in [-1000, 1000].
    snd = snd * ({ { Rand(-1, 1) * 1000 } ! n } ! n);
    snd = snd.sum;
    // Block DC, clip, and feed back into the network.
    snd = LeakDC.ar(snd);
    snd = snd.clip2;
    LocalOut.ar(snd);
    // Spread the eight channels across the stereo field.
    Splay.ar(snd) * 0.3;
}.play(fadeTime: 0);
)

The four snippets above were not curated – they were the first four timbres I got out of randomization.

There’s great fun to be had by modulating the feedback matrix. Here’s the result of using smooth random LFOs.

I’ve experimented with adding other elements in the feedback loop. Resonant filters sound quite interesting. I’ve found that if I add anything containing delays, the nice squealing high-frequency oscillations go away and the outcome sounds like distorted sludge. I’ve also tried different nonlinearities, but only clipping-like waveshapers sound any good to my ears.

It seems that removing the integrators entirely also generates interesting sounds! This could be called a “feedback filter network,” since we still retain the highpass filters. Even removing the highpass filters, resulting in effectively a single-sample nonlinear feedback delay network, generates some oscillations, although less interesting than those with filters embedded.

While using no input sounds interesting enough on its own, creating a digital relative of no-input mixing, you can also drive the network with an impulse train to create tonal sounds. Due to the nonlinearities involved, the feedback integrator network is sensitive to how hard the input signal is driven, and creates pleasant interactions with its intrinsic oscillations. Here’s the same 8 x 8 network driven by a 100 Hz impulse train, again with matrix modulation:

Further directions in this area could include designing a friendly interface. One could use a separate knob for each of the N^2 matrix coefficients, but that’s unwieldy. I have found that using a fixed random mixing matrix and modulated, independent gain controls for each of the N channels produces results just as diverse as modulating the entire mixing matrix. An interface could be made by supplying N unlabeled knobs (or likely fewer) and letting the user twist them for unpredictable and fun results.

OddVoices Dev Log 3: Pitch Contours

This is part of an ongoing series of posts about OddVoices, a singing synthesizer I’ve been building. OddVoices has a Web version, which you can now access at the newly registered domain oddvoices.org.

Unless we’re talking about pitch correction settings, the pitch of a human voice is generally not piecewise constant. A big part of any vocal style is pitch inflections, and I’m happy to say that these have been greatly improved in OddVoices based on studies of real pitch data. But first, we need…

Pitch detection

A robust and high-precision monophonic pitch detector is vital to OddVoices for two reasons: first, the input vocal database needs to be normalized in pitch during the PSOLA analysis process, and second, the experiments we conduct later in this blog post require such a pitch detector.

There’s probably tons of Python code out there for pitch detection, but I felt like writing my own implementation to learn a bit about the process. My requirements are that the pitch detector should work on speech signals, have high accuracy, be as immune to octave errors as possible, and not require an expensive GPU or a massive dataset to train. I don’t need real time capabilities (although reasonable speed is desirable), high background noise tolerance, or polyphonic operation.

I shopped around among a few different papers and spent long hours implementing different algorithms. I coded up the following:

  1. Cepstral analysis [Noll1966]

  2. Autocorrelation function (ACF) with prefiltering [Rabiner1977]

  3. Harmonic Product Spectrum (HPS)

  4. A simplified variant of Spectral Peak Analysis (SPA) [Dziubinski2004]

  5. Special Normalized Autocorrelation (SNAC) [McLeod2008]

  6. Fourier Approximation Method (FAM) [Kumaraswamy2015]

There are tons more algorithms out there, but these were the ones that caught my eye for some reason or another. All methods have their own upsides and downsides, and all of them are clever in their own ways. Some algorithms have parameters that can be tweaked, and I did my best to experiment with those parameters to try to maximize results for the test dataset.

I created a test dataset of 10000 random single-frame synthetic waveforms with fundamentals ranging from 60 Hz to 1000 Hz. Each one has harmonics ranging up to the Nyquist frequency, and the amplitudes of the harmonics are randomized and multiplied by \(1 / n\) where \(n\) is the harmonic number. Whether this is really representative of speech is not an easy question, but I figured it would be a good start.
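Here’s roughly how one of these test frames might be generated in NumPy. This is a sketch under my own assumptions – the 2048-sample frame length matches the detector described below, but the random phases in particular are my own choice:

import numpy as np

SR = 48000
FRAME = 2048  # same frame length as the detector described below

def synthetic_frame(rng):
    # One random harmonic frame: f0 in [60, 1000] Hz, harmonics up to Nyquist,
    # random amplitudes scaled by 1/n, and (an assumption) random phases.
    f0 = rng.uniform(60.0, 1000.0)
    t = np.arange(FRAME) / SR
    n = np.arange(1, int(SR / 2 / f0) + 1)[:, None]
    amps = rng.uniform(0.0, 1.0, len(n))[:, None] / n
    phases = rng.uniform(0.0, 2 * np.pi, len(n))[:, None]
    frame = np.sum(amps * np.sin(2 * np.pi * n * f0 * t + phases), axis=0)
    return f0, frame

rng = np.random.default_rng(0)
dataset = [synthetic_frame(rng) for _ in range(100)]  # the real test set used 10000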

I scored each algorithm by how many times it produced a pitch within a semitone of the actual fundamental frequency. We’ll address accuracy issues in a moment. The scores are:

Algorithm         Score
Cepstrum          9961/10000
SNAC              9941/10000
FAM               9919/10000
Simplified SPA    9789/10000
ACF               9739/10000
HPS               7743/10000

All the algorithms performed quite acceptably with the exception of the Harmonic Product Spectrum, which leads me to conclude that HPS is not really appropriate for pitch detection, although it does have other applications such as computing the chroma [Lee2006].

What surprised me most is that one of the simplest algorithms, cepstral analysis, also appears to be the best! Confusingly, a subjective study of seven pitch detection algorithms by McGonegal et al. [McGonegal1977] ranked the cepstrum as the 2nd worst. Go figure.

I hope this comparison was an interesting one in spite of how small and unscientific the study is. Keep in mind that it is always possible that I implemented one or more of the algorithms incorrectly, didn’t tweak them in the right way, or didn’t look into strategies for improving them.

The final algorithm

I arrived at the following algorithm by crossbreeding my favorite approaches:

  1. Compute the “modified cepstrum” as the absolute value of the IFFT of \(\log(1 + |X|)\), where \(X\) is the FFT of a 2048-sample input frame \(x\) at a sample rate of 48000 Hz. The input frame is not windowed – for whatever reason that worked better!

  2. Find the highest peak in the modified cepstrum whose quefrency is above a threshold derived from the maximum frequency we want to detect.

  3. Find all peaks that exceed 0.5 times the value of the highest peak.

  4. Find the peak closest to the last detected pitch, or if there is no last detected pitch, use the highest peak.

  5. Convert quefrency into frequency to get the initial estimate of pitch.

  6. Recompute the magnitude spectrum of \(x\), this time with a Hann window.

  7. Find the values of the three bins around the FFT peak at the estimated pitch.

  8. Use an artificial neural network (ANN) on the bin values to interpolate the exact frequency.

The idea of the modified cepstrum, i.e. adding 1 before taking the logarithm of the magnitude spectrum, is borrowed from Philip McLeod’s dissertation on SNAC, and prevents taking the logarithm of values too close to zero. The peak picking method is also taken from the same resource.
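For the curious, here’s a minimal NumPy sketch of steps 1 through 5 – not the actual OddVoices code; the peak-picking details and the 1000 Hz ceiling are my own simplifications:

import numpy as np

SR = 48000
FRAME = 2048
F_MAX = 1000.0  # assumed upper limit for detectable pitch

def cepstral_pitch(x, prev_pitch=None):
    # Rough pitch estimate from the modified cepstrum (steps 1-5 above).
    spectrum = np.fft.rfft(x)                        # unwindowed, as described
    cepstrum = np.abs(np.fft.irfft(np.log(1.0 + np.abs(spectrum))))
    min_quefrency = int(SR / F_MAX)                  # ignore quefrencies above F_MAX

    region = cepstrum[min_quefrency : FRAME // 2]
    # Peaks: samples greater than both neighbors.
    peaks = np.where((region[1:-1] > region[:-2]) & (region[1:-1] > region[2:]))[0] + 1
    if len(peaks) == 0:
        return None
    highest = peaks[np.argmax(region[peaks])]
    # Keep peaks that exceed 0.5 times the highest peak's value.
    candidates = peaks[region[peaks] > 0.5 * region[highest]]
    pitches = SR / (candidates + min_quefrency)
    if prev_pitch is None:
        return SR / (highest + min_quefrency)
    return pitches[np.argmin(np.abs(pitches - prev_pitch))]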

The use of an artificial neural network to refine the estimate is from the SPA paper [Dziubinski2004]. The ANN in question is a classic feedforward perceptron, and takes as input the magnitudes of three FFT bins around a peak, normalized so the center bin has an amplitude of 1.0. This means that the center bin’s amplitude is not needed and only two input neurons are necessary. Next, there is a hidden layer with four neurons and a tanh activation function, and finally an output layer with a single neuron and a linear activation function. The output format of the ANN ranges from -1 to +1 and indicates the offset of the sinusoidal frequency from the center bin, measured in bins.

The ANN is trained on a set of synthetic data similar to the test data described above. I used the MLPRegressor in scikit-learn, set to the default “adam” optimizer. The ANN works astonishingly well, yielding average errors less than 1 cent against my synthetic test set.
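Here’s a sketch of how such a network can be trained with scikit-learn. The way I generate training examples below (Hann-windowed sinusoids with a known fractional bin offset) is my guess at a reasonable setup, not necessarily what the SPA paper or OddVoices does:

import numpy as np
from sklearn.neural_network import MLPRegressor

SR = 48000
FRAME = 2048
rng = np.random.default_rng(0)

# Training data: windowed sinusoids with a known fractional bin offset.
X_train, y_train = [], []
for _ in range(5000):
    bin_index = rng.integers(10, 400)
    offset = rng.uniform(-0.5, 0.5)                  # target: offset in bins
    freq = (bin_index + offset) * SR / FRAME
    t = np.arange(FRAME) / SR
    frame = np.sin(2 * np.pi * freq * t + rng.uniform(0, 2 * np.pi)) * np.hanning(FRAME)
    mags = np.abs(np.fft.rfft(frame))
    three = mags[bin_index - 1 : bin_index + 2] / mags[bin_index]
    X_train.append([three[0], three[2]])             # center bin is always 1.0
    y_train.append(offset)

# Two inputs, one hidden layer of four tanh units, linear output;
# scikit-learn's default solver is "adam".
net = MLPRegressor(hidden_layer_sizes=(4,), activation="tanh", max_iter=2000)
net.fit(np.array(X_train), np.array(y_train))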

In spite of the efforts to find a nearly error-free pitch detector, the above algorithm still sometimes produces errors. Errors are identified as pitch data points that exceed a manually specified range. Errors are corrected by linearly interpolating the surrounding good data points.

Source code for the pitch detector is in need of some cleanup and is not yet publicly available as of this writing, but should be soon.

Vocal pitch contour phenomena

I’m sure the above was a bit dry for most readers, but now that we’re armed with an accurate pitch detector, we can study the following phenomena:

  1. Drift: low frequency noise from 0 to 6 Hz [Cook1996].

  2. Jitter: high frequency noise from 6 to 12 Hz.

  3. Vibrato: deliberate sinusoidal pitch variation.

  4. Portamento: lagging effect when changing notes.

  5. Overshoot: when moving from one pitch to another, the singer may extend beyond the target pitch and slide back into it [Lai2009].

  6. Preparation: when moving from one pitch to another, the singer may first move away from the target pitch before approaching it.

There is useful literature on most of these six phenomena, but I also wanted to gather my own data and do a little replication work. I had a gracious volunteer sing a number of melodies consisting of one or two notes, with and without vibrato, and I ran them through my pitch detector to determine the pitch contours.

Drift and jitter: In his study, Cook reported drift of roughly -50 dB and jitter at about -60 to -70 dB. Drift has a roughly flat spectrum and jitter has a sloping spectrum of around -8.5 dB per octave. My data is broadly consistent with these figures, as can be seen in the below spectra.

A diagram labeled "Measured drift and jitter, frequency domain" of four magnitude spectra with frequencies from 0 to 12 Hz. The spectra are quite complicated but a general downward slope is visible.

Drift and jitter are modeled as \(f \cdot (1 + x)\) where \(f\) is the static base frequency and \(x\) is the deviation signal. The dimensionless deviation \(x\) is treated as an amplitude and converted to decibels, and this is what is meant by drift and jitter having a decibel value.

Cook also notes that drift and jitter exhibit a small peak around the natural vibrato frequency, here around 3.5 Hz. Curiously, I don’t see any such peak in my data.

Synthesis can be done with interpolated value noise for drift and “clipped brown noise” for jitter, added together. Interpolated value noise is downsampled white noise with sine wave segment interpolation. Clipped brown noise is defined as a random walk that can’t exceed the range [-1, +1].
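A rough NumPy sketch of these two noise sources, with the control rate, noise rates, step size, and decibel levels chosen loosely to match the figures above rather than taken from OddVoices:

import numpy as np

rng = np.random.default_rng(0)

def interpolated_value_noise(n_samples, rate, sr):
    # Downsampled white noise with sine-segment interpolation between values.
    points = rng.uniform(-1.0, 1.0, int(n_samples * rate / sr) + 2)
    t = np.arange(n_samples) * rate / sr
    i = t.astype(int)
    fade = 0.5 - 0.5 * np.cos(np.pi * (t - i))  # sine-shaped crossfade
    return points[i] * (1 - fade) + points[i + 1] * fade

def clipped_brown_noise(n_samples, step=0.1):
    # Random walk confined to [-1, +1].
    out = np.zeros(n_samples)
    value = 0.0
    for i in range(n_samples):
        value = np.clip(value + rng.uniform(-step, step), -1.0, 1.0)
        out[i] = value
    return out

sr = 200                       # control rate for the pitch contour
n = 4 * sr                     # four seconds
drift = 10 ** (-50 / 20) * interpolated_value_noise(n, 3.0, sr)   # roughly -50 dB
jitter = 10 ** (-65 / 20) * clipped_brown_noise(n)                # roughly -65 dB
contour_hz = 220.0 * (1 + drift + jitter)                         # f * (1 + x)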

Vibrato is, not surprisingly, a sine wave LFO. However, a perfect sine wave sounds pretty unrealistic. Based on visual inspection of vibrato data, I multiplied the sine wave by random amplitude modulation with interpolated value noise. The frequency of the interpolated value noise is the same as the vibrato frequency.

A plot labeled "Measured vibrato, time domain." The X-axis is a time span of four seconds, and the Y-axis is frequency. The vibrato is quite smooth, with a peak-to-peak amplitude of about 10 Hz and a frequency of about 4 Hz, but the peaks and troughs vary.

A plot labeled "Synthetic vibrato, time domain." Again, the X-axis is a time span of four seconds, and the Y-axis is frequency. It looks similar to the measured plot but a bit smoother.

Also note that vibrato takes a moment to kick in, which is simple enough to emulate with a little envelope at the beginning of each note.
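Putting the vibrato pieces together, here’s a small self-contained sketch. The rates, depths, and onset time are eyeballed from the plots, and I use linear interpolation for the amplitude noise for brevity:

import numpy as np

rng = np.random.default_rng(1)
sr, dur = 200, 4.0                       # control rate and length in seconds
n = int(sr * dur)
t = np.arange(n) / sr

vibrato_hz = 4.0
# Amplitude modulation from value noise running at the vibrato rate (linearly
# interpolated here for brevity; the text uses sine-segment interpolation).
points = rng.uniform(0.7, 1.3, int(dur * vibrato_hz) + 2)
depth_mod = np.interp(t * vibrato_hz, np.arange(len(points)), points)
onset = np.clip(t / 0.6, 0.0, 1.0)       # vibrato fades in over ~0.6 seconds
vibrato = 0.02 * onset * depth_mod * np.sin(2 * np.pi * vibrato_hz * t)
contour_hz = 220.0 * (1 + vibrato)       # roughly matching the plotted depth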

Portamento, overshoot, and preparation I couldn’t find much research on, so I sought to collect a good amount of data on them. I asked the singer to perform two-note melodies consisting of ascending and descending m2, m3, P4, P5, and P8, each four times, with instructions to use “natural portamento.” I then ran all the results through the pitch tracker and visually measured rough averages of preparation time, preparation amount, portamento time, overshoot time, and overshoot amount. Here’s the table of my results.

Interval        Prep. time (s)   Prep. amount (semitones)   Port. time (s)   Over. time (s)   Over. amount (semitones)
m3 ascending    0.1              0.7                        0.15             0.2              0.5
m3 descending   no preparation                              0.1              0.3              1
P4 ascending    no preparation                              0.1              0.3              0.5
P4 descending   no preparation                              0.2              0.2              1
P5 ascending    0.1              0.5                        0.2              no overshoot
P5 descending   no preparation                              0.2              0.1              1
P8 ascending    0.1              1                          0.25             no overshoot
P8 descending   no preparation                              0.15             0.1              1.5

As one might expect, portamento time gently increases as the interval gets larger. There is no preparation for downward intervals, and spotty overshoot for upward intervals, both of which make some sense physiologically – you’re much more likely to involuntarily relax in pitch rather than tense up. Overshoot and preparation amounts have a slight upward trend with interval size. The overshoot time seems to have a downward trend, but overshoot measurement is pretty unreliable.

Worth noting is the actual shape of overshoot and preparation.

A plot titled "Portamento, ascending m3" showing four measured pitch signals. The X-axis is time and the Y-axis is frequency measured in MIDI note, which starts at about 52 and jumps to about 55. Preparation is only clearly visible for two of the signals, but it's about 70 cents.

A plot titled "Portamento, descending m3" showing four measured pitch signals. The X-axis is time and the Y-axis is frequency measured in MIDI note, which starts at about 55 and jumps to about 52. Overshoot is clearly visible for three of the four signals, nearly a semitone and a half.

In OddVoices, I model these three pitch phenomena by using quarter-sine-wave segments, and assuming no overshoot when ascending and no preparation when descending.
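Here’s a sketch of what such a quarter-sine model can look like, with default timings loosely following the ascending m3 row of the table. It’s an illustration of the idea, not OddVoices’ actual implementation:

import numpy as np

def note_transition(start_midi, end_midi, sr=200, prep_time=0.1, prep_amount=0.7,
                    port_time=0.15, over_time=0.2, over_amount=0.5):
    # Pitch contour (in MIDI note numbers) for a note change, built from three
    # quarter-sine segments: preparation away from the target, portamento toward
    # it, and overshoot past it before settling. Times in seconds, amounts in
    # semitones. Ascending notes get no overshoot; descending notes get no
    # preparation, as described above.
    direction = 1.0 if end_midi > start_midi else -1.0
    prep = prep_amount if direction > 0 else 0.0
    over = over_amount if direction < 0 else 0.0

    def quarter_sine(a, b, seconds):
        t = np.linspace(0.0, np.pi / 2, max(int(seconds * sr), 1))
        return a + (b - a) * np.sin(t)

    return np.concatenate([
        quarter_sine(start_midi, start_midi - direction * prep, prep_time),
        quarter_sine(start_midi - direction * prep, end_midi + direction * over, port_time),
        quarter_sine(end_midi + direction * over, end_midi, over_time),
    ])

contour = note_transition(52, 55)   # ascending m3, like the first plot above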

Further updates

Pitch detection and pitch contours consumed most of my time and energy recently, but there are a few other updates too.

As mentioned earlier, I registered the domain oddvoices.org, which currently hosts a copy of the OddVoices Web interface. The Web interface itself looks a little bland – I’d even say unprofessional – so I have plans to overhaul it especially as new parameters are on the way.

The README has been heavily updated, taking inspiration from the article “Art of README”. I tried to keep it concise and prioritize information that a casual reader would want to know.

References

[Noll1966] Noll, A. Michael. 1966. “Cepstrum Pitch Determination.”

[Rabiner1977] Rabiner, L. 1977. “On the Use of Autocorrelation Analysis for Pitch Detection.”

[Dziubinski2004] Dziubinski, M. and Kostek, B. 2004. “High Accuracy and Octave Error Immune Pitch Detection Algorithms.”

[McLeod2008] McLeod, Philip. 2008. “Fast, Accurate Pitch Detection Tools for Music Analysis.”

[Kumaraswamy2015] Kumaraswamy, B. and Poonacha, P. G. 2015. “Improved Pitch Detection Using Fourier Approximation Method.”

[Cook1996] Cook, P. R. 1996. “Identification of Control Parameters in an Articulatory Vocal Tract Model with Applications to the Synthesis of Singing.”

[Lai2009] Lai, Wen-Hsing. 2009. “An F0 Contour Fitting Model for Singing Synthesis.”

[Lee2006] Lee, Kyogu. 2006. “Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile.”

[McGonegal1977] McGonegal, Carol A. et al. 1977. “A Subjective Evaluation of Pitch Detection Methods Using LPC Synthesized Speech.”

OddVoices Dev Log 2: Phase and Volume

This is the second in an ongoing series of dev updates about OddVoices, a singing synthesizer I’ve been developing over the past year. Since we last checked in, I’ve released version 0.0.1. Here are some of the major changes.

New voice!

Exciting news: OddVoices now has a third voice. To recap, we’ve had Quake Chesnokov, a powerful and dark basso profondo, and Cicada Lumen, a bright and almost synth-like baritone. The newest voice joining us is Air Navier (nav-YEH), a soft, breathy alto. Air Navier makes a lovely contrast to the two more classical voices, and I’m imagining it will fit in great in a pop or indie rock track.

Goodbye GitHub

OddVoices makes copious use of Git LFS to store original recordings for voices, and this caused some problems for me this past week. GitHub’s free tier caps both Git LFS storage and monthly download bandwidth at 1 gigabyte. It is possible to pay $5 to add 50 GB to both the storage and bandwidth limits. These purchases are “data packs” and are orthogonal to GitHub Pro.

What’s unfortunate is that all downloads by anyone (including those on forks) count against the monthly download bandwidth, and even worse, so do downloads from GitHub Actions. I easily run CI dozens of times per week, and multiplied by the gigabyte or so of audio data, the quota is quickly exhausted.

A free GitLab account has a much more workable storage limit of 10 GB, and claims unlimited bandwidth for now. GitLab it is. Consider this a word of warning for anyone making serious use of Git LFS together with GitHub, and especially GitHub Actions.

Goodbye MBR-PSOLA

OddVoices, taking after speech synthesizers of the 90’s, is based on concatenation of recorded segments. These segments are processed using PSOLA, which turns them into a sequence of frames (grains), each for one pitch period. PSOLA then allows manipulation of the segment in pitch, time, and formants, and sounds pretty clean. The synthesis component is also computationally efficient.

One challenge with a concatenative synthesizer is making the segments blend together nicely. We are using a crossfade, but a problem arises – if the phases of the overlapping frames don’t approximately match, then unnatural croaks and “doubling” artifacts happen.

There is a way to solve this: manually. If one lines up the locations of the frames so they are centered on the exact times when the vocal folds close (the so-called “glottal closure instant” or GCI), the phases will match. Since it’s difficult to find the GCI from a microphone signal, an electroglottograph (EGG) setup is typically used. I don’t have an EGG on hand, and I’m working remotely with singers, so this solution has to be ruled out.

A less daunting solution is to use FFT processing to make all phases zero, or set every frame to minimum phase. These solve the phase mismatch problem but sound overtly robotic and buzzy. (Forrest Mozer’s TSI S14001A speech synthesis IC, memorialized in chipspeech’s Otto Mozer, uses the zero phase method – see US4214125A.) MBR-PSOLA softens the blows of these methods by using a random set of phases that are fixed throughout the voice database. Dutoit recommends only randomizing the lower end of the spectrum while leaving the highs untouched. It sounds pretty good, but there is still an unnatural hollow and phasey quality to it.

I decided to search around the literature and see if there’s any way OddVoices can improve on MBR-PSOLA. I found [Stylianou2001], which seems to fit the bill. It recommends computing the “center” of a grain, then offsetting the frame so it is centered on that point. The center is not the exact same as the GCI, but it acts as a useful stand-in. When all grains are aligned on their centers, their phases should be roughly matched too – and all this happens without modifying the timbre of the voice, since all we’re doing is a time offset.

I tried this on the Cicada voice, and it worked! I didn’t conduct any formal listening experiment, but it definitely sounded clearer and lacking the weird hollowness of the MBROLA voice. Then I tried it on the Quake voice, and it sounded extremely creaky and hoarse. This is the result of instabilities in the algorithm, producing random timing offsets for each grain.

Frame adjustment

Let \(x[t]\) be a sampled quasiperiodic voice signal with period \(T\), with a sample rate of \(f_s\). We round \(T\) to an integer, which works well enough for our application. Let \(w[t]\) be a window function (I use a Hann window) of length \(2T\). Brackets are zero-indexed, because we are sensible people here.

The PSOLA algorithm divides \(x\) into a number of frames of length \(2T\), where the \(n\)-th frame is given by \(s_n[t] = w[t] x[t + nT]\).

Stylianou proposes the “differentiated phase spectrum” center, or DPS center, which is computed like so:

\begin{equation*} \eta = \frac{T}{2\pi} \arg \sum_{t = -T}^{T - 1} s_n^2[t] e^{2 \pi j t / T} \end{equation*}

\(\eta\) is here expressed in samples. The DPS center is not the GCI. It’s… something else, and it’s admitted in the paper that it isn’t well defined. However, it is claimed that it will be close enough to the GCI, hopefully by a near-constant offset. To normalize a frame on its DPS center, we recalculate the frame with an offset of \(\eta\): \(s'_n[t] = w[t] x[t + nT + \text{round}(\eta)]\).
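As a sanity check on my reading of the formula, here’s a small NumPy sketch that computes \(\eta\) for a toy windowed pulse and a shifted copy of it:

import numpy as np

def dps_center(frame, period):
    # DPS center of a PSOLA frame of length 2*period, in samples: T/(2*pi) times
    # the phase of the squared frame's component at the fundamental 1/T.
    t = np.arange(-period, period)
    return period / (2 * np.pi) * np.angle(np.sum(frame ** 2 * np.exp(2j * np.pi * t / period)))

# Toy check: a windowed pulse-like frame, and a copy whose pulse is shifted by
# 20 samples. The second center comes out roughly 20 samples later.
period = 200
t = np.arange(-period, period)
window = np.hanning(2 * period)
frame_a = window * np.cos(2 * np.pi * t / period) ** 8
frame_b = window * np.cos(2 * np.pi * (t - 20) / period) ** 8
print(dps_center(frame_a, period), dps_center(frame_b, period))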

The paper also discusses the center of gravity of a signal as a center close to the GCI. However, the center of gravity is less robust than the DPS center, as it can be shown that the center can be computed from just a single bin in the discrete Fourier transform, whereas the DPS center involves the entire spectrum.

Here’s where we go beyond the paper. As discussed above, for certain signals \(\eta\) can be noisy, and using this algorithm as-is can result in audible jitter in the result. The goal, then, is to find a way to remove noise from \(\eta\).

After many hours of experimenting with different solutions, I ended up doing a lowpass filter on \(\eta\) to remove high-frequency noise. A caveat is that \(\eta\) is a circular value that wraps around with period \(T\), and performing a standard lowpass filter will smooth out discontinuities produced by wrapping, which is not what we want. The trick is to use an encoding common in circular statistics, and especially in machine learning: convert it to sine and cosine, perform filtering on both signals, and convert it back with atan2. A rectangular FIR filter worked perfectly well for my application.
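A minimal sketch of that circular smoothing trick – the rectangular kernel length here is an arbitrary choice:

import numpy as np

def circular_smooth(eta, period, kernel_size=9):
    # Lowpass a signal that wraps around with the given period: filter its sine
    # and cosine encodings, then convert back with atan2.
    angle = 2 * np.pi * np.asarray(eta) / period
    kernel = np.ones(kernel_size) / kernel_size     # rectangular FIR
    s = np.convolve(np.sin(angle), kernel, mode="same")
    c = np.convolve(np.cos(angle), kernel, mode="same")
    return np.arctan2(s, c) * period / (2 * np.pi)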

Overall the result sounds pretty good. There are still some minor issues with it, but I hope to iron those out in future versions.

Volume normalization

I encountered two separate but related issues regarding the volume of the voices. The first is that the voices are inconsistent in volume – Cicada was much louder than the other two. The second, and the more serious of the two, is that segments can have different volumes when they are joined, and this results in a “choppy” sound with discontinuities.

I fixed global volume inconsistency by taking the RMS amplitude of the entire segment database and normalizing it to -20 dBFS. For voices with higher dynamic range, this caused some of the louder consonants to clip, so I added a safety limiter that ensures the peak amplitude of each frame is no greater than -6 dBFS.

Segment-level volume inconsistency can be addressed by examining diphones that join together and adjusting their amplitudes accordingly. Take the phoneme /k/, and gather a list of all diphones of the form k* and *k. Now inspect the amplitudes at the beginning of k* diphones, and the amplitudes at the end of *k diphones. Take the RMS of all these amplitudes together to form the “phoneme amplitude.” Repeat for all other phonemes. Then, for each diphone, apply a linear amplitude envelope so that the beginning frames match the first phoneme’s amplitude and the ending frames match the second phoneme’s amplitude. The result is that all joined diphones will have a matched amplitude.
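Here’s a sketch of that matching procedure. The data layout (a dict of diphone name to audio), the one-character phoneme symbols, and the choice of a quarter of each diphone as its “beginning” and “end” are all simplifying assumptions of mine:

import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def phoneme_amplitudes(diphones):
    # diphones: dict of diphone name -> audio array, assuming one-character
    # phoneme symbols for simplicity. Pool the start amplitudes of diphones that
    # begin with a phoneme and the end amplitudes of diphones that end with it,
    # then take the RMS of the pool.
    pools = {}
    for name, audio in diphones.items():
        head, tail = audio[: len(audio) // 4], audio[-(len(audio) // 4):]
        pools.setdefault(name[0], []).append(rms(head))
        pools.setdefault(name[1], []).append(rms(tail))
    return {p: np.sqrt(np.mean(np.square(v))) for p, v in pools.items()}

def match_volume(name, audio, targets):
    # Linear amplitude envelope so the diphone starts at its first phoneme's
    # amplitude and ends at its second phoneme's amplitude.
    head, tail = audio[: len(audio) // 4], audio[-(len(audio) // 4):]
    start_gain = targets[name[0]] / max(rms(head), 1e-9)
    end_gain = targets[name[1]] / max(rms(tail), 1e-9)
    return audio * np.linspace(start_gain, end_gain, len(audio))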

Conclusion

The volume normalization problem in particular taught me that developing a practical speech or singing synthesizer requires a lot more work than papers and textbooks might make you think. Rather, the descriptions in the literature are only baselines for a real system.

More is on the way for OddVoices. I haven’t yet planned out the 0.0.2 release, but my hope is to work on refining the existing voices for intelligibility and naturalness instead of adding new ones.

References

[Stylianou2001] Stylianou, Yannis. 2001. “Removing Linear Phase Mismatches in Concatenative Speech Synthesis.”

OddVoices Dev Log 1: Hello World!

The free and open source singing synthesizer landscape has a few projects worth checking out, such as Sinsy, eCantorix, meSing, and MAGE. While each one has its own unique voice and there’s no such thing as a bad speech or singing synthesizer, I looked into all of them and more and couldn’t find a satisfactory one for my musical needs.

So, I’m happy to announce OddVoices, my own free and open source singing synthesizer based on diphone concatenation. It comes with two English voices, with more on the way. If you’re not some kind of nerd who uses the command line, check out OddVoices Web, a Web interface I built for it with WebAssembly. Just upload a MIDI file and write some lyrics and you’ll have a WAV file in your browser.

OddVoices is based on MBR-PSOLA, which stands for Multi-Band Resynthesis Pitch Synchronous Overlap Add. PSOLA is a granular synthesis-based algorithm for playback of monophonic sounds such that the time, formant, and pitch axes can be manipulated independently. The MBR part is a slight modification to PSOLA that prevents unwanted phase cancellation when crossfading between concatenated samples, and solves other problems too. For more detail, check out papers from the MBROLA project. The MBROLA codebase itself has some tech and licensing issues I won’t get into, but the algorithm is perfect for what I want in a singing synth. Note that OddVoices doesn’t interface with MBROLA.

I’ll use this post to discuss some of the more interesting challenges I had to work on in the course of the project so far. This is the first in a series of posts I will be making about the technical side of OddVoices.

Vowel mergers

OddVoices currently only supports General American English (GA), or more specifically the varieties of English that I and the singers speak. I hope in the future that I can correct this bias by including other languages and other dialects of English.

When assembling the list of phonemes, the cot-caught merger immediately came up. I decided to merge them, and make /O/ and /A/ aliases except for /Or/ and /Ar/ (here using X-SAMPA). To reduce the number of phonemes and therefore phoneme combinations, I represent /Or/ internally as /oUr/.

A more interesting merger concerns the problem of the schwa. In English, the schwa is used to represent an unstressed syllable, but the actual phonetics of that syllable can vary wildly. In singing, a syllable that would be unstressed in spoken English can be drawn out for multiple seconds and become stressed. The schwa isn’t actually sung in these cases, and is replaced with another phoneme. As one of the singers put it, “the schwa is a big lie.”

This matters when working with the CMU Pronouncing Dictionary, which I’m using for pronouncing text. Take a word like “imitate” – the second syllable is unstressed, and the CMUDict transcribes it as a schwa. But when sung, it’s more like /I/. This is simply a limitation of the CMUDict that I don’t have a good solution for. In the end I merge /@/ with /V/, since the two are closely related in GA. Similarly, /3`/ and /@`/ are merged, and the CMUDict doesn’t even distinguish those.

Real-time vs. semi-real-time operation

A special advantage of OddVoices over alternative offerings is that it’s built from scratch to work in real time. That means that it can become a UGen for platforms like SuperCollider and Pure Data, or even a VST plugin in the far future. I have a SuperCollider UGen in the works, but there’s some tricky engineering work involving communication between RT and NRT threads that I haven’t tackled yet. Stay tuned.

There is a huge caveat to real time operation: singers don’t operate in perfect real time! To see why, imagine the lyrics “rice cake,” sung with two half notes. The final /s/ in “rice” has to happen before the initial /k/ in “cake,” and the latter happens right on the third beat, so the singer has to anticipate the third beat with the consonant /s/. But in MIDI and real-time keyboard playing, there is no way to predict when the note off will happen until the third beat has already arrived.

VOCALOID handles this by being its own DAW with a built-in sequencer, so it can look ahead as much as it needs. chipspeech and Alter/Ego work in real time. In their user guides, they ask the user to shorten every MIDI note to around 50%-75% of its length to accommodate final consonant clusters. If this is not done, a phenomenon I call “lyric drift” happens and the lyrics misalign from the notes.

OddVoices supports two possible modes: true real-time mode and semi-real-time mode. In true real-time mode, we don’t know the durations of notes, so we trigger the final consonant cluster on a note off. Like chipspeech and Alter/Ego, this requires manual shortening of notes to prevent lyric drift. Alternatively, OddVoices supports a semi-real-time mode where every note on is accompanied by the duration of the note. This way OddVoices can predict the timing of the final consonant cluster, but still operate in otherwise real-time.

Semi-real-time mode is used in OddVoices’ MIDI frontend, and can also be used in powerful sequencing environments like SC and Pd by sending a “note length” signal along with the note on trigger. I think it’s a nice compromise between the constraints of real-time and the omniscience of non-real-time.

Syllable compression

After I implemented semi-real-time mode, another problem remained that reared its head in fast singing passages. This happens when, say, the lyric “rice cake” is sung very quickly, and the diphones _r raI aIs (here using X-SAMPA notation), when concatenated, will be longer than the note length. The result is more lyric drift – the notes and the lyrics diverge.

The fix for this was to peek ahead in the diphone queue and find the end of the final consonant cluster, then add up all the segment lengths from the beginning to that point. This is how long the entire syllable would last. This is then compared to the note length, and if it is longer, the playback speed is increased for that syllable to compensate. In short, consonants have to be spoken quickly in order to fit in quickly sung passages.
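In pseudocode-ish Python, the rule boils down to something like this (a sketch of the idea, not the actual OddVoices code):

def syllable_speed(segment_lengths, note_length):
    # segment_lengths: durations (in seconds) of the queued diphones up to the
    # end of the final consonant cluster; note_length: the duration supplied
    # with the note on in semi-real-time mode.
    natural_length = sum(segment_lengths)
    if natural_length <= note_length:
        return 1.0                        # the syllable fits; play at normal speed
    return natural_length / note_length   # too long; speed up to fit

# e.g. 0.5 s of diphones squeezed into a 0.3 s note plays about 1.67x faster
print(syllable_speed([0.10, 0.25, 0.15], 0.3))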

The result is still not entirely satisfactory to my ears, and I plan to improve it in future versions of the software. Syllable compression is of course only available in semi-real-time mode.

Syllable compression is evidence that fast singing is phonetically quite different from slow singing, and perhaps more comparable to speech.

Stray thoughts

This is my second time using Emscripten and WebAssembly in a project, and I find it an overall pleasant technology to work with (especially with embind for C++ bindings). I did run into an obstacle, however, which was that I couldn’t figure out how to compile libsndfile to WASM. The only feature I needed was writing a 16-bit mono WAV file, so I dropped libsndfile and wrote my own code for that.

I was surprised by the compactness of this project so far. The real-time C++ code adds up to 1,400 lines, and the Python offline analysis code only 600.

Hearing Graphs

My latest project is titled Hearing Graphs, and you can access it by clicking on the image below.

A screenshot of the project, with an undirected graph to the left and music notation to the right.

Hearing Graphs is a sonification of graphs using the graph spectrum – the eigenvalues of the graph’s adjacency matrix. Both negative and positive eigenvalues are represented by piano samples, and zero eigenvalues are interpreted with a bass drum. The multiplicity of each eigenvalue is represented by hitting notes multiple times.
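Computing a graph spectrum is a one-liner with NumPy. Here’s a sketch on a made-up example graph (a four-node cycle), printing each distinct eigenvalue with its multiplicity the way the sonification treats them:

import numpy as np

# Adjacency matrix of a made-up example graph (a four-node cycle).
A = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])

# The graph spectrum: eigenvalues of the symmetric adjacency matrix.
eigenvalues = np.linalg.eigvalsh(A)

# Group nearly equal eigenvalues to find multiplicities.
values, counts = np.unique(np.round(eigenvalues, 6), return_counts=True)
for value, count in zip(values, counts):
    sound = "bass drum" if abs(value) < 1e-6 else "piano"
    print(f"eigenvalue {value:+.3f}, multiplicity {count}: {sound} x{count}")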

Most sonifications establish some audible relationship between the source material and the resulting audio. This one remains mostly incomprehensible, especially to a general audience, so I consider it a failed experiment in that regard. Still, it was fun to make.

Moisture Bass

If you haven’t heard of the YouTube channel Bunting, it gets my strong recommendation. Bunting creates excellent style imitations of experimental bass music artists and breaks them down with succinct explanations. Notable is his minimal tooling: he uses mostly Ableton Live stock plugins and the free and open source wavetable synth Vital.

His latest tutorial, mimicking the style of the artist Resonant Language, contains several bass sounds with a property he calls “moisture” (timestamp). These bass sounds are created by starting with a low saw wave, boosting the highs, and running the result through Ableton Live’s vocoder set on “Modulator” mode. According to the manual, this enables self-vocoding, where the same signal is the modulator and carrier. An abstract view of a vocoder would suggest that this does little or nothing to the saw wave other than change its spectral tilt, but the reality is much more interesting. Hear for yourself an EQ’d saw wave before and after self-vocoding:

A closer inspection of the latter waveform shows why the self-vocoded saw sounds the way it does. Here’s a single pitch period:

A single pitch period of the second audio. It is mostly smooth but contains some very spiky oscillations in one portion of the waveform. The oscillations have a sudden onset and rapidly slow down.

The discontinuity in the saw signal is decorated with a chirp, or a sine wave that rapidly descends in frequency. This little 909 kick drum every pitch period is responsible for the “moisture” sound. Certainly there have been no studies on the psychoacoustics of moisture bass (for lack of a better term), but I suspect that it mimics dispersive behavior, lending a vaguely acoustic sound.

The chirp originates from the bandpass filters in the vocoder. The frequencies of the vocoder are exponentially spaced, so the bandpass filters have to increase in bandwidth for higher frequencies to cover the gaps. Larger bandwidth means lower Q, and lower Q reduces the ring time in the filter’s impulse response. The result is that low frequencies ring longer when the vocoder is pinged and high frequencies ring shorter. Mix them all together, and you have an impulse response resembling a chirp.
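To see why the bank’s impulse response looks like a chirp, here’s a rough NumPy sketch that approximates each bandpass filter with an exponentially decaying sine whose ring time shrinks as the center frequency rises. The band layout and Q are arbitrary choices of mine, not Ableton’s:

import numpy as np

SR = 44100
t = np.arange(int(0.05 * SR)) / SR           # 50 ms of impulse response

# Exponentially spaced band centers, roughly constant-Q.
centers = np.geomspace(100.0, 8000.0, 16)
Q = 8.0

# Each band is approximated by a decaying sine whose envelope falls off as
# exp(-pi * bandwidth * t), with bandwidth = f / Q, so high bands ring shorter.
# Summing the bands gives a bank impulse response that sweeps downward: a chirp.
impulse_response = sum(
    np.sin(2 * np.pi * f * t) * np.exp(-np.pi * (f / Q) * t) for f in centers
)
impulse_response /= np.max(np.abs(impulse_response))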

Self-vocoding with exponentially spaced bands is clever, but it isn’t the only way to create this effect. One option is to eliminate the vocoding part and use only the exponentially spaced bandpass filters, like an old-school filter bank. This sounds just like self-vocoding but requires fewer bandpass filters to work. In my experiments, I found that putting the resulting signal through nonlinear distortion is necessary to bring out the moisture property.

A more direct approach is to use wavefolding on a curved signal. The slope of the input signal controls the rate that it scrubs through wavefolding function, and thus controls the frequency of the resulting triangle wave. By modulating the slope from high in absolute value down to zero, a triangle wave descending in frequency is created. This is best explained visually:

A graph of two functions. One is labeled "curved signal," looking like a downward saw wave but with decreasing slope as it reaches a trough. The other is labeled "wavefolder output," showing the result of wavefolding on the curved signal, which displays triangle wave oscillations that onset at each pitch period and quickly decrease in frequency.
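For the numerically inclined, here’s a rough NumPy sketch of the same idea, with a deliberately exaggerated curve and a simple triangle-wave folder (my own construction, not Bunting’s patch):

import numpy as np

SR = 44100
f0 = 55.0
t = np.arange(int(SR / f0)) / SR             # one pitch period

def wavefold(x, threshold=1.0):
    # Reflect the signal back into [-threshold, threshold] whenever it crosses
    # the boundary (a simple triangle-wave folder).
    u = np.mod(x + threshold, 4 * threshold)
    return threshold - np.abs(u - 2 * threshold)

# A "curved" downward ramp: steep at the start of the period and flattening
# toward the trough, so the folder's output starts as a fast triangle wave and
# slows down over the period - a little chirp.
phase = t * f0
curved = 1 - 2 * np.sqrt(phase)
folded = wavefold(8.0 * curved)              # drive it hard so it folds many times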

And here’s how it sounds: