Nathan Ho (Posts about projects)

Nathan Ho - Haywire Frontier

Nathan Ho — Sun, 03 Sep 2023 00:57:37 GMT

My first full-length solo album, Haywire Frontier, is releasing on Saturday, September 9th on the Japanese label Tokinogake. It is available for preorder now, and you can listen to the opening track “Trickster Deity.”

Here are the liner notes:

In 2008, at the age of 11, I created Googology Wiki on my parents’ computer. “Googology” is word for the study of large numbers and fast-growing functions, deriving from the 9-year-old Milton Sirotta’s coinage of the term “googol.” The website was never meant to go beyond my personal use, and I gradually drifted away from it. Fifteen years later, it has grown to tens of thousands of articles and a community of hundreds of active users.

Haywire Frontier is a 40-minute musical tribute to a strange corner of amateur mathematics whose growth I somewhat-inadvertently catalyzed, with rhythmic and formal material deriving from Georg Cantor’s “ordinal number” system, integral to the study of large numbers.

The album was sequenced and synthesized entirely in SuperCollider with no samples, external hardware, or third-party plugins.

Credits:

John Tejada, mastering
Isa Hanssen (Instagram), cover art
Special thanks to a0n0, Charlie Burgin (Sahy Uhns), William Fields, RM Francis, Joonas Siren (Forces), Ben Tillotson, Nathan Turczan.

I expect to write about this project in the near future. Thank you for listening, and for all your support.

Audio Texture Resynthesis

Nathan Ho — Tue, 25 Apr 2023 19:58:19 GMT

Left: spectrogram of a child singing. Right: spectrogram of resynthesized audio.

Background

I was alerted to audio texture resynthesis methods by a student of mine who was interested in the collaborative work of researcher Vincent Lostanlen, musician Florian Hecker, and several others [Lostanlen2019] [Lostanlen2021] [Andén2019] [Muradeli2022]. Their efforts are built on an analysis method called “Joint Time-Frequency Scattering” (JTFS) based on the Continuous Wavelet Transform. In an attempt to understand the work better, I binged a wavelet transform textbook, [1] implemented a simplified version of JTFS-based resynthesis, and and briefly exchanged emails with Lostanlen. His helpful answers gave me the impression is that while JTFS is a powerful analysis technique, resynthesis was more of a side project and there are ways to accomplish similar effects that are more efficient and easier to code without compromising too much on musicality.

Audio texture resynthesis has some history in computer music literature [Schwartz2010], and some researchers have used resynthesis to help understand how the human brain processes audio [McDermott2011].

After some experimentation with these methods, I found that it’s not too hard to build a simple audio texture resynthesizer that exhibits clear musical potential. In this blog post, I’ll walk through a basic technique for making such a system yourself. There won’t be any novel research here, just a demonstration of a minimum viable resynthesizer and my ideas on how to expand on it.

Algorithm

The above-mentioned papers have used fancy techniques including the wavelet transform and auditory filter banks modeled after the human ear. However, I was able to get decent results with a standard STFT spectrogram, then using phase reconstruction to get time-domain audio samples. The full process looks like this:

Compute a magnitude spectrogram $S$ of the time-domain input signal $x$. A fairly high overlap is advised.
Compute any number of feature vectors $F_1(S),\, F_2(S),\, \ldots,\, F_n(S)$ and define their concatenation as $F(S)$.
Initialize a randomized magnitude spectrogram $\hat{S}$.
Use gradient descent on $\hat{S}$ to minimize the error $E(\hat{S}) = ||F(S) - F(\hat{S})||$ (using any norm such as the squared error).
Use phase reconstruction such as the Griffin-Lim algorithm on $\hat{S}$ to produce a resynthesized signal $\hat{x}$.

The cornerstone of making this algorithm work well is that we choose an $F(S)$ that’s differentiable (or reasonably close). This means that the gradient $\nabla E$ can be computed with automatic differentiation (classical backpropagation). As such, this algorithm is best implemented in a differentiable computing environment like PyTorch or Tensorflow.

The features $F(S)$, as well as their relative weights, greatly affect the sound. If $F(S)$ is highly time-dependent then the resynthesized signal will mimic the original in evolution. On the other hand, if $F(S)$ does a lot of pooling across the time axis then the resynthesized signal will mostly ignore the large-scale structure of the input signal. I’m mostly interested in the latter case, where $F(S)$ significantly “remixes” the input signal and disregards the overall structure of the original.

We will represent $S$ as a 2D tensor where the first dimension is frequency and the second is time. As a matrix, each row is an FFT bin, and each column a frame.

If using a fancy alternative to the magnitude spectrogram such CWT or cochlear filter banks, you may have to do gradient descent all the way back to the time-domain samples $x$. These analysis methods break down to linear frequency transforms that produce complex numbers followed by computing the absolute value of each bin, so differentiability is maintained.

Interstate Hydra - Nocturnes

Nathan Ho — Mon, 16 Jan 2023 08:00:00 GMT

Exactly three years ago I made an ambient EP under the name “Interstate Hydra” and put it up on Bandcamp. Up until now I have kept this alias a secret and only shown these tunes to a few people, but I figured it’s been long enough.

Nocturnes was made entirely in Audacity using the “sound dumplings” method described in an earlier post. This was a total 180 from my usual workflow, which is to synthesize everything with SuperCollider code.

“For Ellie” is a single released months after Nocturnes, and is formally the simplest Interstate Hydra track, comprising a row of sound dumplings in arch form.

I don’t plan on returning to the Interstate Hydra alias for quite some time, but I hope you enjoy these tracks. Thank you for listening.

OddVoices Dev Log 3: Pitch Contours

Nathan Ho — Mon, 07 Feb 2022 22:56:37 GMT

This is part of an ongoing series of posts about OddVoices, a singing synthesizer I’ve been building. OddVoices has a Web version, which you can now access at the newly registered domain oddvoices.org.

Unless we’re talking about pitch correction settings, the pitch of a human voice is generally not piecewise constant. A big part of any vocal style is pitch inflections, and I’m happy to say that these have been greatly improved in OddVoices based on studies of real pitch data. But first, we need…

Pitch detection

A robust and high-precision monophonic pitch detector is vital to OddVoices for two reasons: first, the input vocal database needs to be normalized in pitch during the PSOLA analysis process, and second, the experiments we conduct later in this blog post require such a pitch detector.

There’s probably tons of Python code out there for pitch detection, but I felt like writing my own implementation to learn a bit about the process. My requirements are that the pitch detector should work on speech signals, have high accuracy, be as immune to octave errors as possible, and not require an expensive GPU or a massive dataset to train. I don’t need real time capabilities (although reasonable speed is desirable), high background noise tolerance, or polyphonic operation.

I shopped around a few different papers and spent long hours implementing different algorithms. I coded up the following:

Cepstral analysis [Noll1966]
Autocorrelation function (ACF) with prefiltering [Rabiner1977]
Harmonic Product Spectrum (HPS)
A simplified variant of Spectral Peak Analysis (SPA) [Dziubinski2004]
Special Normalized Autocorrelation (SNAC) [McLeod2008]
Fourier Approximation Method (FAM) [Kumaraswamy2015]

There are tons more algorithms out there, but these were the ones that caught my eye for some reason or another. All methods have their own upsides and downsides, and all of them are clever in their own ways. Some algorithms have parameters that can be tweaked, and I did my best to experiment with those parameters to try to maximize results for the test dataset.

I created a test dataset of 10000 random single-frame synthetic waveforms with fundamentals ranging from 60 Hz to 1000 Hz. Each one has harmonics ranging up to the Nyquist frequency, and the amplitudes of the harmonics are randomized and multiplied by $1 / n$ where $n$ is the harmonic number. Whether this is really representative of speech is not an easy question, but I figured it would be a good start.

I scored each algorithm by how many times it produced a pitch within a semitone of the actual fundamental frequency. We’ll address accuracy issues in a moment. The scores are:

Algorithm	Score
Cepstrum	9961/10000
SNAC	9941/10000
FAM	9919/10000
Simplified SPA	9789/10000
ACF	9739/10000
HPS	7743/10000

All the algorithms performed quite acceptably with the exception of the Harmonic Product Spectrum, which leads me to conclude that HPS is not really appropriate for pitch detection, although it does have other applications such as computing the chroma [Lee2006].

What surprised me most is that one of the simplest algorithms, cepstral analysis, also appears to be the best! Confusingly, a subjective study of seven pitch detection algorithms by McGonegal et al. [McGonegal1977] ranked the cepstrum as the 2nd worst. Go figure.

I hope this comparison was an interesting one in spite of how small and unscientific the study is. Be reminded that it is always possible that I implemented one or more of the algorithms wrong, didn’t tweak it in the right way, or didn’t look much into strategies for improving it.

The final algorithm

I arrived at the following algorithm by crossbreeding my favorite approaches:

Compute the “modified cepstrum” as the absolute value of the IFFT of $\log(1 + |X|)$, where $X$ is the FFT of a 2048-sample input frame $x$ at a sample rate of 48000 Hz. The input frame is not windowed – for whatever reason that worked better!
Find the highest peak in the modified cepstrum whose quefrency is above a threshold derived from the maximum frequency we want to detect.
Find all peaks that exceed 0.5 times the value of the highest peak.
Find the peak closest to the last detected pitch, or if there is no last detected pitch, use the highest peak.
Convert quefrency into frequency to get the initial estimate of pitch.
Recompute the magnitude spectrum of $x$, this time with a Hann window.
Find the values of the three bins around the FFT peak at the estimated pitch.
Use an artificial neural network (ANN) on the bin values to interpolate the exact frequency.

The idea of the modified cepstrum, i.e. adding 1 before taking the logarithm of the magnitude spectrum, is borrowed from Philip McLeod’s dissertation on SNAC, and prevents taking the logarithm of values too close to zero. The peak picking method is also taken from the same resource.

The use of an artificial neural network to refine the estimate is from the SPA paper [Dziubinski2004]. The ANN in question is a classic feedforward perceptron, and takes as input the magnitudes of three FFT bins around a peak, normalized so the center bin has an amplitude of 1.0. This means that the center bin’s amplitude is not needed and only two input neurons are necessary. Next, there is a hidden layer with four neurons and a tanh activation function, and finally an output layer with a single neuron and a linear activation function. The output format of the ANN ranges from -1 to +1 and indicates the offset of the sinusoidal frequency from the center bin, measured in bins.

The ANN is trained on a set of synthetic data similar to the test data described above. I used the MLPRegressor in scikit-learn, set to the default “adam” optimizer. The ANN works astonishingly well, yielding average errors less than 1 cent against my synthetic test set.

In spite of the efforts to find a nearly error-free pitch detector, the above algorithm still sometimes produces errors. Errors are identified as pitch data points that exceed a manually specified range. Errors are corrected by linearly interpolating the surrounding good data points.

Source code for the pitch detector is in need of some cleanup and is not yet publicly available as of this writing, but should be soon.

Vocal pitch contour phenomena

I’m sure the above was a bit dry for most readers, but now that we’re armed with an accurate pitch detector, we can study the following phenomena:

Drift: low frequency noise from 0 to 6 Hz [Cook1996].
Jitter: high frequency noise from 6 to 12 Hz.
Vibrato: deliberate sinusoidal pitch variation.
Portamento: lagging effect when changing notes.
Overshoot: when moving from one pitch to another, the singer may extend beyond the target pitch and slide back into it [Lai2009].
Preparation: when moving from one pitch to another, the singer may first move away from the target pitch before approaching it.

There is useful literature on most of these six phenomena, but I also wanted to gather my own data and do a little replication work. I had a gracious volunteer sing a number of melodies consisting of one or two notes, with and without vibrato, and I ran them through my pitch detector to determine the pitch contours.

Drift and jitter: In his study, Cook reported drift of roughly -50 dB and jitter at about -60 to -70 dB. Drift has a roughly flat spectrum and jitter has a sloping spectrum of around -8.5 dB per octave. My data is broadly consistent with these figures, as can be seen in the below spectra.

Drift and jitter are modeled as $f \cdot (1 + x)$ where $f$ is the static base frequency and $x$ is the deviation signal. The ratio $x / f$ is treated as an amplitude and converted to decibels, and this is what is meant by drift and jitter having a decibel value.

Cook also notes that drift and jitter also exhibit a small peak around the natural vibrato frequency, here around 3.5 Hz. Curiously, I don’t see any such peak in my data.

Synthesis can be done with interpolated value noise for drift and “clipped brown noise” for jitter, added together. Interpolated value noise is downsampled white noise with sine wave segment interpolation. Clipped brown noise is defined as a random walk that can’t exceed the range [-1, +1].

Vibrato is, not surprisingly, a sine wave LFO. However, a perfect sine wave sounds pretty unrealistic. Based on visual inspection of vibrato data, I multiplied the sine wave by random amplitude modulation with interpolated value noise. The frequency of the interpolated value noise is the same as the vibrato frequency.

Also note that vibrato takes a moment to kick in, which is simple enough to emulate with a little envelope at the beginning of each note.

Portamento, overshoot, and preparation I couldn’t find much research on, so I sought to collect a good amount of data on them. I asked the singer to perform two-note melodies consisting of ascending and descending m2, m3, P4, P5, and P8, each four times, with instructions to use “natural portamento.” I then ran all the results through the pitch tracker and visually measured rough averages of preparation time, preparation amount, portamento time, overshoot time, and overshoot amount. Here’s the table of my results.

Interval	Prep. time	Prep. amount	Port. time	Over. time	Over. amount
m3 ascending	0.1	0.7	0.15	0.2	0.5
m3 descending	no preparation		0.1	0.3	1
P4 ascending	no preparation		0.1	0.3	0.5
P4 descending	no preparation		0.2	0.2	1
P5 ascending	0.1	0.5	0.2	no overshoot
P5 descending	no preparation		0.2	0.1	1
P8 ascending	0.1	1	0.25	no overshoot
P8 descending	no preparation		0.15	0.1	1.5

As one might expect, portamento time gently increases as the interval gets larger. There is no preparation for downward intervals, and spotty overshoot for upward intervals, both of which make some sense physiologically – you’re much more likely to involuntarily relax in pitch rather than tense up. Overshoot and preparation amounts have a slight upward trend with interval size. The overshoot time seems to have a downward trend, but overshoot measurement is pretty unreliable.

Worth noting is the actual shape of overshoot and preparation.

In OddVoices, I model these three pitch phenomena by using quarter-sine-wave segments, and assuming no overshoot when ascending and no preparation when descending.

Further updates

Pitch detection and pitch contours consumed most of my time and energy recently, but there are a few other updates too.

As mentioned earlier, I registered the domain oddvoices.org, which currently hosts a copy of the OddVoices Web interface. The Web interface itself looks a little bland – I’d even say unprofessional – so I have plans to overhaul it especially as new parameters are on the way.

The README has been heavily updated, taking inspiration from the article “Art of README”. I tried to keep it concise and prioritize information that a casual reader would want to know.

References

[Noll1966]

Noll, A. Michael. 1966. “Cepstrum Pitch Determination.

[Rabiner1977]

Rabiner, L. 1977. “On the Use of Autocorrelation Analysis for Pitch Detection.”

[Dziubinski2004] (1,2)

Diubinski, M. and Kostek, B. 2004. “High Accuracy and Octave Error Immune Pitch Detection Algorithms.”

[McLeod2008]

McLeod, Philip. 2008. “Fast, Accurate Pitch Detection Tools for Music Analysis.”

[Kumaraswamy2015]

Kumaraswamy, B. and Poonacha, P. G. 2015. “Improved Pitch Detection Using Fourier Approximation Method.”

[Cook1996]

Cook, P. R. 1996. “Identification of Control Parameters in an Articulatory Vocal Tract Model with Applications to the Synthesis of Singing.”

[Lai2009]

Lai, Wen-Hsing. 2009. “An F0 Contour Fitting Model for Singing Synthesis.”

[Lee2006]

Lee, Kyogu. 2006. “Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile.”

[McGonegal1977]

McGonegal, Carol A. et al. 1977. “A Subjective Evaluation of Pitch Detection Methods Using LPC Synthesized Speech.”

OddVoices Dev Log 2: Phase and Volume

Nathan Ho — Sun, 23 Jan 2022 18:25:00 GMT

This is the second in an ongoing series of dev updates about OddVoices, a singing synthesizer I’ve been developing over the past year. Since we last checked in, I’ve released version 0.0.1. Here are some of the major changes.

New voice!

Exciting news: OddVoices now has a third voice. To recap, we’ve had Quake Chesnokov, a powerful and dark basso profondo, and Cicada Lumen, a bright and almost synth-like baritone. The newest voice joining us is Air Navier (nav-YEH), a soft, breathy alto. Air Navier makes a lovely contrast to the two more classical voices, and I’m imagining it will fit in great in a pop or indie rock track.

Goodbye GitHub

OddVoices makes copious use of Git LFS to store original recordings for voices, and this caused some problems for me this past week. GitHub’s free tier caps the amount of Git LFS storage and the monthly download bandwidth to 1 gigabyte. It is possible to pay $5 to add 50 GB to both storage and bandwidth limits. These purchases are “data packs” and are orthogonal to GitHub Pro.

What’s unfortunate is that all downloads by anyone (including those on forks) contribute to the monthly download bandwidth, and even worse, downloads from GitHub Actions do also. I am easily running CI dozens of times per week, and multiplied by the gigabyte or so of audio data, the plan is easily maxed out.

A free GitLab account has a much more workable storage limit of 10 GB, and claims unlimited bandwidth for now. GitLab it is. Consider this a word of warning for anyone making serious use of Git LFS together with GitHub, and especially GitHub Actions.

Goodbye MBR-PSOLA

OddVoices, taking after speech synthesizers of the 90’s, is based on concatenation of recorded segments. These segments are processed using PSOLA, which turns them into a sequence of frames (grains), each for one pitch period. PSOLA then allows manipulation of the segment in pitch, time, and formants, and sounds pretty clean. The synthesis component is also computationally efficient.

One challenge with a concatenative synthesizer is making the segments blend together nicely. We are using a crossfade, but a problem arises – if the phases of the overlapping frames don’t approximately match, then unnatural croaks and “doubling” artifacts happen.

There is a way to solve this: manually. If one lines up the locations of the frames so they are centered on the exact times when the vocal folds close (the so-called “glottal closure instant” or GCI), the phases will match. Since it’s difficult to find the GCI from a microphone signal, an electroglottograph (EGG) setup is typically used. I don’t have an EGG on hand, and I’m working remotely with singers, so this solution has to be ruled out.

A less daunting solution is to use FFT processing to make all phases zero, or set every frame to minimum phase. These solve the phase mismatch problem but sound overtly robotic and buzzy. (Forrest Mozer’s TSI S14001A speech synthesis IC, memorialized in chipspeech’s Otto Mozer, uses the zero phase method – see US4214125A.) MBR-PSOLA softens the blows of these methods by using a random set of phases that are fixed throughout the voice database. Dutoit recommends only randomizing the lower end of the spectrum while leaving the highs untouched. It sounds pretty good, but there is still an unnatural hollow and phasey quality to it.

I decided to search around the literature and see if there’s any way OddVoices can improve on MBR-PSOLA. I found [Stylianou2001], which seems to fit the bill. It recommends computing the “center” of a grain, then offsetting the frame so it is centered on that point. The center is not the exact same as the GCI, but it acts as a useful stand-in. When all grains are aligned on their centers, their phases should be roughly matched too – and all this happens without modifying the timbre of the voice, since all we’re doing is a time offset.

I tried this on the Cicada voice, and it worked! I didn’t conduct any formal listening experiment, but it definitely sounded clearer and lacking the weird hollowness of the MBROLA voice. Then I tried it on the Quake voice, and it sounded extremely creaky and hoarse. This is the result of instabilities in the algorithm, producing random timing offsets for each grain.

Frame adjustment

Let $x[t]$ be a sampled quasiperiodic voice signal with period $T$, with a sample rate of $f_s$. We round $T$ to an integer, which works well enough for our application. Let $w[t]$ be a window function (I use a Hann window) of length $2T$. Brackets are zero-indexed, because we are sensible people here.

The PSOLA algorithm divides $x$ into a number of frames of length $2T$, where the $n$-th frame is given by $s_n[t] = w[t] x[t + nT]$.

Stylianou proposes the “differentiated phase spectrum” center, or DPS center, which is computed like so:

\begin{equation*} \eta = \frac{T}{2\pi} \arg \sum_{i = -T}^{T - 1} s_n^2[t] e^{2 \pi j t / T} \end{equation*}

$\eta$ is here expressed in samples. The DPS center is not the GCI. It’s… something else, and it’s admitted in the paper that it isn’t well defined. However, it is claimed that it will be close enough to the GCI, hopefully by a near-constant offset. To normalize a frame on its DPS center, we recalculate the frame with an offset of $\eta$: $s'_n[t] = w[t] x[t + nT + \text{round}(\eta)]$.

The paper also discusses the center of gravity of a signal as a center close to the GCI. However, the center of gravity is less robust than the DPS center, as it can be shown that the center can be computed from just a single bin in the discrete Fourier transform, whereas the DPS center involves the entire spectrum.

Here’s where we go beyond the paper. As discussed above, for certain signals $\eta$ can be noisy, and using this algorithm as-is can result in audible jitter in the result. The goal, then, is to find a way to remove noise from $\eta$.

After many hours of experimenting with different solutions, I ended up doing a lowpass filter on $\eta$ to remove high-frequency noise. A caveat is that $\eta$ is a circular value that wraps around with period $T$, and performing a standard lowpass filter will smooth out discontinuities produced by wrapping, which is not what we want. The trick is to use an encoding common in circular statistics, and especially in machine learning: convert it to sine and cosine, perform filtering on both signals, and convert it back with atan2. A rectangular FIR filter worked perfectly well for my application.

Overall the result sounds pretty good. There are still some minor issues with it, but I hope to iron those out in future versions.

Volume normalization

I encountered two separate but related issues regarding the volume of the voices. The first is that the voices are inconsistent in volume – Cicada was much louder than the other two. The second, and the more serious of the two, is that segments can have different volumes when they are joined, and this results in a “choppy” sound with discontinuities.

I fixed global volume inconsistency by taking the RMS amplitude of the entire segment database and normalizing it to -20 dBFS. For voices with higher dynamic range, this caused some of the louder consonants to clip, so I added a safety limiter that ensures the peak amplitude of each frame is no greater than -6 dBFS.

Segment-level volume inconsistency can be addressed by examining diphones that join together and adjusting their amplitudes accordingly. Take the phoneme /k/, and gather a list of all diphones of the form k* and *k. Now inspect the amplitudes at the beginning of k* diphones, and the amplitudes at the end of *k diphones. Take the RMS of all these amplitudes together to form the “phoneme amplitude.” Repeat for all other phonemes. Then, for each diphone, apply a linear amplitude envelope so that the beginning frames match the first phoneme’s amplitude and the ending frames match the second phoneme’s amplitude. The result is that all joined diphones will have a matched amplitude.

Conclusion

The volume normalization problem in particular taught me that developing a practical speech or singing synthesizer requires a lot more work than papers and textbooks might make you think. Rather, the descriptions in the literature are only baselines for a real system.

More is on the way for OddVoices. I haven’t yet planned out the 0.0.2 release, but my hope is to work on refining the existing voices for intelligibility and naturalness instead of adding new ones.

References

[Stylianou2001]

Stylianou, Yannis. 2001. “Removing Linear Phase Mismatches in Concatenative Speech Synthesis.”

OddVoices Dev Log 1: Hello World!

Nathan Ho — Wed, 12 Jan 2022 02:00:09 GMT

The free and open source singing synthesizer landscape has a few projects worth checking out, such as Sinsy, eCantorix, meSing, and MAGE. While each one has its own unique voice and there’s no such thing as a bad speech or singing synthesizer, I looked into all of them and more and couldn’t find a satisfactory one for my musical needs.

So, I’m happy to announce OddVoices, my own free and open source singing synthesizer based on diphone concatenation. It comes with two English voices, with more on the way. If you’re not some kind of nerd who uses the command line, check out OddVoices Web, a Web interface I built for it with WebAssembly. Just upload a MIDI file and write some lyrics and you’ll have a WAV file in your browser.

OddVoices is based on MBR-PSOLA, which stands for Multi-Band Resynthesis Pitch Synchronous Overlap Add. PSOLA is a granular synthesis-based algorithm for playback of monophonic sounds such that the time, formant, and pitch axes can be manipulated independently. The MBR part is a slight modification to PSOLA that prevents unwanted phase cancellation when crossfading between concatenated samples, and solves other problems too. For more detail, check out papers from the MBROLA project. The MBROLA codebase itself has some tech and licensing issues I won’t get into, but the algorithm is perfect for what I want in a singing synth. Note that OddVoices doesn’t interface with MBROLA.

I’ll use this post to discuss some of the more interesting challenges I had to work on in the course of the project so far. This is the first in a series of posts I will be making about the technical side of OddVoices.

Vowel mergers

OddVoices currently only supports General American English (GA), or more specifically the varieties of English that I and the singers speak. I hope in the future that I can correct this bias by including other languages and other dialects of English.

When assembling the list of phonemes, the cot-caught merger immediately came up. I decided to merge them, and make /O/ and /A/ aliases except for /Or/ and /Ar/ (here using X-SAMPA). To reduce the number of phonemes and therefore phoneme combinations, I represent /Or/ internally as /oUr/.

A more interesting merger concerns the problem of the schwa. In English, the schwa is used to represent an unstressed syllable, but the actual phonetics of that syllable can vary wildly. In singing, a syllable that would be unstressed in spoken English can be drawn out for multiple seconds and become stressed. The schwa isn’t actually sung in these cases, and is replaced with another phoneme. As one of the singers put it, “the schwa is a big lie.”

This matters when working with the CMU Pronouncing Dictionary, which I’m using for pronouncing text. Take a word like “imitate” – the second syllable is unstressed, and the CMUDict transcribes it as a schwa. But when sung, it’s more like /I/. This is simply a limitation of the CMUDict that I don’t have a good solution for. In the end I merge /@/ with /V/, since the two are closely related in GA. Similarly, /3`/ and /@`/ are merged, and the CMUDict doesn’t even distinguish those.

Real-time vs. semi-real-time operation

A special advantage of OddVoices over alternative offerings is that it’s built from scratch to work in real time. That means that it can become a UGen for platforms like SuperCollider and Pure Data, or even a VST plugin in the far future. I have a SuperCollider UGen in the works, but there’s some tricky engineering work involving communication between RT and NRT threads that I haven’t tackled yet. Stay tuned.

There is a huge caveat to real time operation: singers don’t operate in perfect real time! To see why, imagine the lyrics “rice cake,” sung with two half notes. The final /s/ in “rice” has to happen before the initial /k/ in “cake,” and the latter happens right on the third beat, so the singer has to anticipate the third beat with the consonant /s/. But in MIDI and real-time keyboard playing, there is no way to predict when the note off will happen until the third beat has already arrived.

VOCALOID handles this by being its own DAW with a built-in sequencer, so it can look ahead as much as it needs. chipspeech and Alter/Ego work in real time. In their user guides, they ask the user to shorten every MIDI note to around 50%-75% of its length to accommodate final consonant clusters. If this is not done, a phenomenon I call “lyric drift” happens and the lyrics misalign from the notes.

OddVoices supports two possible modes: true real-time mode and semi-real-time mode. In true real-time mode, we don’t know the durations of notes, so we trigger the final consonant cluster on a note off. Like chipspeech and Alter/Ego, this requires manual shortening of notes to prevent lyric drift. Alternatively, OddVoices supports a semi-real-time mode where every note on is accompanied by the duration of the note. This way OddVoices can predict the timing of the final consonant cluster, but still operate in otherwise real-time.

Semi-real-time mode is used in OddVoices’ MIDI frontend, and can also be used in powerful sequencing environments like SC and Pd by sending a “note length” signal along with the note on trigger. I think it’s a nice compromise between the constraints of real-time and the omniscience of non-real-time.

Syllable compression

After I implemented semi-real-time mode, another problem remained that reared its head in fast singing passages. This happens when, say, the lyric “rice cake” is sung very quickly, and the diphones _r raI aIs (here using X-SAMPA notation), when concatenated, will be longer than the note length. The result is more lyric drift – the notes and the lyrics diverge.

The fix for this was to peek ahead in the diphone queue and find the end of the final consonant cluster, then add up all the segment lengths from the beginning to that point. This is how long the entire syllable would last. This is then compared to the note length, and if it is longer, the playback speed is increased for that syllable to compensate. In short, consonants have to be spoken quickly in order to fit in quickly sung passages.

The result is still not entirely satisfactory to my ears, and I plan to improve it in future versions of the software. Syllable compression is of course only available in semi-real-time mode.

Syllable compression is evidence that fast singing is phonetically quite different from slow singing, and perhaps more comparable to speech.

Stray thoughts

This is my second time using Emscripten and WebAssembly in a project, and I find it an overall pleasant technology to work with (especially with embind for C++ bindings). I did run into an obstacle, however, which was that I couldn’t figure out how to compile libsndfile to WASM. The only feature I needed was writing a 16-bit mono WAV file, so I dropped libsndfile and wrote my own code for that.

I was surprised by the compactness of this project so far. The real-time C++ code adds up to 1,400 lines, and the Python offline analysis code only 600.

Hearing Graphs

Nathan Ho — Tue, 21 Sep 2021 00:38:19 GMT

My latest project is titled Hearing Graphs, and you can access it by clicking on the image below.

Hearing Graphs is a sonification of graphs using the graph spectrum – the eigenvalues of the graph’s adjacency matrix. Both negative and positive eigenvalues are represented by piano samples, and zero eigenvalues are interpreted with a bass drum. The multiplicity of each eigenvalue is represented by hitting notes multiple times.

Most sonifications establish some audible relationship between the source material and the resulting audio. This one remains mostly incomprehensible, especially to a general audience, so I consider it a failed experiment in that regard. Still, it was fun to make.

Announcing Canvas

Nathan Ho — Wed, 15 Sep 2021 01:26:30 GMT

Canvas (working title) is a visual additive synthesizer for Windows, macOS, and Linux where you can create sound by drawing an image. This is accomplished with a bank of 239 oscillators spaced at quarter tones, with stereo amplitudes mapped to the red and blue channels of the image. You can import images and sonify them, you can import sounds and convert their spectrograms to images, and you can apply several image filters like reverb, tremolo, and chorus. Check out the demo:

Canvas is directly inspired by the Image Synth from U&I Software’s MetaSynth. When I first heard of MetaSynth, I wrote it off as a gimmick that could only produce sweeps and whooshes. It wasn’t until I heard Benn Jordan’s recent demo and learned of its powerful set of image filters that I was immediately sold on the approach. I decided to build an alternative that’s just featureful enough to yield musical results, a sort of MS Paint to MetaSynth’s Photoshop.

Canvas has a lot of rough edges, and currently requires building from scratch if using Linux or macOS. Nevertheless, I hope it is of some interest to the music tech community.

Venn 7

Nathan Ho — Thu, 31 Dec 2020 08:00:00 GMT

I have released Venn 7, a little artistic Web app that turns symmetric 7-fold Venn diagrams into a musical interface. Click the screenshot below to visit the app, which also has an attached explanatory article.

There are some bugs and minor issues with it which I hope to hammer out as I receive feedback. Check the repository for bug reports and some technical documentation.