OddVoices Dev Log 2: Phase and Volume

This is the second in an ongoing series of dev updates about OddVoices, a singing synthesizer I've been developing over the past year. Since we last checked in, I've released version 0.0.1. Here are some of the major changes.

New voice!

Exciting news: OddVoices now has a third voice. To recap, we've had Quake Chesnokov, a powerful and dark basso profondo, and Cicada Lumen, a bright and almost synth-like baritone. The newest voice joining us is Air Navier (nav-YEH), a soft, breathy alto. Air Navier makes a lovely contrast to the two more classical voices, and I'm imagining it will fit in great in a pop or indie rock track.

Goodbye GitHub

OddVoices makes copious use of Git LFS to store original recordings for voices, and this caused some problems for me this past week. GitHub's free tier caps the amount of Git LFS storage and the monthly download bandwidth to 1 gigabyte. It is possible to pay \$5 to add 50 GB to both storage and bandwidth limits. These purchases are "data packs" and are orthogonal to GitHub Pro.

What's unfortunate is that all downloads by anyone (including those on forks) contribute to the monthly download bandwidth, and even worse, downloads from GitHub Actions do also. I am easily running CI dozens of times per week, and multiplied by the gigabyte or so of audio data, the plan is easily maxed out.

A free GitLab account has a much more workable storage limit of 10 GB, and claims unlimited bandwidth for now. GitLab it is. Consider this a word of warning for anyone making serious use of Git LFS together with GitHub, and especially GitHub Actions.

Goodbye MBR-PSOLA

OddVoices, taking after speech synthesizers of the 90's, is based on concatenation of recorded segments. These segments are processed using PSOLA, which turns them into a sequence of frames (grains), each for one pitch period. PSOLA then allows manipulation of the segment in pitch, time, and formants, and sounds pretty clean. The synthesis component is also computationally efficient.

One challenge with a concatenative synthesizer is making the segments blend together nicely. We are using a crossfade, but a problem arises -- if the phases of the overlapping frames don't approximately match, then unnatural croaks and "doubling" artifacts happen.

There is a way to solve this: manually. If one lines up the locations of the frames so they are centered on the exact times when the vocal folds close (the so-called "glottal closure instant" or GCI), the phases will match. Since it's difficult to find the GCI from a microphone signal, an electroglottograph (EGG) setup is typically used. I don't have an EGG on hand, and I'm working remotely with singers, so this solution has to be ruled out.

A less daunting solution is to use FFT processing to make all phases zero, or set every frame to minimum phase. These solve the phase mismatch problem but sound overtly robotic and buzzy. (Forrest Mozer's TSI S14001A speech synthesis IC, memorialized in chipspeech's Otto Mozer, uses the zero phase method -- see US4214125A.) MBR-PSOLA softens the blows of these methods by using a random set of phases that are fixed throughout the voice database. Dutoit recommends only randomizing the lower end of the spectrum while leaving the highs untouched. It sounds pretty good, but there is still an unnatural hollow and phasey quality to it.

I decided to search around the literature and see if there's any way OddVoices can improve on MBR-PSOLA. I found [Stylianou2001], which seems to fit the bill. It recommends computing the "center" of a grain, then offsetting the frame so it is centered on that point. The center is not the exact same as the GCI, but it acts as a useful stand-in. When all grains are aligned on their centers, their phases should be roughly matched too -- and all this happens without modifying the timbre of the voice, since all we're doing is a time offset.

I tried this on the Cicada voice, and it worked! I didn't conduct any formal listening experiment, but it definitely sounded clearer and lacking the weird hollowness of the MBROLA voice. Then I tried it on the Quake voice, and it sounded extremely creaky and hoarse. This is the result of instabilities in the algorithm, producing random timing offsets for each grain.

Let $$x[t]$$ be a sampled quasiperiodic voice signal with period $$T$$, with a sample rate of $$f_s$$. We round $$T$$ to an integer, which works well enough for our application. Let $$w[t]$$ be a window function (I use a Hann window) of length $$2T$$. Brackets are zero-indexed, because we are sensible people here.

The PSOLA algorithm divides $$x$$ into a number of frames of length $$2T$$, where the $$n$$-th frame is given by $$s_n[t] = w[t] x[t + nT]$$.

Stylianou proposes the "differentiated phase spectrum" center, or DPS center, which is computed like so:

\begin{equation*} \eta = \frac{T}{2\pi} \arg \sum_{i = -T}^{T - 1} s_n^2[t] e^{2 \pi j t / T} \end{equation*}

$$\eta$$ is here expressed in samples. The DPS center is not the GCI. It's... something else, and it's admitted in the paper that it isn't well defined. However, it is claimed that it will be close enough to the GCI, hopefully by a near-constant offset. To normalize a frame on its DPS center, we recalculate the frame with an offset of $$\eta$$: $$s'_n[t] = w[t] x[t + nT + \text{round}(\eta)]$$.

The paper also discusses the center of gravity of a signal as a center close to the GCI. However, the center of gravity is less robust than the DPS center, as it can be shown that the center can be computed from just a single bin in the discrete Fourier transform, whereas the DPS center involves the entire spectrum.

Here's where we go beyond the paper. As discussed above, for certain signals $$\eta$$ can be noisy, and using this algorithm as-is can result in audible jitter in the result. The goal, then, is to find a way to remove noise from $$\eta$$.

After many hours of experimenting with different solutions, I ended up doing a lowpass filter on $$\eta$$ to remove high-frequency noise. A caveat is that $$\eta$$ is a circular value that wraps around with period $$T$$, and performing a standard lowpass filter will smooth out discontinuities produced by wrapping, which is not what we want. The trick is to use an encoding common in circular statistics, and especially in machine learning: convert it to sine and cosine, perform filtering on both signals, and convert it back with atan2. A rectangular FIR filter worked perfectly well for my application.

Overall the result sounds pretty good. There are still some minor issues with it, but I hope to iron those out in future versions.

Volume normalization

I encountered two separate but related issues regarding the volume of the voices. The first is that the voices are inconsistent in volume -- Cicada was much louder than the other two. The second, and the more serious of the two, is that segments can have different volumes when they are joined, and this results in a "choppy" sound with discontinuities.

I fixed global volume inconsistency by taking the RMS amplitude of the entire segment database and normalizing it to -20 dBFS. For voices with higher dynamic range, this caused some of the louder consonants to clip, so I added a safety limiter that ensures the peak amplitude of each frame is no greater than -6 dBFS.

Segment-level volume inconsistency can be addressed by examining diphones that join together and adjusting their amplitudes accordingly. Take the phoneme /k/, and gather a list of all diphones of the form k* and *k. Now inspect the amplitudes at the beginning of k* diphones, and the amplitudes at the end of *k diphones. Take the RMS of all these amplitudes together to form the "phoneme amplitude." Repeat for all other phonemes. Then, for each diphone, apply a linear amplitude envelope so that the beginning frames match the first phoneme's amplitude and the ending frames match the second phoneme's amplitude. The result is that all joined diphones will have a matched amplitude.

Conclusion

The volume normalization problem in particular taught me that developing a practical speech or singing synthesizer requires a lot more work than papers and textbooks might make you think. Rather, the descriptions in the literature are only baselines for a real system.

More is on the way for OddVoices. I haven't yet planned out the 0.0.2 release, but my hope is to work on refining the existing voices for intelligibility and naturalness instead of adding new ones.

References

Stylianou2001

Stylianou, Yannis. 2001. "Removing Linear Phase Mismatches in Concatenative Speech Synthesis."