
State of the Blog, 2023

As 2023 ends, it seems like a good time to reflect and allow myself some vanity. It’s also a few days shy of the 15th anniversary of the first thing I created on the Internet (a certain wiki I have mentioned before). Amusingly, I’m really doing the same thing as I was back then: publicly writing about niche technical subjects. Back then it was mathematics, now music technology.

I set a goal for myself to release at least one substantial technical blog post every month this year. I’m 11 for 12 as of this writing, with only this month’s blog post unfinished.

Outside of writing, I also released a full-length album, began performing live, and gave some talks and guest lectures.

Meeting the monthly deadline required pushing myself a little, but I’m glad I set this schedule, because the self-imposed time constraints forced me to scope projects realistically and suppress perfectionism. Often all I needed to do for a blog post was to find an exciting topic and demonstrate its value with some nice sounds; there’s a time and place for thorough research and literature review, but I’m okay without it for most projects here. I also felt the variety was beneficial, since it helped me gauge my level of interest in various music tech topics and, more importantly, my audience’s. I don’t have analytics or a comment section, but I do hear from readers by email occasionally (thank you!), and this helps me understand what kind of writing is most wanted in the music tech community.

Speaking of engagement, now’s as good a time as any to announce that I will be pivoting to topics suggested to me by this free online Blog Idea Generator:

  1. Unleashing the Power of Music Technology: How it's Revolutionizing the Music Industry

  2. From Vinyl to Virtual: Exploring the Evolution of Music Technology

  3. The Future of Music: How Technology is Shaping the Sound of Tomorrow

  4. Music Tech 101: A Beginner's Guide to Understanding the Latest Innovations

  5. Behind the Beats: The Role of Technology in Music Production and Composition


Audio Effects with Cepstral Processing

Much like the previously discussed wavelet transforms, the cepstrum is a frequency-domain method that I see talked about a lot in the scientific research literature, but only occasionally applied to the creative arts. The cepstrum is sometimes described as “the FFT of the FFT” (although this is an oversimplification, since there are nonlinear operations sandwiched between the two transforms, and the second is really the Discrete Cosine Transform). In contrast to wavelets, the cepstrum is very popular in audio processing, most notably in the ubiquitous mel-frequency cepstral coefficients (MFCCs). Some would not consider the MFCCs a true “cepstrum”; others would say the term “cepstrum” is broad enough to encompass them. I have no strong opinion.

In almost all applications of the cepstrum, it is used solely for analysis and generally isn’t invertible. This is the case for MFCCs, where the magnitude spectrum is downsampled in the conversion to the mel scale, resulting in a loss of information. Resynthesizing audio from the cepstral descriptors commonly used in the literature is an underdetermined problem, usually tackled with machine learning or other complex optimization methods.

However, it is actually possible to implement audio effects in the MFCC domain with perfect reconstruction. You just have to keep around all the information that gets discarded, resulting in this signal chain:

  1. Take the STFT. The following steps apply for each frame.

  2. Compute the power spectrum (square of magnitude spectrum) and the phases.

  3. Apply a bank of bandpass filters, equally spaced on the mel-frequency scale, to the power spectrum. This is the mel spectrum; it downsamples the power spectrum, losing information.

  4. Upsample the mel spectrum back up to a full-resolution spectral envelope. Divide the power spectrum by the envelope to produce the residual spectrum. (You have to add a little epsilon to the envelope to prevent division by zero.)

  5. Compute the logarithm and then the Discrete Cosine Transform of the mel spectrum to produce the MFCCs.

  6. Perform any processing desired.

  7. Invert step 5: take the inverse DCT and then exponentiate to produce the mel spectrum.

  8. Invert step 4: upsample the mel spectrum to the spectral envelope, and multiply it by the residual spectrum to produce the power spectrum.

  9. Recombine the power spectrum with the phases to produce the complex spectrum.

  10. Inverse FFT, then overlap-add to resynthesize the signal.
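As a rough per-frame sketch of steps 2–9 in Python (the triangular mel filterbank design and the epsilon handling are my assumptions; the steps above don’t prescribe them), keeping the residual makes the round trip exact up to floating-point error:

```python
import numpy as np
from scipy.fft import dct, idct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_bins, sr, fmin=20.0, fmax=20000.0):
    # Triangular filters with centers equally spaced on the mel scale
    # (steps 3/4; the exact filter shape is my assumption).
    points = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_bands + 2))
    bins = np.linspace(0.0, sr / 2.0, n_bins)
    fb = np.zeros((n_bands, n_bins))
    for i in range(n_bands):
        lo, center, hi = points[i], points[i + 1], points[i + 2]
        tri = np.minimum((bins - lo) / (center - lo), (hi - bins) / (hi - center))
        fb[i] = np.maximum(tri, 0.0)
    return fb

def mfcc_analyze(power, fb, eps=1e-9):
    mel_spec = fb @ power                    # step 3: lossy downsampling to mel bands
    envelope = fb.T @ mel_spec + eps         # step 4: upsample to a spectral envelope
    residual = power / envelope              #         keep what the envelope discards
    mfccs = dct(np.log(mel_spec + eps), norm="ortho")  # step 5
    return mfccs, residual

def mfcc_synthesize(mfccs, residual, fb, eps=1e-9):
    mel_spec = np.exp(idct(mfccs, norm="ortho")) - eps  # invert step 5
    envelope = fb.T @ mel_spec + eps                    # invert step 4
    return residual * envelope                          # back to the power spectrum
```

With no processing in step 6, analysis followed by synthesis reconstructs the power spectrum exactly (up to floating-point error), which is easy to verify before adding any effects.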

It’s a lot of steps, but as an extension of the basic MFCC algorithm, it’s not that much of a leap. I would not be surprised if someone has done this before, storing all residuals when computing the MFCCs so the process can be inverted, but I had difficulty finding prior work on this for the particular application of musical effects. Something similar is done in MFCC-based vocoders, where the “residual spectrum” is instead replaced with speech parameters such as pitch, but I haven’t seen this done on general, non-speech signals.

I will be testing on the following mono snippet of Ed Sheeran’s “Perfect.” (If you plan on doing many listening tests on a musical signal, never use a sample of music you enjoy.)

As for the parameters: mono, 48 kHz sample rate, 2048-sample FFT buffer with Hann window and 50% overlap, 30-band mel spectrum from 20 Hz to 20 kHz.

Cepstral EQ

Because of the nonlinearities involved in the signal chain, merely multiplying the MFCCs by a constant can do some pretty strange things. Zeroing out all MFCCs has the effect of removing the spectral envelope and whitening the signal. The effect on vocal signals is pronounced, turning Ed into a bumblebee.

Multiplying all MFCCs by 2 has a subtle, hollower quality, acting as an expander for the spectral envelope.

MFCCs are signed and can also be multiplied by negative values, which inverts the phase of a cosine wave component. The effect on the signal is hard to describe:

We can apply any MFCC envelope desired. Here’s a sine wave:

Cepstral frequency shifting

Technically this would be “quefrency shifting.” This cyclically rotates the MFCCs to brighten the signal:

And here’s the downward equivalent:

Cepstral frequency scaling

Resampling the MFCCs sounds reminiscent of formant shifting. This is related to the time-scaling property of the Fourier transform: if you resample the spectrum, you’re also resampling the signal. Here’s upward scaling:

Here’s downward scaling:
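Both manipulations are essentially one-liners on the coefficient vector. A hypothetical numpy sketch (the sign conventions, i.e. which direction brightens, are my assumption):

```python
import numpy as np

def quefrency_shift(mfccs, k):
    # Cyclically rotate the coefficients by k positions.
    return np.roll(mfccs, k)

def quefrency_scale(mfccs, factor):
    # Resample the coefficient axis; reminiscent of formant shifting.
    n = len(mfccs)
    positions = np.arange(n) / factor  # factor > 1 stretches the envelope
    return np.interp(positions, np.arange(n), mfccs, right=0.0)
```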

Cepstral time-based effects

Here’s what happens when we freeze the MFCCs every few frames:

Lowpass filtering the MFCCs over time tends to slur speech:
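Both of these time-based effects are simple frame-wise operations on the MFCC matrix. A hypothetical sketch (the frame layout and smoothing coefficient are my choices):

```python
import numpy as np

def freeze_frames(mfcc_frames, hold=4):
    # Hold every `hold`-th frame, repeating it over the following frames.
    idx = (np.arange(len(mfcc_frames)) // hold) * hold
    return mfcc_frames[idx]

def smooth_frames(mfcc_frames, alpha=0.2):
    # One-pole lowpass along the time axis; tends to slur speech.
    out = np.empty_like(mfcc_frames)
    acc = mfcc_frames[0].astype(float)
    for i, frame in enumerate(mfcc_frames):
        acc = acc + alpha * (frame - acc)
        out[i] = acc
    return out
```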

Stray thoughts

I have barely scratched the surface of cepstral effects here, opting only to explore the most mathematically straightforward operations. That the MFCCs produce some very weird and very musical effects, even with such simple transformations, is encouraging.

In addition to exploring more types of effects, it is also worthwhile to adjust the transforms being used. The DCT as the space for the spectral envelope could be improved on. One (strange) possibility that came to mind is messing with the Multiresolution Analysis of the mel spectrum; I have no idea if that would sound interesting or not, but it’s worth a shot.

It’s possible to bypass the MFCCs and just do the DCT of the log-spectrogram. I experimented with this and found that I couldn’t get it to sound as musical as the mel-based equivalent. I believe this is because the resolution of the FFT isn’t very perceptually salient. The mel scale is in fact doing a lot of heavy lifting here.

Making Synthesized Sounds More Acoustic

I have been experimenting a lot with finding ways to get more acoustic sounds out of synthesizers. These sounds don’t need to be perfect recreations of any particular real instrument, but I want a piece of the complexity and depth that those have, and also to investigate “Pinocchio” synth patches that fall short of becoming a real boy in hopefully interesting ways.

Computer music types often jump to physical modeling, a field that I adore and have casually researched. But with the exception of modal synthesis, most physical modeling paradigms take considerable software engineering tenacity to get good results — especially the finite difference models, but waveguides too. I do intend to explore them further, but I also believe that some cool sounds can come out of chains of completely ordinary oscillators and effects. In my experiments, I’ve come across a bunch of little tricks that can help lend some more realism to those kinds of synth patches. Many of these also apply to sophisticated physical models (after all, physical models can’t deliver you from having to do sound design).

  1. In general, randomize everything a little bit and modulate everything with slow, smooth random LFOs.

  2. Real percussive envelopes have a very tall initial spike. Inspecting the waveform of an unprocessed xylophone hit, I was surprised by how loud the transient of a typical percussive instrument is compared to its resonating tail.

  3. The high dynamic range can make such sounds tough to bring up in the mix, and can often be addressed by adding clipping or similar distortion that only gets driven during the initial transient. This improves “bite” as well. However, acoustic sounds by nature have higher dynamic range, and clipping and compression can take away from that. Find a balance that works for you.

  4. Key tracking (modifying synth parameters based on register) is essential. No acoustic instrument has the same physics in every register, and some have very limited pitch range in the first place. I usually at least key track amplitude and the cutoff of a lowpass filter. Don’t get discouraged if something sounds good in one octave but bad if you transpose it. You may even need entirely different patches for different octaves.

  5. In a tonal percussive synth, it’s essential that partials decay at different rates. A rule of thumb for damped physical resonators is that the decay time is roughly proportional to the inverse of the square of frequency. For example, going up an octave will multiply the decay time by about 0.25. This is not only true of partials within a note, but even of different keys of many mallet instruments. (In a piano, different registers have different physical constructions including different numbers of strings per key, which I believe is specifically compensating for this phenomenon.)

  6. Synthesized drums made using oscillators benefit from some subtle stereo detuning.

  7. You can spruce up standard ADSR or percussive envelopes by using multiple little spikes in sequence, a bit like an 808 clap envelope. These spikes can be obvious or subtle.

  8. Add little noise bursts and puffs to every sharp transient. Delay the noise bursts relative to each other a little bit and randomize all properties slightly. Even if the bursts are subtle, the effect will add up tremendously. Noise doesn’t need to be white or pink; crackly impulsive noise is fun too, and more metallic noise is possible using banks of bandpass filters or inharmonic FM.

  9. Adding a little puff of noise before the transient can sound really nice for reed instruments, and I’ve gotten a pretty decent thumb piano sound with it by simulating the thumb scraping against the key. Watch yourself, Four Tet.

  10. Add “box tone” to every sound source using a bunch of random peaking filters that boost different bands by ±2 dB (maybe more if you’re feeling adventurous!). Wiggle the parameters around slowly if desired, which is pretty ad hoc but might mimic physical changes in temperature, posture, grip of the instrument, etc. Box tone is a good idea in general to compensate for the relative cleanliness of an all-digital signal path. You can even use multiple layers of wiggly EQ with gentle nonlinearities sandwiched between them.

  11. This is an obvious one, but almost all acoustic instruments have resonating bodies, so reverb can make or break the realism of a sound. Use slightly different subtle reverbs on every instrument, and prefer “weird” and metallic reverbs with short decay times. I often just use a bank of parallel comb filters; you can also use short multitap delays. The lush Alesis/Lexicon sound has its place in mixing, but I find that sound a little too smooth to work as an instrument body. Obviously, reverb on master is pretty essential if your instruments are in a concert hall.

  12. Inharmonic modal synthesis (either with actual resonance or with decaying sine waves) can be enhanced with parallel ring modulation with one or more sine waves. This greatly multiplies the number of partials. I like to add a decay on the “modulator” sine waves. This works best for a grungy junk percussion sound, banging on pots and pans that aren’t carefully tuned by instrument builders.

  13. It’s not just the patch, it’s the sequencing. Humanize velocity; don’t just completely randomize it, make the velocity follow the musical phrasing. Also, louder playing is correlated with higher timing accuracy, and conversely softer playing benefits from humanization in arrival times.

  14. String players performing in détaché style tend to crescendo a little bit when they anticipate the bow change for the next note. I’ve never played winds, but I wouldn’t be surprised if they did something similar.

  15. For instruments that comprise a different physical source for every pitch (pianos, mallet instruments, pipe organs, harmonicas), try detuning each key by a tiny, fixed amount to emulate imperfections in the instrument. You can use a lookup table, but my favorite approach is to use the pitch to seed a random number generator; I use the Hasher UGen in SuperCollider a lot for this. Timbral parameters can be similarly randomized in this manner.

  16. Haas effect panning: delay one channel and put it through a random EQ that’s sloped for high-frequency loss.

  17. SuperCollider’s PitchShift is really great for adding a weird metallic “splash” which I mix back into the signal, sometimes fairly subtly. In general, find weird effects and use them in parallel at a low volume to simulate little imperfections that add up.

  18. The notion that bass has to be mono is a total myth, and only matters today if your music is being pressed to vinyl. (Trust me, nobody is summing your stereo channels.) Low instruments can absolutely have stereo image and using that can make a mix sound less electronic.

  19. If simulating an ensemble, some instruments will be further away than others, which can be mimicked with higher wet-dry ratio in a reverb and some high-frequency loss. This helps particularly for homogeneous ensembles like string orchestras.

  20. Wind and string instruments require significant dexterity or physical exertion to reach the higher notes in their registers. To mimic this, random detuning should be more dramatic for higher notes.

  21. Legato playing on winds and strings is typically emulated with rapid portamento, which sounds fine. Realism can be further improved by briefly fading in some noise and/or highpassing the source, simulating instability as the instrument physically transitions between consecutive notes.

  22. Saw and pulse waves are pretty obviously synthetic and often need considerable massaging to sound remotely acoustic. Consider other sources if you aren’t having success with those. Breakpoint synthesis (GENDY) is a favorite of mine.

  23. Common mixing advice states that you should avoid putting effects or processing on anything unless necessary as demonstrated by A/B test. This is a wise idea for recordings, but in mostly synthesized music, a long chain of subtle effects can create an “imperfection cascade” that helps get an extra 5% of realism. This only really helps if the sounds are already good enough to stand on their own.

  24. Unisons of multiple instruments can sound more realistic as a whole than exposed instruments, since they can mask each other’s imperfections, especially if those instruments are very different from each other.

  25. Slow parameter fluctuations happen not only for individual instruments, but for the entire ensemble, especially for a homogeneous group like a string quartet. Create an automation and map it to the intensity parameter of many instruments, each of which also fluctuates individually.
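Two of the tips above lend themselves to tiny formulas. Here is a hypothetical sketch of the 1/f² decay rule (tip 5) and pitch-seeded per-key randomness (tip 15); the reference values are arbitrary, and numpy’s seeded RNG stands in for SuperCollider’s Hasher:

```python
import numpy as np

def partial_decay_time(freq, t60_ref=2.0, f_ref=220.0):
    # Decay time roughly proportional to 1/f^2: going up an octave
    # multiplies the decay time by about 0.25.
    return t60_ref * (f_ref / freq) ** 2

def per_key_detune_cents(midi_note, max_cents=3.0):
    # Seed an RNG with the pitch so each key gets a small but *fixed*
    # detune, emulating per-key imperfections in pianos, mallets, organs.
    rng = np.random.default_rng(midi_note)
    return rng.uniform(-max_cents, max_cents)
```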

Composing with Accelerating Rhythms

Thanks to all who checked out my album Haywire Frontier. Yesterday, I gave a remote talk for the NOTAM SuperCollider meetup on the project. The talk wasn’t recorded, but I decided to rework it into prose. This is partially for the benefit of people that missed the event, but mostly because I’m too lazy to research and write a new post wholly from scratch this month.

It’s not necessary to listen to the album to understand this post, but of course I would appreciate it.

Acceleration from notes to an entire piece

One of the earliest decisions I had to make while planning out Haywire Frontier was how to approach rhythm. I’m a huge fan of breakcore and old school ragga jungle (Venetian Snares’ work convinced me to dedicate my life to electronic music), and partially as a result of that, unpitched percussion and complex rhythms are central to a lot of my output.

However, I resolved pretty early on that I didn’t want the rhythmic material of the project to fall into the grids and time signatures of dance music. My reasons for this are nebulous and difficult to articulate, but I think a big part is that I wanted to challenge myself. When I make beat-based music, which I do frequently, I tend to think relative to established genres like drum-‘n’-bass or techno or house, and I mimic the tropes of what I want to imitate. Removing those guardrails, while still trying to make music conducive to active listening, puts me out of my comfort zone. I like to put myself in creative situations where I feel a little awkward or uncomfortable, because if there’s anything I personally fear in my creative output, it’s complacency. [1]

So beats are out. An alternative, which I have used a lot in the past, is a type of randomized rhythm I call the “i.i.d. rhythm,” or “Pwhite-into-\dur rhythm:”

SuperCollider code:

// NB: Full aggregated code from the example, plus SynthDefs, are at the end of the post.
fork {
    loop {
        s.bind { Synth(\kick) };
        rrand(0.03, 0.6).wait;
    }
};

In these rhythms, the inter-onset intervals (IOIs), or time between successive hits, are chosen with a single random distribution. In statistics terms, the IOIs are i.i.d., or independently and identically distributed. The distribution is uniform in this example, but you can use log-uniform, or any distribution over the positive real numbers.
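For illustration, the same i.i.d. rhythm can be sketched outside SuperCollider; the bounds match the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
iois = rng.uniform(0.03, 0.6, size=16)  # i.i.d. inter-onset intervals (seconds)
onsets = np.concatenate(([0.0], np.cumsum(iois)[:-1]))  # kick times
```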

Every SuperCollider user has written one of these rhythms at some point. They’re perfectly serviceable for some applications. However, for rhythmic material that drives an entire percussion section, I have to admit that I find these tiresome and uninspiring. In one word, what these rhythms lack is phrasing.

If you were to grab a non-musician, give them a snare drum, and ask them to hit it “randomly,” their result would be nothing like this. They might produce a cluster of rapid hits, then silence, then a nearly steady rhythm, and modulate between all those approaches. That’s to say nothing of a free jazz drummer who’s spent years training to produce complex, compelling rhythms that may not fall on a grid. It’s well known to psychologists that humans are very bad at producing data that passes randomness tests; I view it as Geiger-counter-type rhythms failing to pass humanity tests.


Nathan Ho - Haywire Frontier

Album cover for Haywire Frontier. Digital drawing of an androgynous figure, mid-leap, brandishing two swords above their head.

My first full-length solo album, Haywire Frontier, is releasing on Saturday, September 9th on the Japanese label Tokinogake. It is available for preorder now, and you can listen to the opening track “Trickster Deity.”

Here are the liner notes:

In 2008, at the age of 11, I created Googology Wiki on my parents’ computer. “Googology” is a word for the study of large numbers and fast-growing functions, deriving from the 9-year-old Milton Sirotta’s coinage of the term “googol.” The website was never meant to go beyond my personal use, and I gradually drifted away from it. Fifteen years later, it has grown to tens of thousands of articles and a community of hundreds of active users.

Haywire Frontier is a 40-minute musical tribute to a strange corner of amateur mathematics whose growth I somewhat-inadvertently catalyzed, with rhythmic and formal material deriving from Georg Cantor’s “ordinal number” system, integral to the study of large numbers.

The album was sequenced and synthesized entirely in SuperCollider with no samples, external hardware, or third-party plugins.


  • John Tejada, mastering

  • Isa Hanssen (Instagram), cover art

  • Special thanks to a0n0, Charlie Burgin (Sahy Uhns), William Fields, RM Francis, Joonas Siren (Forces), Ben Tillotson, Nathan Turczan.

I expect to write about this project in the near future. Thank you for listening, and for all your support.

An Intro to Wavelets for Computer Musicians

I wasn’t able to get this post fully complete in time for my self-imposed monthly deadline. I have decided to put it up in an incomplete state and clean it up in early September. I hope it is informative even in its current condition, which gets increasingly sketchy towards the end. Open during construction.

Among DSP types, those unfamiliar with wavelets often view them as a mysterious dark art, vaguely rumored to be “superior” to FFT in some way but for reasons not well understood. Computer musicians with a penchant for unusual and bizarre DSP (for instance, people who read niche blogs devoted to the topic) tend to get particularly excited about wavelets purely for their novelty. Is the phase vocoder too passé for you? Are you on some kind of Baudelairean hedonic treadmill where even the most eldritch Composers Desktop Project commands bore you?

Well, here it is: my introduction to wavelets, specifically written for those with a background in audio signal processing. I’ve been writing this post on and off for most of 2023, and while I am in no way a wavelet expert, I finally feel ready to explain them. I’ve found that a lot of wavelet resources are far too detailed, containing information mainly useful to people wishing to invent new wavelets rather than people who just want to implement and use them. After you peel back those layers, wavelets are surprisingly not so scary! Maybe not easy, but I do think it’s possible to explain wavelets in an accessible and pragmatic way. The goal here is not to turn you into a wavelet guru, but to impart basic working knowledge (with some theory to act as a springboard to more comprehensive resources).

Before we go further, I have to emphasize an important fact: while wavelets have found many practical uses in image processing and especially biomedical signal processing, wavelets are not that common in audio. I’m not aware of any widely adopted and publicly documented audio compression codec that makes use of wavelets. For both audio analysis alone and analysis-resynthesis, the short-time Fourier transform and the phase vocoder are the gold standard. The tradeoffs between time and frequency resolution are generally addressable with multiresolution variants of the STFT.

There is no one “wavelet transform” but a huge family of methods, and new ones are developed all the time. To limit the scope of this post, I will introduce the two “classical” wavelet transforms: the Continuous Wavelet Transform (CWT) and Multiresolution Analysis (MRA). I’ll also go over popular choices of individual wavelets and summarize their properties. There are other wavelet transforms, some of them more musically fertile than CWT or MRA, but you can’t skip the fundamentals before moving on to those. My hope is that demystifying wavelet basics will empower more DSP-savvy artists to learn about these curious creatures.


The Duration Trick

The Duration Trick is something I was egotistical enough to believe I discovered, but after a recent conversation with William Fields (a musical hero of mine) I have learned that I’m in no way the first to come across it. Both Fields and the Max/MSP Shop Boys have been using something like this for a while, I’m told. Learning about this case of convergent evolution spurred me to bump up this already-planned post in the queue.

Simply put, the Duration Trick is when a synthesizer patch with discrete note on/off events is given advance notice of the duration of each note. Thus, if the sequencer is modeled as sending messages to a synthesizer:

Every “note on” message is accompanied by an anticipated duration.

It’s possible for such a patch to exactly anticipate the ending of the note, so note offs don’t even need to be transmitted, although I like to give the option to cut off the note prematurely. Additionally, duration can function as a velocity-like parameter that impacts other synthesis parameters such as amplitude or brightness, so short and long notes differ in more ways than just timing.

Imagine a monophonic subtractive synthesis patch with a lowpass filter that gradually opens for each note on. Traditionally, the lowpass filter’s trajectory is independent of the note duration, and may run its course or be cut short:

With duration information, it’s possible to guarantee that the lowpass filter reaches a precise goal at the end of each note:

I find the second example slightly more exciting in this simple demonstration. For a more complex example, jump about an hour into a video session I uploaded in May. The Duration Trick may sound like a small change at first, but it had a pretty radical impact on my music and sound design when I started using it. It shines especially for transitional sweeps that need to arrive right on time. Arguably, anyone who drags a reverse cymbal sample leading up to a drop is in a sense using the Duration Trick.
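As a hypothetical sketch of the filter example (the exponential sweep shape and parameter names are my own illustration): with the duration known at note-on, the cutoff trajectory can be normalized so it lands exactly on its target when the note ends.

```python
def filter_cutoff(t, duration, start_hz=200.0, end_hz=8000.0):
    # Normalize elapsed time by the duration known at note-on, so the
    # sweep reaches end_hz exactly when the note ends; the clamp holds
    # the target if the note overstays its anticipated duration.
    frac = min(max(t / duration, 0.0), 1.0)
    return start_hz * (end_hz / start_hz) ** frac
```

Without the duration, the same sweep would have to run at a fixed rate and either get cut short or run its course early, which is exactly the first example’s behavior.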

Note off events in MIDI can arrive at any time, so the Duration Trick isn’t achievable with standard traditional synthesizer hardware in the absence of some CC-based hack. (This is one of many reasons that pigeonholing everything into MIDI events has had long-term negative effects on music tech, but I digress.) The Duration Trick is therefore easiest to implement in one of the “nerd” music software environments like Csound, SuperCollider, etc., particularly anything that permits scripting. The trick is possible in a real-time context, but the sequencer must of course be able to look ahead far enough to know the durations at all, so it’s more semi-real-time than fully real-time. Durations are always available for music sequenced offline, and are generally available in algorithmic composition as well.

Musicians who play or sing melodies generally don’t think in individual note ons and offs, but rather phrases and gestures, if not something higher-level. Even the most reactive and on-the-fly improvisations often require calculating a few notes ahead, and this will impact subtleties of playing style. The Duration Trick alone doesn’t capture the complexities of musicians playing acoustic instruments, but it still appears to be a valuable stepping stone to breathing some more life into a synth patch.

Correlated Granular Synthesis

Decades after Curtis Roads’ Microsound, granular synthesis is making appearances here and there in the commercial plugin market. While it’s nice to see a wider audience for left-field sound design, I have my quibbles with some of the products out there. From what I’ve heard, so many of these products’ demos are covered in reverb in obvious compensation for something, showing that the plugins seem most suited for background textures and transitional moments. In place of sound, the developers seem to prioritize graphics — does watching 3D particles fly around in a physics simulation inspire the process of music production, or distract from it?

Finally, and most importantly, so many granular “synths” are in fact samplers based on buffer playback. The resulting sound is highly dependent on the sampled source, almost more so than the granular transformations. Sample-based granular (including sampling live input such as in Ableton Live’s Grain Delay) is fun and I’ve done it, but in many ways it’s become the default approach to granular. This leaves you and me, the sound design obsessives, with an opportunity to explore an underutilized alternative to sampled grains: synthesized grains.

This post introduces a possibly novel approach to granular synthesis that I call Correlated Granular Synthesis. The intent is specifically to design an approach to granular that can produce musical results with synthesized grains. Sample-based granular can also serve as a backend, but the idea is to work with the inherent “unflattering” quality of pure synthesis instead of piggybacking off the timbres baked into the average sample.

Correlated Granular Synthesis is well suited for randomization in an algorithmic music context. Here’s a random sequence of grain clouds generated with this method:


A Preliminary Theory of Sound Design

This post is my attempt at explaining my own philosophy of sound design. It’s not in final form, and subject to amendment in the future.

The type of sound design I refer to is specific to my own practice: the creation of sounds for electronic music, especially experimental music, and especially music produced using pure synthesis as opposed to recorded or sampled sound. The ideas presented might have broader applications, but I have no delusions that they’re in any way universal.

The theory I expound on isn’t in the form of constraints or value judgements, but rather a set of traits. Some are general, some specific, and my hope is that considering how a sound or a piece relates to these traits will take me (and possibly you) in some directions not otherwise considered. Some of the traits contain assertions like “if your sound does X, your audience will feel Y,” which in a composition may be employed directly, carefully avoided, or deconstructed. You’ll also note that the theory is concerned mostly with the final product and its impact on the listener, not so much the compositional or technical process. (Friends and collaborators are well aware that I use highly idiosyncratic and constrained processes, but those are less about creating music and more about “creating creating music.”)

No theory of sound design will replace actually working on sound design. Sound design isn’t a spectator sport, nor a cerebral exercise, and it has to be practiced regularly like a musical instrument. Reading this post alone is unlikely to make you a better sound designer, but if it’s a useful supplement to time spent in the studio, I’d consider this work of writing a success.

I will deliberately avoid talking about the topics of melody, harmony, counterpoint, consonance vs. dissonance, and tuning systems, and I’ll only talk about rhythm abstractly. There are many existing resources dedicated to these topics in a wide variety of musical cultures.


Audio Texture Resynthesis

Spectrograms of the audio signals later in the post. Left: a child singing; right: the resynthesized audio.


I was alerted to audio texture resynthesis methods by a student of mine who was interested in the collaborative work of researcher Vincent Lostanlen, musician Florian Hecker, and several others [Lostanlen2019] [Lostanlen2021] [Andén2019] [Muradeli2022]. Their efforts are built on an analysis method called “Joint Time-Frequency Scattering” (JTFS), based on the Continuous Wavelet Transform. In an attempt to understand the work better, I binged a wavelet transform textbook, [1] implemented a simplified version of JTFS-based resynthesis, and briefly exchanged emails with Lostanlen. His helpful answers gave me the impression that while JTFS is a powerful analysis technique, resynthesis was more of a side project, and there are ways to accomplish similar effects that are more efficient and easier to code without compromising too much on musicality.

Audio texture resynthesis has some history in computer music literature [Schwartz2010], and some researchers have used resynthesis to help understand how the human brain processes audio [McDermott2011].

After some experimentation with these methods, I found that it’s not too hard to build a simple audio texture resynthesizer that exhibits clear musical potential. In this blog post, I’ll walk through a basic technique for making such a system yourself. There won’t be any novel research here, just a demonstration of a minimum viable resynthesizer and my ideas on how to expand on it.


The above-mentioned papers have used fancy techniques including the wavelet transform and auditory filter banks modeled after the human ear. However, I was able to get decent results with a standard STFT spectrogram, then using phase reconstruction to get time-domain audio samples. The full process looks like this:

  1. Compute a magnitude spectrogram \(S\) of the time-domain input signal \(x\). A fairly high overlap is advised.

  2. Compute any number of feature vectors \(F_1(S),\, F_2(S),\, \ldots,\, F_n(S)\) and define their concatenation as \(F(S)\).

  3. Initialize a randomized magnitude spectrogram \(\hat{S}\).

  4. Use gradient descent on \(\hat{S}\) to minimize the error \(E(\hat{S}) = ||F(S) - F(\hat{S})||\) (using any norm such as the squared error).

  5. Use phase reconstruction such as the Griffin-Lim algorithm on \(\hat{S}\) to produce a resynthesized signal \(\hat{x}\).

The cornerstone of making this algorithm work well is that we choose an \(F(S)\) that’s differentiable (or reasonably close). This means that the gradient \(\nabla E\) can be computed with automatic differentiation (classical backpropagation). As such, this algorithm is best implemented in a differentiable computing environment like PyTorch or Tensorflow.

The features \(F(S)\), as well as their relative weights, greatly affect the sound. If \(F(S)\) is highly time-dependent then the resynthesized signal will mimic the original in evolution. On the other hand, if \(F(S)\) does a lot of pooling across the time axis then the resynthesized signal will mostly ignore the large-scale structure of the input signal. I’m mostly interested in the latter case, where \(F(S)\) significantly “remixes” the input signal and disregards the overall structure of the original.

We will represent \(S\) as a 2D tensor where the first dimension is frequency and the second is time. As a matrix, each row is an FFT bin, and each column a frame.
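A minimal PyTorch sketch of steps 2–4, using the layout above (rows are bins, columns are frames) and a deliberately simple time-pooled feature set; the feature choice, optimizer, and step count are my illustrative assumptions, not taken from the papers:

```python
import torch

def features(S):
    # Per-bin mean and standard deviation, pooled over the time axis.
    # Heavy pooling, so the input's large-scale structure is ignored.
    return torch.cat([S.mean(dim=1), S.std(dim=1)])

def resynthesize_spectrogram(S, steps=300, lr=0.05):
    target = features(S).detach()
    S_hat = torch.rand_like(S, requires_grad=True)  # step 3: random init
    opt = torch.optim.Adam([S_hat], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Step 4: squared-error feature matching; backprop supplies the gradient.
        loss = torch.sum((features(S_hat) - target) ** 2)
        loss.backward()
        opt.step()
    # Magnitudes are nonnegative; clamp before phase reconstruction (step 5).
    return S_hat.detach().clamp(min=0.0)
```

The result \(\hat{S}\) would then go through Griffin-Lim (step 5), e.g. a library implementation such as torchaudio’s, to recover time-domain audio.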

If using a fancy alternative to the magnitude spectrogram, such as the CWT or cochlear filter banks, you may have to do gradient descent all the way back to the time-domain samples \(x\). These analysis methods break down into linear frequency transforms that produce complex numbers, followed by computing the absolute value of each bin, so differentiability is maintained.
