Gesture Synthesis: An Introduction

Although the boundaries are often blurred, those who work with modular synthesizers and similar environments can divide musical signals into two categories: audio signals, which carry sounds in the audible frequency range, and control signals, which are lower-frequency signals that act as slow time-series data for modulation and triggering. (Control signals are not necessarily infrasonic, as they can contain e.g. sharp discontinuities that create frequency content in the audio range.) In SuperCollider, the .ar and .kr method selectors make this division explicit.

Many methods exist for synthesizing and processing audio signals, several of which we've talked about in the synthesis and effects tags. But what about control signals? I can think of standard LFOs, random LFOs, chaotic oscillators, arithmetic operations, filters, comparators, thresholding, logic gates, envelope generators, and feedback across any of these. I probably missed some, but those are the basics. It's a fine collection that's served analog aficionados for decades, and the tools are advanced enough to make things such as neural networks. One of the world's most adored Eurorack modules, Make Noise Maths, is specifically for creative generation and processing of control signals.

Modulation is the soul of sound synthesis. It can make or break a patch. Once I've come upon a patch that I like, my instinct -- arguably one induced by the SuperCollider environment -- is to modulate every parameter with a separate envelope generator or a random LFO and see what comes out. I've gotten good sounds out of this occasionally, but often it gives me lifeless and uninteresting patches that lack the humanity and expression of real instruments. Thinking about what I'm aiming for, the term "gesture" comes to mind, which I'd define as multidimensional modulation sources that come together to produce a sound with direction and meaning. It's subjective, but hopefully gives an idea.

An important component of gestural modulation is correlation of the various signals involved. If you imagine a skilled violinist playing a note in spiccato style, the many degrees of freedom of their right hand have to come together to produce the perfect stroke of the bow. If one could break down the various forces involved, the resulting signals would be highly correlated. Instantaneous feedback is also critical. A violinist can immediately see the position of their hand and bow, and hear the timbre they're producing, allowing a feedback loop of self-correction.

What approaches do we have for synthesis of gestures? To create correlated signals, my independent LFOs can be eschewed in favor of a single LFO mapped to many synthesis parameters. To add variety, the LFO can be put through a different filter before reaching each parameter. A crude form of feedback can be accomplished by taking the resulting audio signal, extracting machine listening features such as amplitude and brightness from it, and using these features as control signals that modulate the LFO in turn.
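Here is a minimal sketch of the single-LFO idea in Python, not a definitive implementation. The one-pole lowpass, the parameter names ("cutoff", "amplitude", "pan"), the control rate, and the filter coefficients are all my own assumptions for illustration, and the machine listening feedback path is left out.

```python
import math

def one_pole_lowpass(signal, coeff):
    """Smooth a signal; coeff in [0, 1), larger means slower response."""
    out, state = [], signal[0]
    for x in signal:
        state = coeff * state + (1.0 - coeff) * x
        out.append(state)
    return out

def correlated_modulators(num_samples=1000, control_rate=100.0, lfo_hz=0.5):
    # A single shared LFO, normalized to the range [0, 1].
    lfo = [0.5 + 0.5 * math.sin(2 * math.pi * lfo_hz * n / control_rate)
           for n in range(num_samples)]
    # Each (hypothetical) parameter receives the same LFO through a
    # different filter, so the signals differ but stay correlated.
    return {
        "cutoff": lfo,
        "amplitude": one_pole_lowpass(lfo, 0.9),
        "pan": one_pole_lowpass(lfo, 0.99),
    }

mods = correlated_modulators()
```

The heavier the filtering, the more a parameter lags and shrinks relative to the raw LFO, which is where the variety comes from.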

Another method to achieve quality gestures is via a physical interface. This is the heart of many projects presented at the NIME conference -- physical sensors produce tightly related multidimensional data and provide instant feedback, both desirable properties for gesture-like synthesis. For those of us who don't want to muck with building a physical interface, touchscreens are a great option. One of my musical heroes, the artist lodsb, described a patch in a ModWiggler post:

TCData [an iPad app] is sending just 7 midi CCs, pitchbend and polyphonic notes based on distance, speed, number of fingers, center of the finger group and x, y positions of the group. On the software side/daw I am controlling an instance of Massive (hah) with some FM-ish patch, mainly controlling the noise amount, a filter (cutoff + bandwidth), the reverb level and oscillator amount. Not much but you'll hear that such touch input can lead to entirely different "music" and a pretty diverse control over the timbres while still retaining control.

I recommend visiting that link and listening to his sound snippet. The quality of the resulting synthesis is a good testament to the efficacy of the method.

Since everyone has a smartphone now, multitouch, accelerometer, and gyroscope data are quite easy to gather -- so if you're looking to spice up a patch, try connecting it to your phone. As a low-tech solution, I built a simple offline-friendly Web app that lets you record six-dimensional multitouch time-series data and download it as JSON. There's a lot to be desired, and it's not a great user experience to have to download a file to your phone and then transfer it to a computer. Still, I've gotten some mileage out of it. I am interested in eventually hooking it up to a WebSockets server and using it to control real-time synthesizers, but that's beyond the scope of this blog post.

A future area of exploration could be to record tons of data in this Web app to create a large corpus of gestures that can be drawn from at random. It also seems like a good fit for training a recurrent neural network to generate new gestures.

One part of human anatomy with many, many degrees of freedom is our vocal apparatus. These can be measured unintrusively by recording the human voice and using linear predictive coding analysis (particularly the PARCOR coefficients) to control synthesizer parameters. Even using just the amplitude and pitch of your voice to control musical parameters can be quite powerful and expressive, especially if done in real time.

An excellent resource on creating lifelike modulation signals doesn't come from music at all: the twelve principles of animation. I have a mere Wikipedia understanding of these, but many of them are quite applicable to music: anticipation, slow in and slow out, secondary action, and squash and stretch.

I didn't invent gesture synthesis, nor did I even give it its name. For a sophisticated prior invention, Nick Longo of Cesium Sound filed US6066794A in 1997: "Gesture synthesizer for electronic sound device." It is worth reading his later essay titled "The Theory That Led to Gesture Synthesis." The patent itself is very complex, and I can't claim to understand the full thing, but I'll try my best to describe it in broad strokes. The invention is explained as a long series of modules that process input gesture data in series. Here they are.

Hysteresis module: Longo says it best:

Motion performed by a musician is usually accomplished using two muscles, one pulling in the direction of motion, and one pulling in the opposite direction to provide braking force. Muscles have a different force activation characteristic when contracting than when expanding. When the direction of motion is reversed, as when performing vibrato, the roles of the muscles are also reversed. The result is an inversion of the characteristic shape of the gesture in the forward and reverse direction.

The present invention seeks to emulate a musician's muscular interaction with an instrument. Accordingly, when modifying control data using the gesture synthesis modules, it is necessary to treat the data in the forward direction in a different, and usually opposite way from the data moving in the reverse direction.

The hysteresis module thus takes an input signal and splits it into "forward" and "reverse" signals. This can be accomplished with simple thresholding, with a Schmitt trigger, or by detecting whether the signal is increasing or decreasing. It is not clear to me what happens to the other signal when one signal is activated, but I suspect that the deactivated signal is set to zero.
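As a rough sketch of the increasing-vs-decreasing variant, here is some Python. It bakes in my guess that the inactive signal is zeroed, and it assumes an initial "forward" direction; neither detail is confirmed by the patent, and the function name is made up.

```python
def split_by_direction(signal):
    """Split a control signal into forward (rising) and reverse (falling) parts."""
    forward, reverse = [], []
    going_up = True  # assumption: start in the forward direction
    prev = signal[0]
    for x in signal:
        if x > prev:
            going_up = True
        elif x < prev:
            going_up = False
        # If x == prev, keep whatever direction was last detected.
        forward.append(x if going_up else 0.0)   # guess: inactive signal is zero
        reverse.append(0.0 if going_up else x)
        prev = x
    return forward, reverse

forward, reverse = split_by_direction([0.0, 0.2, 0.5, 0.4, 0.1, 0.3])
# forward is [0.0, 0.2, 0.5, 0.0, 0.0, 0.3]
# reverse is [0.0, 0.0, 0.0, 0.4, 0.1, 0.0]
```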

Salience modulation module: "Salience" is here a term for a stateless nonlinearity with a curve. A function that increases on the interval [0, 1] is said to have "positive salience" if its second derivative is mostly negative, and "negative salience" if its second derivative is mostly positive. The salience is modulated by an external signal to simulate the movement of antagonistic pairs of muscles. The forward and reverse signals are processed in parallel, each with opposite salience. For example, the forward signal is processed with positive salience while the reverse signal is processed with negative salience, or vice versa.
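One simple way to get a curve with adjustable salience is a power function, sketched below. This is my own stand-in and not the curve from the patent: the exponent mapping and the function names are assumptions. A positive `salience` gives an exponent below 1 (second derivative negative on (0, 1)), and a negative one gives an exponent above 1.

```python
def salience_curve(x, salience):
    """Map x in [0, 1] to [0, 1]; salience > 0 bows the curve upward."""
    # Hypothetical mapping: salience 0 is linear, +1 is sqrt, -1 is squaring.
    return x ** (2.0 ** -salience)

def apply_opposite_salience(forward, reverse, salience):
    """Process the two signals in parallel with opposite salience."""
    shaped_forward = [salience_curve(x, salience) for x in forward]
    shaped_reverse = [salience_curve(x, -salience) for x in reverse]
    return shaped_forward, shaped_reverse

shaped_f, shaped_r = apply_opposite_salience([0.25, 1.0], [0.5, 0.0], 1.0)
```

To modulate the salience with an external signal, `salience` would be supplied per sample instead of as a single constant.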

Time oscillator module: This models how muscles respond to electrical pulses from the nervous system. The forward signal is used to modulate the frequency of a clock signal. Each clock trigger cumulatively adds a fixed value to a sample-and-hold signal, producing an upward staircase. The reverse signal is put through analogous processing, but produces a downward staircase.
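My reading of the time oscillator can be sketched as a phase accumulator driving a running sum. The linear frequency mapping (`base_hz`, `depth_hz`) and all names here are assumptions of mine, not taken from the patent.

```python
def staircase(freq_signal, sample_rate, step, base_hz=1.0, depth_hz=10.0):
    """Clock whose rate follows freq_signal; each tick adds `step` to a held value."""
    phase, level, out = 0.0, 0.0, []
    for f in freq_signal:
        # Assumed mapping from control value to clock frequency.
        phase += (base_hz + depth_hz * f) / sample_rate
        if phase >= 1.0:      # clock trigger (at most one per sample in this sketch)
            phase -= 1.0
            level += step     # cumulative sample-and-hold
        out.append(level)
    return out

# Forward signal: upward staircase. Reverse signal: same processing, negative step.
up = staircase([0.0] * 8, sample_rate=8.0, step=1.0, base_hz=2.0, depth_hz=2.0)
down = staircase([1.0] * 8, sample_rate=8.0, step=-1.0, base_hz=2.0, depth_hz=2.0)
```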

Flex filter module: This makes use of a nonlinear equation called Hill's equation [Hill1938]. (Not to be confused with the biochemical Hill's equation by the same Hill, or Hill's differential equation by a different Hill.) Hill's equation is

\begin{equation*} (P + a)(v + b) = \text{constant} \end{equation*}

where \(P\) is the tension in the muscle, \(a\) and \(b\) are constants, \(v = dx/dt\) is the velocity at which the muscle is contracting, and \(x\) is the muscle contraction. Rearranging:

\begin{equation*} \frac{dx}{dt} = \frac{\text{constant}}{P + a} - b \end{equation*}

so if we have the muscle tension \(P\), a hyperbola-shaped function gives us the rate of change. Letting the forward and reverse signal outputs of the time oscillator module be \(P\), we apply the above equation and integrate to produce the muscle contraction \(x\). There appears to be logic that resets the integrators to prevent them from running out of control, but I don't understand it.
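The rearranged equation can be integrated numerically with a plain Euler step, as in this sketch. The values of \(a\), \(b\), and the constant are arbitrary numbers I picked so that the velocity is 1 at zero tension and 0 at a tension of 1; they are not from the patent, and the integrator reset logic is omitted since I don't understand it.

```python
def hill_velocity(tension, a=0.25, b=0.25, constant=0.3125):
    """dx/dt = constant / (P + a) - b, with made-up illustrative constants."""
    return constant / (tension + a) - b

def integrate_contraction(tension_signal, sample_rate, x0=0.0):
    """Euler-integrate the velocity to get the muscle contraction x."""
    x, out = x0, []
    dt = 1.0 / sample_rate
    for p in tension_signal:
        x += hill_velocity(p) * dt
        out.append(x)
    return out

# With zero tension the velocity is 1, so x rises by 1/sample_rate per sample.
contraction = integrate_contraction([0.0] * 4, sample_rate=4.0)
```

Without some form of reset or clamping, `x` grows without bound whenever the tension stays below 1, which is presumably why the patent includes that logic.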

Waveshaping module: This appears to be another stateless nonlinearity, but with curious S-shaped waveshaping functions. Different waveshaping functions are applied to the forward and reverse signals, and a lot of controls are presented to the user.

Scale module: Here the forward and reverse signals are first added together to produce a single signal. The scale module then maps this signal to a musical scale, so that the pitch bend range concerns scale degrees rather than semitones (or so I think).
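Here is one way my interpretation could look in Python. It rests on several assumptions: the summed signal is in [0, 1], the bend range is expressed in scale degrees, the output is quantized to the nearest degree, and the scale is major. The patent may well do something different.

```python
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets within one octave

def degree_to_semitones(degree, scale=MAJOR_SCALE):
    """Convert a (possibly negative) scale degree to a semitone offset."""
    octave, index = divmod(degree, len(scale))
    return 12 * octave + scale[index]

def quantize_to_scale(forward, reverse, bend_range_degrees=7, scale=MAJOR_SCALE):
    """Sum the two signals, then map the result onto scale degrees."""
    out = []
    for f, r in zip(forward, reverse):
        control = f + r  # the signals are first added together
        degree = round(control * bend_range_degrees)
        out.append(degree_to_semitones(degree, scale))
    return out

pitches = quantize_to_scale([0.0, 3 / 7, 1.0], [0.0, 0.0, 0.0])
# pitches is [0, 5, 12]: the root, the fourth degree up, and the octave
```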

Delay module: The final module runs the control signal through three parallel variable delay lines, which are then mixed together to produce the final signal. One is a "position-dependent delay" where the delay depends on the signal itself, simulating "force due to an elastic load." Another is a "velocity-dependent delay" where the delay is dependent on the derivative of the signal, simulating "the viscous damping element of muscle systems." The third delay line is a simple modulatable constant delay, simulating "force due to friction."
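The three parallel delays might be sketched as follows. The depth values, the equal-weight mix, the use of absolute values, and the linear interpolation are all my own choices for illustration, so treat this as a guess at the structure and not as the patent's actual design.

```python
def read_delayed(history, delay_samples):
    """Read a past sample from a list, with linear interpolation."""
    delay_samples = min(max(delay_samples, 0.0), len(history) - 1)
    whole = int(delay_samples)
    frac = delay_samples - whole
    newer = history[-1 - whole]
    older = history[-1 - min(whole + 1, len(history) - 1)]
    return (1.0 - frac) * newer + frac * older

def gesture_delay(signal, position_depth=20.0, velocity_depth=200.0,
                  constant_delay=10.0):
    """Mix three delay taps: position-dependent, velocity-dependent, constant."""
    history, out = [], []
    prev = signal[0]
    for x in signal:
        history.append(x)  # unbounded buffer, fine for a short sketch
        velocity = x - prev
        prev = x
        position_tap = read_delayed(history, position_depth * abs(x))
        velocity_tap = read_delayed(history, velocity_depth * abs(velocity))
        constant_tap = read_delayed(history, constant_delay)
        out.append((position_tap + velocity_tap + constant_tap) / 3.0)
    return out

step_response = gesture_delay([0.0] * 5 + [1.0] * 30)
```

Feeding in a step shows the smearing effect: the output stays at 0 right after the jump and only settles at 1 once all three taps have caught up.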

Did I get everything right? I have no idea. This signal chain is claimed to produce highly natural-sounding modulations. This doesn't mean that automatically applying the above will make your signal sound better -- as always, it's all in the details of how these parameters are set and used. I'm certain that this patent represents only a small fraction of the research Longo has put into his work, but it may be a good jumping-off point.

I hope this has been an interesting discussion on the possibilities of gesture synthesis in music, possibly with applications to other fields such as animation, game development, and VJing. Having completed this post, I'll never use another LFO again.


Researching this post has certainly changed the way I think of LFOs.

[Hill1938] Hill, A. V. 1938. "The heat of shortening and the dynamic constants of muscle."
