Continuous Gesture Recognition and Temporal Mapping
Several authors have pointed out the importance of continuous interaction in Digital Musical Instruments (DMIs) and movement sonification. Recent approaches focus on extending gesture recognition and analysis methods to include continuous representations of gestures in terms of temporal and/or dynamic variations. Bevilacqua et al. (2005, 2010) developed Gesture Follower for continuous gesture recognition and following. The system is built upon a template-based implementation of Hidden Markov Models (HMMs). It can learn a gesture from a single example by associating each frame of the template gesture with a state of a hidden Markov chain. At runtime, the model continuously estimates parameters characterizing the temporal execution of the gesture. In particular, the system performs a real-time alignment of a live gesture over a reference recording, continuously estimating the time progression within the reference template.
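The frame-per-state idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Gesture Follower implementation: the function name `follow`, the one-dimensional input, the Gaussian observation width `sigma`, and the stay/advance/skip transition probabilities are all hypothetical choices made for the example.

```python
import math


def follow(template, live, sigma=0.05, p_stay=0.3, p_next=0.4, p_skip=0.3):
    """Estimate the time progression (0..1) of `live` within `template`.

    Each template frame is one state of a left-to-right HMM whose
    observation model is a Gaussian centred on that frame (assumption).
    The template must contain at least two frames.
    """
    n = len(template)
    alpha = [1.0] + [0.0] * (n - 1)  # the gesture starts at the first frame
    progress = []
    for t, x in enumerate(live):
        if t > 0:
            # Left-to-right transitions: stay, advance one state, or skip one.
            pred = [0.0] * n
            for i, a in enumerate(alpha):
                pred[i] += a * p_stay
                if i + 1 < n:
                    pred[i + 1] += a * p_next
                if i + 2 < n:
                    pred[i + 2] += a * p_skip
        else:
            pred = alpha
        # Weight each state by the likelihood of the incoming frame.
        post = [p * math.exp(-0.5 * ((x - m) / sigma) ** 2)
                for p, m in zip(pred, template)]
        total = sum(post)
        if total > 0.0:
            alpha = [p / total for p in post]
        else:
            # Observation too far from every state: keep the prediction.
            norm = sum(pred)
            alpha = [p / norm for p in pred]
        # Expected state index, normalized to a 0..1 progression.
        progress.append(sum(i * a for i, a in enumerate(alpha)) / (n - 1))
    return progress


# A ramp recorded once, then performed at half speed.
template = [i / 19 for i in range(20)]
live = [i / 39 for i in range(40)]
progress = follow(template, live)
```

Because the forward variable is updated one frame at a time, the progression estimate is available causally, which is what makes this kind of model usable for real-time following.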
The Temporal Mapping paradigm introduced by Bevilacqua et al. (2011) takes advantage of this system to continuously synchronize audio with gestures. The result of the temporal alignment computed by Gesture Follower is mapped to a granular or phase-vocoding audio engine, which realigns the audio track to the live performance of the gesture (see Figure 2.1). Modeling the precise time structure of the gesture thus provides a new way of interacting with audio processing, mainly focused on the temporal dimension of the sound. However, the restriction to single-example learning limits the possibility of capturing the expressive variations that intrinsically occur between several performances of the same gesture.
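As a rough sketch of how an alignment result could drive a granular engine, the snippet below converts progression estimates into grain start times within the audio track. The function name `grain_positions`, the grain length, and the clamping behaviour are assumptions for illustration; the actual audio engines are not specified here.

```python
def grain_positions(progress, audio_duration, grain_length=0.05):
    """Map progression estimates (0..1) to grain start times in seconds.

    Hypothetical helper: the progression is clamped to [0, 1] and scaled
    so that the last grain still fits inside the audio track.
    """
    latest_start = max(audio_duration - grain_length, 0.0)
    return [min(max(p, 0.0), 1.0) * latest_start for p in progress]


# Progression values as they might arrive from a gesture follower.
positions = grain_positions([0.0, 0.5, 1.0, 1.2], audio_duration=10.0,
                            grain_length=0.5)
```

Slowing down or pausing the gesture then slows down or freezes the read position in the audio, which is the temporal control this paradigm focuses on.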
Caramiaux et al. (2014a) extended this approach with an adaptive system based on particle filtering. Gesture Variation Follower (GVF) is a template-based method that tracks several features of the movement in real time: its time progression, but also a set of variations, for example the offset position, size, and orientation of two-dimensional gestures. Caramiaux et al. show that the model is efficient for early and continuous recognition. It consistently tracks gesture variations, which allows users to control continuous actions through these variations. However, the variations estimated by the system must be programmed as a specific state-space model and therefore need to be adapted to each use case.
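To make the state-space idea concrete, here is a generic bootstrap particle filter that tracks the phase, speed, and amplitude scale of a one-dimensional gesture relative to a template. It is a sketch under assumptions, not the GVF implementation: the state layout, the noise levels, the prior ranges, and the name `track` are all hypothetical.

```python
import math
import random


def track(template, live, n_particles=1000, obs_sigma=0.1, seed=0):
    """Return one (phase, scale) estimate per live frame.

    Assumed observation model: live ~ scale * template(phase) + noise.
    Assumed dynamics: phase advances by a slowly varying speed, and the
    scale follows a small random walk.
    """
    rng = random.Random(seed)
    n = len(template)

    def lookup(phase):
        # Linear interpolation of the template at a phase in [0, 1].
        pos = phase * (n - 1)
        i = min(int(pos), n - 2)
        frac = pos - i
        return (1.0 - frac) * template[i] + frac * template[i + 1]

    # Each particle is [phase, speed, scale]; priors are assumptions.
    particles = [[0.0, rng.uniform(0.5, 2.0) / n, rng.uniform(0.5, 3.0)]
                 for _ in range(n_particles)]
    estimates = []
    for x in live:
        # Weight particles by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * ((x - s * lookup(ph)) / obs_sigma) ** 2)
                   for ph, _, s in particles]
        total = sum(weights)
        if total == 0.0:
            weights, total = [1.0] * n_particles, float(n_particles)
        weights = [w / total for w in weights]
        phase = sum(w * p[0] for w, p in zip(weights, particles))
        scale = sum(w * p[2] for w, p in zip(weights, particles))
        estimates.append((phase, scale))
        # Multinomial resampling, then propagation through the dynamics.
        resampled = rng.choices(particles, weights=weights, k=n_particles)
        particles = []
        for ph, v, s in resampled:
            v += rng.gauss(0.0, 0.0005)
            ph = min(max(ph + v + rng.gauss(0.0, 0.002), 0.0), 1.0)
            s += rng.gauss(0.0, 0.01)
            particles.append([ph, v, s])
    return estimates


# The live gesture is the template performed at twice the amplitude.
template = [1.0 + i / 49 for i in range(50)]
live = [2.0 * v for v in template]
estimates = track(template, live)
```

Adding another variation (for example a rotation) would mean extending both the particle state and the observation model, which illustrates why the tracked variations have to be designed for each use case.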
Multilevel temporal models
In Gesture Follower, each gesture is represented as a single time profile, and the gestures are independent during the recognition process. This assumption can be limiting: representing gestures as unbreakable units does not enable the creation of complex time structures in musical performance.
Widmer et al., investigating artificial intelligence methods to analyze musical expressivity, underline the need for multilevel models: “Music performance is a multilevel phenomenon, with musical structures and performance patterns at various levels embedded in each other.” (Widmer et al., 2003).
Recent findings about pianists’ finger tapping emphasize two factors constraining musical gestures: biomechanical coupling and chunking (Loehr and Palmer, 2007). Introduced by Miller in the fifties (Miller, 1956), chunking suggests that “perceived action and sound are broken down into a series of chunks in people’s mind when they perceive or imagine music” (Godøy et al., 2010). More than just segmenting a stream into small entities, chunking refers to their transformation and construction into larger and more significant units. Jordà (2005, 2008) argued for the need to consider different control levels allowing for either intuitive or compositional decisions.
Recently, Caramiaux (2012) highlighted the intricate relationship between hierarchical temporal structures in both gesture and sound when the gesture is performed in a listening situation. Thus, the design of systems implementing action-sound mapping should take into account different levels of temporal resolution, organized in a hierarchical structure.
Other fields of study such as speech processing (Ostendorf et al., 1996; Russell, 1993) and activity recognition (Aggarwal and Cai, 1997; Park and Aggarwal, 2004; Guerra-Filho and Aloimonos, 2007; Duong et al., 2009) exhibit a growing interest in hierarchical representations. Several extensions of Hidden Markov Models have been proposed to address their independence limitations. In the Segmental Hidden Markov Model (SHMM) (Ostendorf et al., 1996), each hidden state emits a sequence of observations, or segment, given a geometric shape and a duration distribution. The model has been successfully applied to time profile recognition of pitch and loudness (Bloit et al., 2010) and was exploited for gesture modeling in a recent study (Caramiaux et al., 2012). However, the model is not straightforward to implement for online inference. As an alternative, we proposed in previous work to use a hierarchical extension of HMMs with a structure and learning method similar to Gesture Follower. We developed and evaluated a real-time implementation of the Hierarchical Hidden Markov Model in the context of music performance (Françoise, 2011).
Software for Mapping Design with Machine Learning
Many methods for discrete gesture recognition have been implemented as Max or Pure Data externals, and a number of machine learning libraries are available online for different languages and platforms. Among others, the SARC Eyesweb catalog (Gillian and Knapp, 2011) and the Gesture Recognition Toolkit (Gillian and Paradiso, 2014) implement a number of gesture classification algorithms (SVM, DTW, and Naive Bayes, among others). The Wekinator (Fiebrink, 2011), detailed in Section 2.3.3, implements a wide range of methods from the Weka machine learning toolbox, such as Adaboost, Neural Networks, and Hidden Markov Models. Several models for continuous gesture recognition and following are also available, such as Gesture Follower (Bevilacqua et al., 2010) and Gesture Variation Follower (GVF) (Caramiaux et al., 2014a).
Programming by Demonstration
Programming-by-Demonstration (PbD) is a related field of computer science that studies tools for end-user programming based on demonstrations of the target actions. One of the primary challenges in PbD is to go beyond the simple reproduction of actions toward the generalization of tasks or concepts.
Hartmann et al. (2007) highlight how difficult it is for interaction designers to map between sensor signals and application logic, because most tools for programming interaction are textual and are rarely conceived to encourage rapid exploration of design alternatives. They introduce the Exemplar system, which provides a graphical interface for interactive visualization and filtering of sensor signals, combined with pattern recognition based on Dynamic Time Warping (DTW). A qualitative user study shows that their approach reduces prototype creation time and encourages experimentation, modifications, and alternative designs through direct assessment of the user experience. A similar approach is adopted by Lü and Li (2012) for the case of multitouch surface gestures.
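For readers unfamiliar with DTW, the following sketch shows the textbook dynamic-programming recurrence and a nearest-template classifier built on top of it. It is not the implementation of the system described above; the function names, the absolute-difference local cost, and the example templates are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def classify(signal, templates):
    """Return the label of the demonstrated template closest to `signal`."""
    return min(templates, key=lambda label: dtw_distance(signal, templates[label]))


# Two demonstrated patterns and a slower performance of the first one.
templates = {"bump": [0, 1, 2, 1, 0], "dip": [2, 1, 0, 1, 2]}
label = classify([0, 0, 1, 2, 2, 1, 0], templates)
```

Because the warping path may repeat frames of either sequence, a slower or faster performance of the same shape still matches its template with a low cost.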
Programming-by-Demonstration has become a primary methodology in robotics, as a way to teach robot interactions from human examples. This body of work, in particular its thread focusing on robot motor learning, is described later in Section 2.5.3.
The PbD methodology has been applied to the field of interactive computer music. Merrill and Paradiso (2005) described a PbD procedure for programming musical interaction with the FlexiGesture interface. The system implements a classification-triggering paradigm for playing sound samples, along with continuous mappings where only the range of the input sensor is adapted to the range of the parameter.
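The simplest form of continuous mapping mentioned above, where only ranges are adapted, can be sketched as a linear rescaling whose input bounds are taken from a demonstration. The helper name `make_range_mapping`, the clamping of out-of-range input, and the example values are assumptions; this is not the FlexiGesture implementation.

```python
def make_range_mapping(demonstration, out_min, out_max):
    """Build a mapping that rescales the demonstrated input range
    linearly onto the parameter range [out_min, out_max]."""
    lo, hi = min(demonstration), max(demonstration)
    if hi == lo:
        raise ValueError("the demonstration must cover a non-empty range")

    def mapping(x):
        # Normalize to 0..1 and clamp values outside the demonstrated range.
        t = min(max((x - lo) / (hi - lo), 0.0), 1.0)
        return out_min + t * (out_max - out_min)

    return mapping


# Sensor values seen during a demonstration, mapped to a 100-1000 Hz cutoff.
to_cutoff = make_range_mapping([2.0, 4.0, 6.0], 100.0, 1000.0)
```

Such a mapping adapts to how far the performer actually moves the sensor, but the shape of the relationship stays fixed, which is the limitation that regression-based approaches try to lift.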
Interactive Machine Learning in Computer Music
Our review of mapping strategies in Digital Musical Instrument design highlights that, in sound and music computing, complex mappings involving many-to-many associations are often preferred to simple triggering paradigms. However, most approaches to interactive machine learning have focused on discrete tasks such as classification.
To overcome such issues and fit the context of interactive computer music, Fiebrink’s Wekinator (2011) implements various machine learning methods for both recognition and regression in a user-centered workflow, illustrated in Figure 2.2. The Wekinator encourages iterative design and multiple alternatives through an interaction loop articulating configuration of the learning problem (selection of features and algorithm), creation/editing of the training examples, training, and evaluation.
Mapping has often been defined as the layer connecting motion sensors and sound synthesis parameters (Rovan et al., 1997). This definition is advantageous from a technical point of view, as it is restricted to a set of mathematical operations between parameters. However, sound control parameters might not always be musically, perceptually, or metaphorically relevant, nor are sensors or their extracted motion features necessarily relevant to the perceptive or expressive attributes of movement. Elaborate sound synthesis models, such as physical models (Rovan et al., 1997), already integrate a significant amount of translation of the input parameters (e.g., the speed and pressure of a bow) to musically relevant parameters such as pitch and loudness. Describing the mapping alone is therefore not sufficient to understand the implications in terms of the action-perception loop: it should systematically be described in conjunction with the movement input device, sound synthesis, and all other intermediate operations (pre-processing, feature extraction, parameter mapping). Recently, Van Nort et al. (2014) proposed topological and functional perspectives on mapping that argue for considering the musical context as a determinant in mapping design:
“[…] we must remember that mapping per se is ultimately tied to perception and, more directly, to intentionality. In the former case this means building a mapping around certain key action–sound associations through the design of appropriate correspondences of system states. In the latter case, this means conditioning this association towards the continuous gestures that will ultimately be experienced.” (Van Nort et al., 2014)
This brings out the alternative perspective of considering mapping as the entire action–perception loop that relates the performer’s movements to the resulting sound.
Table of contents:
Abstract
Résumé
1 Introduction
1.1 Background and General Aim
1.2 Probabilistic Models: from recognition to synthesis
1.3 Mapping by Demonstration: Concept and Contributions
1.4 Outline of the Dissertation
2 Background & Related Work
2.1 Motion-Sound Mapping: from Wires to Models
2.1.2 Mapping with Physical Models
2.1.3 Towards Mapping by Example: Pointwise Maps and geometrical properties
2.1.4 Software for Mapping Design
2.2 Mapping with Machine Learning
2.2.1 An Aside on Computer vision
2.2.2 Discrete Gesture Recognition
2.2.3 Continuous Gesture Recognition and Temporal Mapping
2.2.4 Multilevel temporal models
2.2.5 Mapping through Regression
2.2.6 Software for Mapping Design with Machine Learning
2.3 Interactive Machine Learning
2.3.1 Interactive Machine Learning
2.3.2 Programming by Demonstration
2.3.3 Interactive Machine Learning in Computer Music
2.4 Closing the Action-Perception Loop
2.4.1 Listening in Action
2.4.3 Motivating Mapping-by-Demonstration
2.5 Statistical Synthesis and Mapping in Multimedia
2.5.1 The case of Speech: from Recognition to Synthesis
2.5.2 Cross-Modal Mapping from/to Speech
2.5.3 Movement Generation and Robotics
3 Mapping by Demonstration
3.2 Definition and Overview
3.3 Architecture and Desirable Properties
3.3.2 Requirements for Interactive Machine Learning
3.3.3 Properties of the Parameter Mapping
3.4 The notion of Correspondence
3.5 Summary and Contributions
4 Probabilistic Movement Models
4.1 Movement Modeling using Gaussian Mixture Models
4.1.3 Number of Components and Model Selection
4.1.4 User-adaptable Regularization
4.2 Designing Sonic Interactions with GMMs
4.2.1 The Scratching Application
4.2.2 From Classification to Continuous Recognition
4.2.3 User-Defined Regularization
4.3 Movement Modeling using Hidden Markov Models
4.3.4 Number of States and Model Selection
4.3.5 User-Defined Regularization
4.4 Temporal Recognition and Mapping with HMMs
4.4.2 User-defined Regularization and Complexity
4.4.3 Classification and Continuous Recognition
4.5 Segment-level Modeling with Hierarchical Hidden Markov Models
4.6 Segment-level Mapping with the HHMM
4.6.1 Improving Online Gesture Segmentation and Recognition
4.6.2 A Four-Phase Representation of Musical Gestures
4.6.3 Sound Control Strategies with Segment-level Mapping
4.7 Summary and Contributions
5 Analyzing (with) Probabilistic Models: A Use-Case in Tai Chi Performance
5.1 Tai Chi Movement Dataset
5.1.1 Tasks and Participants
5.1.2 Movement Capture
5.1.3 Segmentation and Annotation
5.2 Analyzing with Probabilistic Models: A Methodology for HMM-based Movement Analysis
5.2.1 Tracking temporal Variations
5.2.2 Tracking Dynamic Variations
5.3 Evaluating Continuous Gesture Recognition and Alignment
5.4 Getting the Right Model: Comparing Models for Recognition
5.4.1 Continuous Recognition
5.4.2 Types of Errors
5.4.3 Comparing Models for Continuous Alignment
5.4.4 Gesture Spotting: Models and Strategies
5.5 Getting the Model Right: Analyzing Probabilistic Models’ Parameters
5.5.1 Types of Reference Segmentation
5.5.2 Model Complexity: Number of Hidden States and Gaussian Components
5.6 Summary and Contributions
6 Probabilistic Models for Sound Parameter Generation
6.1 Gaussian Mixture Regression
6.1.1 Representation and Learning
6.1.3 Number of Components
6.2 Hidden Markov Regression
6.2.1 Representation and Learning
6.2.2 Inference and Regression
6.2.4 Number of States and Regularization
6.2.5 Hierarchical Hidden Markov Regression
6.3 Illustrative Example: Gesture-based Control of Physical Modeling Sound Synthesis
6.3.1 Motion Capture and Sound Synthesis
6.3.2 Interaction Flow
6.4.1 Estimating Output Covariances
6.4.2 Strategies for Regression with Multiple Classes
6.5 Summary and Contributions
7 Movement and Sound Analysis using Hidden Markov Regression
7.1.2 Time-based Hierarchical HMR
7.1.3 Variance Estimation
7.2 Average Performance and Variations
7.2.1 Consistency and Estimated Variances
7.2.2 Level of Detail
7.3 Synthesizing Personal Variations
7.4.1 Synthesizing Sound Descriptors Trajectories
7.4.2 Cross-Modal Analysis of Vocalized Movements
7.5 Summary and Contributions
8 Playing Sound Textures
8.1 Background and Motivation
8.1.1 Body Responses to Sound stimuli
8.1.2 Controlling Environmental Sounds
8.2 Overview of the System
8.2.2 Sound Design
8.3 Siggraph’14 Installation
8.3.1 Interaction Flow
8.3.2 Movement Capture and Description
8.3.3 Interaction with Multiple Corpora
8.3.4 Textural and Rhythmic Sound Synthesis
8.3.5 Observations and Feedback
8.4 Experiment: Continuous Sonification of Hand Motion for Gesture Learning
8.4.1 Related Work: Learning Gestures with Visual/Sonic Guides
8.5 Summary and Contributions
9 Motion-Sound Interaction through Vocalization
9.1 Related Work: Vocalization in Sound and Movement Practice
9.1.1 Vocalization in Movement Practice
9.1.2 Vocalization in Music and Sound Design
9.2 System Overview
9.2.1 Technical Description
9.2.2 Voice Synthesis
9.3 The Imitation Game
9.3.1 Context and Motivation
9.3.2 Overview of the Installation
9.3.5 Discussion and Future Work
9.4 Vocalizing Dance Movement for Interactive Sonification of Laban Effort Factors
9.4.1 Related Work on Movement Qualities for Interaction
9.4.2 Effort in Laban Movement Analysis
9.4.3 Movement Sonification based on Vocalization
9.4.6 Discussion and Conclusion
9.6 Summary and Contributions
10 Conclusion
10.1 Summary and Contributions
10.2 Limitations and Open Questions
A Appendix
A.2 The XMM Library
A.2.1 Why another HMM Library?
A.2.4 Max/MuBu Integration
A.2.5 Example patches
A.2.6 Future Developments
A.2.7 Other Developments
A.3 Towards Continuous Parametric Synthesis
A.3.1 Granular Synthesis with Transient Preservation
A.3.2 A Hybrid Additive/Granular Synthesizer
A.3.3 Integration in a Mapping-by-Demonstration System
A.3.4 Qualitative Evaluation
A.3.5 Limitations and Future Developments
Bibliography