Adversarial Learning based Anonymization


A brief historical overview of speech processing and privacy

Speech processing has come a long way since 1881, when the earliest device for recording speech was invented by Alexander Graham Bell. It used a rotating cylinder coated with wax over which up-and-down grooves could be cut by a stylus responding to the acoustic pressure generated by the sound wave. One can only imagine the tremendous challenges posed by this device to record, process, and store speech signals. Thankfully, it has been replaced by microphones, which capture the acoustic pressure of sound waves and record it as a relative change in voltage. Several such historical advancements in speech technology facilitated convenient and large-scale speech processing, eventually leading to the current privacy crisis. In particular, Homer Dudley’s work [66] inspired several generations of researchers to focus on making speech a mainstream medium for human-computer interaction, which propelled forward the large-scale storage of speech data and, more broadly, the domain of speech signal processing.
Recall the “speaking machine” invented by von Kempelen, introduced in Section 1.1, which could produce a few human-like sounds. In the mid-1800s, Sir Charles Wheatstone improved upon its design [321] using adjustable and configurable leather resonators capable of producing many more speech-like sounds. This model was adopted by Homer Dudley to design an electrical speech synthesizer [64] for Bell Labs. The synthesizer could be operated like a piano, with hand controls to switch between voiced and unvoiced sounds, keys to control the characteristics of the signal, and a foot pedal to control the pitch. It was called the VODER (Voice Operation Demonstrator) and was first demonstrated at the New York World’s Fair in 1939. This event attracted the attention of researchers worldwide, leading to the formation of several speech interest groups in the community. Dudley also pioneered the field of speech coding [270], which aims to represent speech signals for efficient storage and transmission by exploiting their inherent redundancies, and proposed the analysis-synthesis method [65, 67] for speech coding. The initial usage of speech technology was predominantly envisioned in controlled settings, such as offices and research labs, where storage was limited and, through experience and training, the people being recorded gradually became cautious not to divulge private information in the collected data. The recent advances that have brought speech interfaces into our homes at the consumer level are quite new, and the privacy-related implications of this technology are still being explored. Today, speech interfaces are present in personal mobile phones as well as digital assistants, which have a widespread consumer base. Exposing an unaware user to such advanced technology opens the door for potential adversaries to exploit the sensitive attributes present in the speech signal.
Several researchers have studied the security and privacy vulnerabilities of digital assistants [72, 166, 84] and their third-party applications [172]. The two most concerning privacy issues are the “always listening” feature and the cloud storage of audio queries. The device remains in an inert state of buffering and re-recording until the wake word is spotted [135]; it then records the audio and sends it to a cloud-based service for ASR and natural language understanding (NLU). All the audio files are usually stored in the user’s account and can be accessed by logging into the account. This data may contain sensitive details about the user’s life, such as bank details. A compromised account can lead to a user’s private speech data being leaked to the public. Due to the rich nature of the speech signal described earlier, not only the linguistic content but many other attributes of the speaker may become known to a malicious entity.
Extensive surveys of digital assistant users have been conducted to understand their mental models, beliefs, attitudes, and concerns towards their devices. Some studies [1, 166] show that users have an incorrect understanding of how digital assistants work and of the third-party services with which their sensitive data is shared. They are also unaware of the existing privacy controls in the digital assistant architecture. Malkin et al. [191] show that half of the users do not know that audio queries are permanently retained in their account; users are unaware of existing privacy features and express the need for automatic deletion of their recordings. Huang et al. [128] studied users’ behaviour and privacy concerns when a digital assistant is shared among several housemates. Bispham et al. [27] present a taxonomy of attacks on speech interfaces which motivates future research on voice privacy to focus on the exact vulnerabilities present in such devices. These surveys make several recommendations to users and manufacturers, such as turning off the microphone when not in use, updating the firmware with the latest release, strict data deletion policies, and screening of sensitive content. The above studies indicate that users of speech interfaces are gradually becoming more aware of the underlying mechanisms and more concerned about their privacy being leaked through the interface. In this thesis, we aim to propose speaker anonymization techniques that protect users’ identity at the source, without requiring them to put in significant effort. These techniques can be built directly into the device firmware by the manufacturers.

Principles and tools of speech processing

Now let us introduce some basic principles and tools of speech processing behind the proposed methods. We start with the basics of speech as a signal, and how it is processed to extract relevant features with physiological and phonetic considerations. We give a brief account of artificial neural networks due to their pervasive use as statistical models in speech processing tasks. Then, we describe the technology behind the three most popular speech applications that enable the design and evaluation of our proposed methods: automatic speech recognition, speech synthesis, and automatic speaker recognition.

Fundamentals of speech processing

In this section, we briefly discuss the mechanism of human speech production followed by its representation and processing as a discrete-time signal.

Vocal tract. The physiological apparatus that generates speech is called the vocal tract [112], which starts at the lungs and ends at the lips and the nostrils. The larynx (also called the voice box) separates the vocal tract into two anatomical regions: the lower part is called the sublaryngeal region and the upper part is called the supralaryngeal region. The sublaryngeal region of the vocal tract is composed of the diaphragm, the lungs, and the trachea (also called the windpipe). The air flows outward from the lungs and encounters a pair of flap-like structures in the larynx, called the vocal folds. When the vocal folds are held at an intermediate tension so that they are not too close or too far apart, the movement of the air induces ripples along their length. This causes them to vibrate, and the result is voicing. Voicing is the cause of periodic segments in the speech signal which are called voiced regions. On the contrary, when the vocal folds are held at sufficient distance from each other so that air flows freely through them, they do not vibrate, which results in voicelessness. This can be observed in the speech signal as aperiodic segments which look like random noise and are known as unvoiced regions.
The supralaryngeal region, which is composed of the oral cavity and the nasal cavity, plays a major role in determining the exact nature and quality of the sounds that are produced. The different parts of the supralaryngeal region that contribute towards the articulation of different vowels and consonants are referred to as articulators. The major articulators in the oral cavity are the lips, the teeth, the tongue, the alveolar ridge, the hard palate, and the velum. Among these, the tongue and the lower lip are the active articulators, whereas the others are passive and immobile. The complex interaction between active and passive articulators to completely stop the airflow, constrict it through a narrow channel, or allow it to pass through without restriction gives us the vast variety of speech sounds found in all of the world’s languages.
Phonemes. The vocal tract is a continuous system capable of producing infinitely many sounds. These sounds are called phones. The exact physical mechanism of producing phones by the vocal tract, their transmission in acoustic space, and their auditory perception by the human ear are studied under the branch of linguistics called phonetics [112], which is independent of language. A given language can have only a small, finite number of sound units that can be used to compose words in that language and have some grammatical significance. These sounds must be perceivably distinct from each other for effortless communication and are called phonemes. The organization of phonemes, their combinations to produce words, and their semantic role in language are studied under the branch of linguistics called phonology [35]. Phonology categorizes the continuous signal produced by the vocal tract into discrete phoneme classes based on their acoustic, articulatory, and perceptual characteristics. Most languages feature two broad classes of phonemes, namely vowels, which are voiced sounds produced with no obstruction by the articulators, and consonants, which are produced by obstructing the airflow passing through the vocal tract. Although every language has a different set of phonemes, the International Phonetic Alphabet (IPA) [267] describes the universal set of phonemes based on their articulatory characteristics. Vowels are described based on the position of the tongue and the roundedness of the lips. The tongue is a highly active articulator and is subdivided into the front, central and back parts which can move somewhat independently of each other. It can also be placed at different heights to control the width of the constriction in the vocal tract. For example, /i/ as in “feed” is made by placing the front part of the tongue close to the hard palate, hence it is categorized as a close front vowel without rounding, while /o/ as in “foe” is made by raising the back of the tongue up to a certain height and rounding the lips, hence it is a close-mid back vowel with rounding. Consonants are categorized based on the presence or absence of voicing, the place of articulation that indicates the place of constriction in the vocal tract, and the manner of articulation which is the method of air release. For instance, /p/ as in “pan” is a voiceless consonant made by completely blocking the airflow using the lips, hence it is categorized as a voiceless bilabial plosive, whereas /z/ as in “zoo” is a voiced consonant produced by making a narrow constriction by placing the tip of the tongue close to the alveolar ridge, therefore it is a voiced alveolar fricative.
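To make this categorization concrete, the following minimal sketch (in Python, purely for illustration; the tiny phoneme inventory and feature labels are taken from the examples above, not from any phonetic toolkit) represents phonemes as a mapping from IPA symbols to articulatory features:

```python
# Illustrative only: a tiny subset of the IPA, represented as a mapping from
# phoneme symbols to articulatory features (the examples discussed above).
PHONEME_FEATURES = {
    # consonants: (voicing, place of articulation, manner of articulation)
    "p": ("voiceless", "bilabial", "plosive"),    # as in "pan"
    "z": ("voiced", "alveolar", "fricative"),     # as in "zoo"
    # vowels: (height, backness, lip rounding)
    "i": ("close", "front", "unrounded"),         # as in "feed"
    "o": ("close-mid", "back", "rounded"),        # as in "foe"
}

# Example query: which articulatory features does /z/ have?
voicing, place, manner = PHONEME_FEATURES["z"]
print(f"/z/ is a {voicing} {place} {manner}")
```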
Such categorization of phonemes also helps us understand the linguistic behaviour of speakers when a sound is missing in their language [155, 163]. Generally, non-native speakers retain voicing and manner, but replace the place of articulation; for example, the consonant /D/ as in the English word “the” is a voiced dental fricative that does not exist in French, hence most native French speakers replace it with /z/, which is a voiced alveolar fricative [156]. Similarly, some dialects of Hindi do not have the phoneme /S/ as in “sheep”, which is a voiceless postalveolar fricative, hence they replace it with /s/ as in “sun”, which is a voiceless alveolar fricative.

Speech in the time domain. Sound is a pressure wave traveling through the air as the medium of propagation. It can be recorded by measuring the variation in pressure at a single point in space over time. As mentioned before, a microphone is used to record the acoustic wave: it measures the relative change in pressure as an electrical signal proportional to the pressure variation. Figure 2.1 shows the output of the microphone (for the word “privacy”), also called a waveform or a time-domain signal, pronounced by a male and a female speaker. The duration of both waveforms is shorter than one second. To represent a speech signal digitally, we must select the bit depth, which is the finite precision used to encode the amplitude values, and the sampling rate (denoted Fs), which defines how many times per second the actual waveform is sampled to obtain discrete amplitude values. The duration, the bit depth, and the sampling rate determine the memory required to store the audio file. The audio file can be stored in a lossless uncompressed format (e.g., “.wav”) or a lossy compressed format where the file size is reduced while maintaining good audibility (e.g., “.mp3”). A discrete-time speech signal can be represented as s, where s[n] denotes a single sample of instantaneous amplitude and n = 0, . . . , Ns − 1. As described further below, the speech signal is generally analyzed to determine its frequency components over short durations.
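As an illustration of these quantities, the sketch below (assuming a hypothetical mono 16-bit PCM recording named privacy.wav) reads a file with Python’s standard wave module, recovers the sampling rate Fs, the bit depth, and the samples s[n], and estimates the uncompressed storage requirement:

```python
import wave
import numpy as np

# Hypothetical example file; any uncompressed ".wav" recording would do.
with wave.open("privacy.wav", "rb") as f:
    fs = f.getframerate()               # sampling rate Fs (samples per second)
    bit_depth = 8 * f.getsampwidth()    # bits used to encode each amplitude value
    n_channels = f.getnchannels()       # 1 for mono, 2 for stereo
    n_frames = f.getnframes()           # Ns samples per channel
    raw = f.readframes(n_frames)

# Discrete-time signal s[n], n = 0, ..., Ns - 1 (assuming 16-bit PCM, mono).
s = np.frombuffer(raw, dtype=np.int16)

duration = n_frames / fs                                  # seconds
storage_bytes = n_frames * n_channels * bit_depth // 8    # uncompressed size
print(f"Fs = {fs} Hz, bit depth = {bit_depth} bits, "
      f"duration = {duration:.2f} s, ~{storage_bytes} bytes of raw audio")
```

At a typical sampling rate of 16 kHz and a 16-bit depth, one second of mono speech thus occupies about 32 kB before compression.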


Automatic speech recognition

Automatic speech recognition (ASR) aims to convert an utterance into its textual content, also called transcription. The output of ASR is used by natural language understanding systems that take speech as input, and is widely deployed in commercial applications ranging from cloud servers to mobile devices. We mentioned before that speech utterances are of varying duration and the system producing them also varies through time, hence they are processed as a sequence of T overlapping time frames of fixed duration. The input to ASR is a sequence represented as a matrix O = [o_1, . . . , o_T]^⊤ ∈ R^{T×A} of length T time frames, where o_t ∈ R^A are feature vectors derived from the speech signal, e.g., MFCCs or logmel spectra, and the output is the estimated word sequence Ŵ. This problem can be formulated as [330]:

Ŵ = argmax_W P(W | O).    (2.14)
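The sketch below illustrates this formulation (assuming the librosa library for MFCC extraction; the filename, the candidate hypotheses, and the scoring function are invented for illustration, with the scorer standing in for a real acoustic and language model):

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

# Build the input matrix O = [o_1, ..., o_T]^T of shape (T, A), where each
# row o_t is an MFCC feature vector for one overlapping time frame.
signal, fs = librosa.load("utterance.wav", sr=16000)   # hypothetical file
A = 13                                                 # MFCC coefficients per frame
O = librosa.feature.mfcc(y=signal, sr=fs, n_mfcc=A).T  # shape (T, A)

def score_hypothesis(features: np.ndarray, words: str) -> float:
    """Hypothetical stand-in for log P(W | O). A real ASR system would
    combine acoustic, pronunciation, and language model scores here
    (or use a single end-to-end neural network)."""
    rng = np.random.default_rng(abs(hash(words)) % 2**32)
    return float(rng.normal())

# Eq. (2.14): pick the word sequence W that maximizes P(W | O)
# over a (toy) set of candidate transcriptions.
hypotheses = ["privacy matters", "private sea matters", "privacy mutters"]
W_hat = max(hypotheses, key=lambda w: score_hypothesis(O, w))
print("estimated transcription:", W_hat)
```

In practice the search is not over a small fixed list but over an enormous space of word sequences, which is why ASR decoders rely on beam search and factorized acoustic and language models (or end-to-end neural architectures) rather than exhaustive enumeration.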

Table of contents:

List of figures
List of tables
List of acronyms
1 Introduction 
1.1 Motivation
1.2 Scope and objectives
1.3 Summary of contributions
1.4 Publications
1.5 Thesis structure
2 Background and Related Work
2.1 A brief historical overview of speech processing and privacy
2.2 Principles and tools of speech processing
2.2.1 Fundamentals of speech processing
2.2.2 Artificial neural networks
2.2.3 Automatic speech recognition
2.2.4 Speech synthesis
2.2.5 Automatic speaker recognition
2.3 Techniques to transform speaker information
2.3.1 Adversarial learning for speech
2.3.2 Speech transformation
2.3.3 Voice conversion
2.4 Machine learning based anonymization methods
2.5 The speaker anonymization task
2.6 Summary of techniques
3 Privacy Evaluation using Informed Attackers 
3.1 Attack model and the notion of attackers’ knowledge
3.2 Voice conversion methods
3.2.1 VoiceMask
3.2.2 VTLN-based voice conversion
3.2.3 Disentangled representation based voice conversion
3.3 Target selection strategies and exploitable parameters
3.3.1 Target selection strategies
3.3.2 Exploitable parameters
3.4 Performance metrics
3.4.1 Privacy measures
3.4.2 Utility measures
3.4.3 Comparison of privacy metrics
3.5 Experimental setup
3.5.1 Data and evaluation setup
3.5.2 Voice conversion settings
3.6 Experimental comparison with different attackers
3.7 Experimental comparison of privacy metrics
3.7.1 Exhibiting differences and blindspots through simulation
3.7.2 Evaluation on real anonymized speech
3.8 Summary
4 Adversarial Learning based Anonymization
4.1 Alternative ASR architecture
4.2 Proposed model
4.2.1 Baseline end-to-end ASR model
4.2.2 Speaker-adversarial model
4.3 Experimental setup
4.3.1 Data sets
4.3.2 Network architecture
4.3.3 Training
4.3.4 Evaluation metrics
4.4 Results and discussion
4.5 Summary
5 X-vector based Anonymization 
5.1 X-vector based voice conversion
5.2 The first VoicePrivacy challenge
5.2.1 Anonymization task
5.2.2 Data sets
5.2.3 Objective metrics
5.2.4 Anonymization baselines
5.2.5 Results
5.3 Design choices in x-vector space
5.3.1 Anonymization framework
5.3.2 Proposed design choices
5.3.2.1 Distance metric
5.3.2.2 Proximity
5.3.2.3 Gender selection
5.3.2.4 Assignment
5.3.3 Experimental setup
5.3.3.1 Data
5.3.3.2 Algorithm settings
5.3.3.3 Privacy evaluation
5.3.3.4 Utility evaluation
5.3.4 Results and discussion
5.3.4.1 Speaker’s perspective
5.3.4.2 User’s perspective
5.3.4.3 Attacker’s perspective
5.3.5 Pitch conversion
5.4 Large-scale speaker study
5.4.1 Data
5.4.2 Privacy evaluation metrics
5.4.3 Experimental setup
5.4.4 Average-case analysis
5.4.5 Worst-case analysis
5.5 Usability of anonymized speech data
5.6 Summary
6 Removing Residual Speaker Information—Towards Provable Guarantees 
6.1 Proposed approach
6.1.1 Overview
6.1.2 Differentially-private pitch extractor
6.1.3 Differentially-private BN extractor
6.2 Empirical validation
6.2.1 Experimental setup
6.2.2 Results and discussion
6.3 Summary
7 Conclusions and Perspectives 
References 
