Augmenting topologies applied to ASV 

Issues with current ASV status

In many practical use-cases, ASV technology is often seen by users as somewhat cumbersome and insecure, making it not worth the effort. This is in contrast with the non-intrusive nature of using speech as an interface, which is the main reason behind the increasingly wide adoption of voice-driven smart assistants. Both Amazon Echo and Google Home assistants have some speaker recognition capabilities but, as of today, they are not enabled by default and their use is confined to automatic user preference setting or to lowering the false acceptance rate for wake-up word detection. As a result, this feature is left mostly unused. If ASV were as mature as speech recognition on these kinds of devices, not only would its use be more publicised and widespread, but users would also use their voice as a means of seamless authentication in a secured environment. The road to this goal follows two complementary directions: efficiency and security.

Efficiency refers to the perspectives of both the implementation and the end user: for ASV to be deployed successfully on embedded devices, computationally- and memory-hungry approaches are incompatible; this is the case with complex deep neural network structures which require hundreds of thousands of multiplications (see Fig. I.1), often on top of feature extraction. Also, to preserve convenience, an efficient system should require very little user speech in order to operate reliably.

The meaning of security is application-dependent: the level of security required for an ASV-enabled parental control filter is different from that of an ASV system meant to control access to bank accounts. Nevertheless, any ASV functionality would be rendered useless if it could not distinguish between a real human and a recorded voice, a so-called replay attack. Fig. I.2 illustrates how the error rate of an ASV system can be severely degraded when replay attacks are used instead of genuine impostors. This and other types of artificial voice attacks show the need for countermeasures, which in this domain are referred to as anti-spoofing.

RSR2015 corpus

The RSR2015 corpus [37] is a short-utterance, text-dependent database. Its male-gender partition and the related protocols described below were used extensively for the work reported in Chapters III and IV. RSR2015 was released almost in tandem with the HiLAM system presented in Section II.3.5; in fact, most of the experimental work involving HiLAM was performed on this corpus [35, 45], which is accordingly distributed with protocols suited to the assessment of HiLAM-based text-dependent speaker verification systems. The RSR2015 database is one of the most versatile and comprehensive databases for such research. The particular speaker/part/session all-combinations structure illustrated in Fig. II.7 is what made RSR2015 well suited to the work reported in Chapters III and IV; the more recent RedDots corpus [46] does not reflect this structure. RSR2015 contains speech data collected from both male and female speakers and is partitioned into 3 evenly-sized subsets whose usual purpose is background modelling, experimental development and evaluation. Each subset is comprised of 3 parts: phonetically balanced sentences (part I), short commands (part II) and random digits (part III). Each part contains data collected over nine sessions. Three of these sessions are reserved for training while the remaining six are set aside for testing. The three training sessions are recorded using the same smart device, whereas the six testing sessions are recorded using two different smart devices (i.e. each user kept the same mobile phone or tablet for the training sessions while two others were used for the testing sessions, though the devices themselves differ between users).
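To make this structure concrete, the layout described above can be summarised as in the following sketch; the identifiers are illustrative and do not reflect the corpus's actual file naming scheme.

    # Illustrative summary of the RSR2015 layout; names are hypothetical.
    SUBSETS = ("background", "development", "evaluation")  # evenly sized
    PARTS = {
        1: "phonetically balanced sentences",
        2: "short commands",
        3: "random digits",
    }
    N_SESSIONS = 9          # per speaker and part
    N_TRAIN_SESSIONS = 3    # recorded on a single device per speaker
    N_TEST_SESSIONS = 6     # recorded on two further devices per speaker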

Training

When used in conjunction with the HiLAM system, the background speakers are used solely to build the UBM. Not all of the data is used, though: in order to avoid both speaker and text-content overlap, the UBM should be built without data from the part that will be used to train the other layers. For example, if the end goal is to test speaker recognition accuracy on phonetically balanced pass-phrases (part I), the UBM is first built using background data from the other parts only.

Table II.2: The four different trial types used to assess the performance of a text-dependent speaker verification system. They involve different combinations of matching speakers and text. A trial should be accepted only when both match.

                            Matching text    Non-matching text
    Matching speaker        accept           reject
    Non-matching speaker    reject           reject
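As a small illustration of the data selection rule above (a hypothetical helper, not part of the thesis code):

    def ubm_training_parts(target_part: int, all_parts=(1, 2, 3)):
        """Return the parts usable for UBM training when `target_part`
        provides the text content for the upper layers, so that neither
        speakers nor text overlap between UBM and enrolment data."""
        return [p for p in all_parts if p != target_part]

    # e.g. testing on part I pass-phrases: ubm_training_parts(1) -> [2, 3]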

HiLAM baseline implementation

This section describes our implementation of the HiLAM system that forms the baseline for the work reported here. The system architecture is illustrated in Fig. III.1. Also presented are results for our specific implementation assessed using the RSR2015 database.

Preprocessing and feature extraction

Silence removal is first applied to raw speech signals sampled at 16 kHz. This is performed according to ITU-T recommendation P.56 (http://www.itu.int/rec/T-REC-P.56-201112-I/en), which specifies an active speech level of 15.9 dB. In our implementation this results in the removal of approximately 36% of the original data. The remaining 64% is then framed in blocks of 20 ms with 10 ms overlap. The feature extraction process is standard and results in 19 static Mel frequency cepstral coefficients (MFCCs) without energy (C0). These are appended with delta and double-delta coefficients, resulting in feature vectors of 57 dimensions.
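For illustration, a pipeline of this kind can be approximated with librosa; this is a minimal sketch under stated assumptions, not the toolkit used for the thesis, and it omits the ITU-T P.56 silence removal step.

    import numpy as np
    import librosa

    def extract_features(wav_path, sr=16000):
        """57-dimensional features: 19 static MFCCs + delta + double-delta."""
        y, _ = librosa.load(wav_path, sr=sr)
        # 20 ms frames with a 10 ms shift: 320 and 160 samples at 16 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                    n_fft=320, hop_length=160)
        mfcc = mfcc[1:]  # drop C0 (energy), keeping 19 static coefficients
        delta = librosa.feature.delta(mfcc, order=1)
        delta2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, delta, delta2]).T  # shape: (frames, 57)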

GMM optimisation

The number of Gaussian components is empirically optimised. The literature shows that higher values (512-2048) are often used for text-independent tasks [20, 56] or with systems based on i-vector and PLDA techniques [51, 57]. In contrast, lower values (128-256) are typically used for text-dependent tasks and techniques such as HiLAM [24, 37]. We obtained the best performance with 64 Gaussian components.
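As an illustration only (scikit-learn rather than the toolkit actually used), fitting a UBM with a given number of components on pooled background features might look as follows:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_ubm(background_features: np.ndarray,
                  n_components: int = 64) -> GaussianMixture:
        """Fit a diagonal-covariance UBM on pooled background frames
        (shape: total_frames x 57); n_components is swept empirically."""
        ubm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", max_iter=200)
        ubm.fit(background_features)
        return ubm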

Relevance factor optimisation

Concerning the relevance factor of MAP adaptation (see Section II.3.2 and Eq. II.4), the best performance is delivered with a comparatively higher value for the first adaptation stage and a lower value for the second. More precisely, the first-stage relevance factor was set to 19, still inside what is considered to be the "insensitive" interval (8-20) according to the literature [20]; the second-stage factor was set to 3, as lower values are usually better suited to text-dependent tasks [58]. This means that during both adaptation stages, the middle-layer model/data are given less weight. At each MAP adaptation stage, the new weight, mean and variance estimates share the same relevance factor (see Equations II.5, II.6 and II.7).
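A minimal mean-only sketch of relevance-factor MAP adaptation in the style of Eq. II.4, assuming a scikit-learn-like UBM object; the thesis also adapts weights and variances (Eqs. II.5-II.7), omitted here for brevity:

    import numpy as np

    def map_adapt_means(ubm, frames: np.ndarray, relevance: float) -> np.ndarray:
        """Adapt UBM means towards `frames` with relevance factor r:
        alpha_i = n_i / (n_i + r); mu_i = alpha_i * E_i[x] + (1 - alpha_i) * mu_i_ubm."""
        gamma = ubm.predict_proba(frames)               # (T, C) responsibilities
        n = gamma.sum(axis=0)                           # soft counts n_i
        ex = (gamma.T @ frames) / np.maximum(n, 1e-10)[:, None]  # E_i[x]
        alpha = (n / (n + relevance))[:, None]          # adaptation coefficients
        return alpha * ex + (1.0 - alpha) * ubm.means_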
Scoring can be obtained by calculating the probability of the observation sequence given the HMM, as explained in Section II.3.3, but the best results were obtained by averaging the log-likelihood ratios (between the claimed text-dependent speaker model and the UBM) across the five states.
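Under the same assumptions as above, this scoring rule could be sketched as follows; frame-to-state alignment is glossed over (all frames are scored against every state GMM), so this is only an approximation of the actual HMM scoring:

    import numpy as np

    def hilam_score(state_gmms, ubm, frames: np.ndarray) -> float:
        """Average, over the five HMM states, of the mean per-frame
        log-likelihood ratio between each state GMM and the UBM."""
        ubm_ll = ubm.score_samples(frames)   # per-frame log-likelihoods
        llrs = [np.mean(g.score_samples(frames) - ubm_ll) for g in state_gmms]
        return float(np.mean(llrs))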

Middle-layer training reduction

In order to assess the necessity of text-independent enrolment, a first sequence of experiments was conducted in which the number of text-independent utterances used for layer-two training was successively subsampled according to the configurations explained in Section III.2. Secondly, the middle layer was trained with the exact same data later used to adapt the third layer; with this configuration the middle layer is text-dependent and the number of middle-layer models equals the number of third-layer HMM models. While a somewhat questionable choice, this configuration requires just three repetitions of the desired pass-phrase at enrolment time while keeping the 3-layer structure, and acts as an intermediate step towards the complete removal of the middle layer.

Table of contents:

Abstract
Acknowledgements
I Introduction 
I.1 Speaker recognition terminology
I.2 Issues with current ASV status
I.3 Contributions and publications
I.4 Thesis structure
I.4.1 Part A
I.4.2 Part B
II A review of traditional speaker verification approaches 
II.1 Speech as a biometric
II.2 Front-end: speaker features
II.2.1 Short-term features
II.2.2 Longer-term features
II.3 Back end: models and classifiers
II.3.1 Gaussian Mixture Models
II.3.2 GMM-UBM
II.3.3 Hidden Markov Models
II.3.4 Towards i-vectors
II.3.5 The HiLAM system
II.4 Performance Metrics
II.4.1 Receiver Operating Characteristic (ROC)
II.4.2 Equal Error Rate (EER)
II.4.3 Score normalisation
II.5 Challenges and Databases
II.5.1 NIST Speaker Recognition Evaluations
II.5.1.a The early years
II.5.1.b Broader scope and higher dimensionality
II.5.1.c Bi-annual big data challenges
II.5.1.d SRE16
II.5.2 RSR2015 corpus
II.5.2.a Training
II.5.2.b Testing
II.6 Summary
III Simplified HiLAM
III.1 HiLAM baseline implementation
III.1.1 Preprocessing and feature extraction
III.1.2 GMM optimisation
III.1.3 Relevance factor optimisation
III.1.4 Baseline performance
III.2 Protocols
III.3 Simplified HiLAM
III.3.1 Middle-layer training reduction
III.3.2 Middle layer removal
III.4 Evaluation Results
III.5 The Matlab demo
III.6 Conclusions
IV Spoken password strength 
IV.1 The concept of spoken password strength
IV.2 Preliminary observations
IV.2.1 The text-dependent shift
IV.2.2 The text-dependent overlap
IV.3 Database and protocols
IV.4 Statistical analysis
IV.4.1 Variable strength command groups
IV.4.2 Sampling distribution of the EER
IV.4.3 Isolating the influence of overlap
IV.5 Results interpretation
IV.6 Conclusions
V A review of deep learning speaker verification approaches 
V.1 Neural networks and deep learning
V.1.1 Deep Belief Networks
V.1.2 Deep Auto-encoders
V.1.3 Convolutional Neural Networks
V.1.4 Long short-term Memory Recurrent Neural Networks
V.2 Deep learning in ASV
V.2.1 Feature extraction
V.2.2 Applications to i-vector frameworks
V.2.3 Back-ends and classifiers
V.3 End-to-end
V.3.1 Middle-level representations vs raw audio
V.3.2 Fixed topologies
V.4 Summary
VI Augmenting topologies applied to ASV
VI.1 Evolutionary strategies
VI.1.1 TWEANNs
VI.1.2 NEAT
VI.2 Application to raw audio classification
VI.3 Truly end-to-end automatic speaker verification
VI.3.1 Fitness function
VI.3.2 Mini-batching
VI.3.3 Training
VI.3.4 Network selection for evaluation
VI.4 Experiments
VI.4.1 Baseline systems
VI.4.2 NXP database and experimental protocols
VI.4.3 End-to-end system: augmentation and generalisation
VI.5 Further experiments: End-to-end system on NIST SRE16 data
VI.6 Conclusions
VII Augmenting topologies applied to anti-spoofing
VII.1 A brief overview of anti-spoofing
VII.2 NEAT setup
VII.2.1 Ease of classification
VII.2.2 Training
VII.2.3 Testing
VII.3 Experimental setup
VII.3.1 Database, protocol and metric
VII.3.2 Baseline systems
VII.3.3 End-to-end anti-spoofing
VII.4 Experimental results
VII.4.1 Evolutionary behaviour
VII.4.2 Spoofing detection performance
VII.5 Conclusions
VIII Conclusions
VIII.1 From the laboratory into the wild
VIII.2 Not all sentences are created equal
VIII.3 Truly end-to-end ASV
VIII.4 Truly end-to-end anti-spoofing
VIII.5 Closing thoughts and future work
Appendix A Published work 
Bibliography
