Speaker change detection with contextual information 


Voice activity detection

Voice activity detection (VAD), also referred to as speech activity detection (SAD), performs the task of segmenting an audio stream into speech and non-speech content. Non-speech content refers to silence and background noises coming from the environment, e.g. clatter, clapping, laughter or music. This kind of content is considered non-informative for tasks related to voice biometrics. In speaker recognition, non-speech content is assumed to affect the robustness of the solutions. The same applies to SD where, in addition, the subsequent clustering step may be widely affected by nuisances in the form of non-speech content. A precise VAD is consequently of great importance for the SD task and its performance. The value of detecting speech content and not misclassifying it as non-speech is obvious: (i) performance is degraded by default in an unrecoverable manner, and (ii) the missed speech forces whichever speaker modelling technique comes next in the SD pipeline to operate on less speech data, leading to diminished robustness and a clustering more prone to errors. A similar logic applies to the incorrect classification of non-speech as speech. Non-speech audio segments contaminate the clustering process by misguiding the merging/splitting steps involved, deteriorating results.
In terms of algorithms, two main kinds of approaches are considered for VAD. The first is based on the energy levels present in the signal. These algorithms are considerably accurate in controlled scenarios in which signal energy levels remain within consistent ranges. This is a somewhat acceptable constraint in some speaker recognition scenarios such as telephony [3], where speakers are expected to remain close to the microphone and background noises are limited in variety. Such a constraint cannot, however, be guaranteed in SD audio files, which are characterised by being considerably lengthier than their speaker recognition counterparts and by spanning a variety of domains, e.g. broadcast news or meeting environments. A much wider variability is thus present, limiting the applicability of energy-based methods in the field. Most approaches to VAD are consequently not energy-based. On the contrary, thanks to the relatively easy labelling task of this 2-class problem, large amounts of training data are readily available, motivating the success of model-based approaches to VAD, i.e. a model is trained beforehand on background data in order to discern between speech and non-speech content. Traditional methods have relied on Gaussian mixture models (GMMs) [36,37], with proposed modifications allowing them to adapt iteratively over test data [38]. These GMMs are usually trained to represent speech and non-speech acoustic classes, although in the presence of richer labelling in the training data which includes sub-classes of non-speech, e.g. music or noise, extra classes may also be considered. The assignment of classes to the acoustic features is performed by means of Viterbi decoding [39].
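As an illustration, an energy-based VAD of the kind described above can be sketched as follows. This is a minimal sketch, not any of the cited systems: the function name, frame sizes and the relative threshold are all illustrative assumptions.

```python
import numpy as np

def energy_vad(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Label each frame as speech (True) when its log-energy exceeds
    a threshold relative to the file's peak frame energy."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energies = np.array([
        np.sum(signal[i * hop:i * hop + frame] ** 2) + 1e-12
        for i in range(n_frames)
    ])
    log_e = 10.0 * np.log10(energies)
    # The threshold is relative to the loudest frame, which is exactly why
    # such methods only work when energy levels stay within consistent ranges.
    return log_e > (log_e.max() + threshold_db)
```

The relative threshold makes the limitation discussed above concrete: a distant speaker or a loud background event shifts the reference level and breaks the decision rule.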
Model-based VAD methods do, however, suffer when facing domain mismatches. Robustness against such mismatches has improved considerably thanks to modern methods leveraging developments in deep learning (DL), achieving state-of-the-art performance. Many different architectures have been proposed, including variations of feed-forward DNNs [40], convolutional neural networks (CNNs) [41,42], and long short-term memory (LSTM) neural networks [43,44]. A VAD system following the approach introduced in [44] is used in some of the work reported in Chapters 7, 8 and 9.

Segmentation and speaker change detection

Following the pipeline of the traditional SD system comes the speaker segmentation and/or speaker change detection (SCD) module. Given that SD operates upon multi-speaker audio streams, it seems reasonable to perform some sort of pre-clustering processing that separates speech from different speakers into speaker-homogeneous segments. Considering that a whole chapter of this thesis is devoted to this task (see Chapter 5), this section is deliberately brief regarding the justification and implications of SCD. However, a few methods of interest for the reading of Chapter 5 are discussed here.
Approaches to SCD may be divided with regard to their dependence on external training data. The simplest approaches rely on implicit segmentation. An example is VAD-based segmentation, in which the output of the VAD system is assumed to separate speech into single-speaker fragments. Such an approach may be an option for conversational speech guided by a very clear structure in which interruptions between speakers do not happen and speech turns are respected. A second implicit approach is to segment the speech regions produced by the VAD system into shorter, overlapping segments of fixed length. This approach relies on sufficiently short speech segments having been uttered by a single speaker. Whilst errors are very likely to occur by operating this way, the hope is that subsequent clustering and/or resegmentation steps will correct them. More elaborate training-independent methods perform explicit SCD relying on the computation of distance metrics. These measure the similarity between the speech content contained in two adjacent sliding windows. Speaker change points are hypothesised when the resulting score surpasses a certain empirically optimised threshold. A few examples follow:
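The sliding-window, threshold-based scheme just described can be sketched as follows. The divergence used here (a symmetric KL-style distance between diagonal Gaussians fitted to each window) and all window/hop/threshold values are illustrative assumptions, standing in for whichever metric a given system employs.

```python
import numpy as np

def detect_changes(features, win=100, hop=10, threshold=1.5):
    """Hypothesise a change point wherever the distance between two
    adjacent sliding windows exceeds an empirically set threshold."""
    changes = []
    t = win
    while t + win <= len(features):
        left, right = features[t - win:t], features[t:t + win]
        m1, v1 = left.mean(0), left.var(0) + 1e-6
        m2, v2 = right.mean(0), right.var(0) + 1e-6
        # symmetric divergence between diagonal Gaussians fitted per window
        d = 0.5 * np.sum(v1 / v2 + v2 / v1 - 2
                         + (m1 - m2) ** 2 * (1 / v1 + 1 / v2))
        if d > threshold:
            changes.append(t)
        t += hop
    return changes
```

In practice, consecutive above-threshold frames are usually collapsed into a single hypothesised change point at the local maximum of the distance curve.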
Bayesian information criterion (BIC): The BIC was originally introduced in [45] as a metric to estimate how well a model represents the data on which it has been trained, by means of a likelihood criterion penalised by model complexity. Given a set of N data points X, and a model M fitted to represent X, the BIC of M is defined as:
BIC(M) = log(L(X, M)) − λ (1/2) #(M) log(N) (2.1).
where log(L(X, M)) is the log-likelihood of the data X given the model M, λ is a data-dependent penalty weight, and #(M) denotes the number of parameters in the model M.
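As an illustration of Equation 2.1, the BIC of a single full-covariance Gaussian fitted to a data matrix can be computed as follows. This is a sketch under that single-Gaussian assumption; the small term added to the covariance is purely for numerical stability.

```python
import numpy as np

def gaussian_bic(X, lam=1.0):
    """BIC of one full-covariance ML Gaussian fitted to X (Equation 2.1):
    log-likelihood minus lam * (1/2) * #params * log(N)."""
    N, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    # At the ML fit the Mahalanobis trace term equals N*d, giving:
    loglik = -0.5 * N * (d * np.log(2 * np.pi) + logdet + d)
    n_params = d + d * (d + 1) / 2  # mean + symmetric covariance entries
    return loglik - lam * 0.5 * n_params * np.log(N)
```

Raising λ strengthens the complexity penalty, which is how the criterion is tuned empirically per domain.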


The use of BIC for SCD was proposed in [46], where the task is regarded as a hypothesis test on the content of the adjacent sliding windows (here represented by Xi and Xj) being analysed. In this context, the hypothesis H0 denotes that both speech segments belong to a single speaker and that there is no change point between Xi and Xj. Alternatively, H1 indicates that the segments belong to different speakers, suggesting a speaker change point. A speaker change point is thus hypothesised using the increment ΔBIC between BIC(H1) and BIC(H0) so that:
ΔBIC = BIC(H1) − BIC(H0) = R(i, j) − λP (2.2).
where R(i, j) denotes the difference between the log-likelihoods of the two hypotheses and P is the complexity penalty term. Full details may be found in [46].
A common approach is to model the hypothesised speech segments with GMMs, turning Equation 2.2 into:
ΔBIC = (log(L(Xi, Mi)) + log(L(Xj, Mj))) − log(L(X, M)) − λ (1/2) Δ#(i, j) log(N) (2.3).
where Mi and Mj are the models fitted to the individual windows Xi and Xj, M is the model fitted to their union X, and Δ#(i, j) is the difference in the number of parameters between the two hypotheses.
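Equation 2.3 can be illustrated with single full-covariance Gaussians standing in for the GMMs, a simplification often used for BIC-based SCD. Under the H1 − H0 convention of Equation 2.2, positive values favour H1, i.e. a speaker change.

```python
import numpy as np

def _gauss_loglik(X):
    """Log-likelihood of X under a single full-covariance ML Gaussian."""
    N, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet + d)

def delta_bic(Xi, Xj, lam=1.0):
    """Equation 2.3 with single Gaussians in place of GMMs:
    positive values suggest a change point between Xi and Xj."""
    X = np.vstack([Xi, Xj])
    N, d = X.shape
    n_params = d + d * (d + 1) / 2   # parameters of one Gaussian
    # H1 fits two models where H0 fits one, so Delta# equals n_params here
    penalty = lam * 0.5 * n_params * np.log(N)
    return (_gauss_loglik(Xi) + _gauss_loglik(Xj)) - _gauss_loglik(X) - penalty
```

Splitting the data always increases the total log-likelihood slightly, which is exactly what the Δ#(i, j) penalty compensates for.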

Hierarchical clustering

Hierarchical methods of clustering rely on some sort of initialisation of speaker clusters, e.g. random initialisation, segment-level initialisation, or speaker segmentation based initialisation, and operate upon iterative, nested operations of merging and splitting of clusters. A thorough analysis of the differences between hierarchical approaches is given in [100]; its two main variants are introduced as follows:
Divisive hierarchical clustering: A first, less common approach is divisive hierarchical clustering (DHC), which follows a top-down, general-to-specific approach to speaker clustering. A single cluster (or alternatively a small number of clusters) is used as the seed for the clustering process and is iteratively split into smaller speaker clusters until a stopping criterion is met and the diarization output is fixed. Examples of such systems are [101,102,103].
Agglomerative hierarchical clustering: Bottom-up agglomerative hierarchical clustering (AHC) approaches are more commonly used nowadays in SD systems. They operate in the opposite manner to DHC, applying a specific-to-general methodology which allows initial clusters to be ideally purer from the first iteration. A cluster-to-cluster similarity matrix is computed at each pass of the AHC algorithm to decide which clusters to merge before continuing with the iterative process. This approach to clustering is straightforward: merging stops when the similarity between clusters falls below an empirically optimised threshold, which determines the final number of speakers in a session. Applications of AHC to SD are thus abundant, and continue to provide state-of-the-art performance in some acoustic domains [99]. Whilst easy to put into practice, it may be argued that AHC incurs a greedy sort of decision-making by not necessarily allowing the re-assignment of segments to clusters at every iteration. This may be addressed by means of a simple segment-to-cluster re-assignment operation at the beginning of every clustering step, an approach used in [104] and [105].
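A threshold-stopped AHC pass of the kind described above can be sketched as follows, using cosine similarity between cluster-mean embeddings. The similarity metric and stopping value are illustrative assumptions; real systems typically score i-vectors or neural embeddings with e.g. PLDA.

```python
import numpy as np

def ahc(segments, stop_threshold):
    """Plain AHC over segment embeddings: repeatedly merge the two most
    similar clusters until the best similarity falls below the threshold,
    which simultaneously fixes the number of speakers."""
    clusters = [[i] for i in range(len(segments))]

    def similarity(a, b):
        # cosine similarity between cluster-mean embeddings
        ca = np.mean([segments[i] for i in a], axis=0)
        cb = np.mean([segments[i] for i in b], axis=0)
        return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

    while len(clusters) > 1:
        pairs = [(similarity(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < stop_threshold:
            break  # no pair similar enough: clustering finished
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The greediness discussed above is visible here: once two segments share a cluster, nothing in this loop ever moves them apart again.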
A slightly different approach to AHC worth mentioning is that applied to SD via the information bottleneck (IB) approach [106,107] in [108], which operates upon acoustic features directly by using a non-parametric framework derived from the Rate-Distortion theory [109].


Bayesian analysis

Another branch of research worth noting here is that which explores Bayesian analysis for the task of SD. A first application of variational Bayes (VB) to SD was proposed in [110], and further developed in [111,112] by leveraging eigenvoice priors for VB inference. In parallel, non-parametric Bayesian diarization solutions were also proposed by combining hierarchical Dirichlet processes (HDP) [113] with HMMs [114], applied to SD in [115]. Combining these two lines of work, the authors of [116] recently reported an enhanced version of the system in [111] which incorporates the HMMs of [115] to model speaker transitions. The resulting work has achieved state-of-the-art SD performance in various domains [116,117].

Resegmentation

A last, optional module in the SD pipeline is that of refining the boundaries generated by the clustering algorithm and/or re-including short segments which may have been removed to make clustering more robust [129]. A traditional approach to resegmentation is an ergodic HMM in which speakers, modelled by means of GMMs, are used as HMM state distributions. Viterbi alignment is then computed to obtain the final result. Alternative resegmentation methods have however gained importance in recent years thanks to enhancements proposed in the literature. VB inference methodologies such as that of [116], introduced above, have been extensively used as a resegmentation method following a first clustering solution derived from i-vectors/speaker embeddings, yielding notable performance improvements [7,99,130]. Neural networks have similarly allowed for refined boundary definitions. In [131], an initial IB diarization system is applied only to generate speaker pseudo-labels which are then used to train an artificial neural network (ANN) capable of enhancing the speaker-discriminative capacity of acoustic features. A similar approach is successfully used in [132] by means of LSTM networks.
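The traditional ergodic-HMM resegmentation just described can be sketched as follows, with a single diagonal Gaussian per speaker standing in for the GMM state distributions. The self-loop probability is an illustrative smoothing choice that discourages over-frequent speaker changes.

```python
import numpy as np

def viterbi_reseg(features, labels, self_loop=0.99):
    """One HMM state per speaker, fitted from the initial labels,
    then Viterbi realignment of every frame."""
    speakers = sorted(set(labels))
    K, T = len(speakers), len(features)
    labels = np.asarray(labels)
    # per-state diagonal-Gaussian emission log-likelihoods
    ll = np.empty((T, K))
    for k, spk in enumerate(speakers):
        X = features[labels == spk]
        mu, var = X.mean(0), X.var(0) + 1e-6
        ll[:, k] = -0.5 * np.sum(np.log(2 * np.pi * var)
                                 + (features - mu) ** 2 / var, axis=1)
    log_trans = np.full((K, K), np.log((1 - self_loop) / max(K - 1, 1)))
    np.fill_diagonal(log_trans, np.log(self_loop))
    # Viterbi decoding with backtracking
    delta, psi = ll[0].copy(), np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: prev i -> cur j
        psi[t] = scores.argmax(0)
        delta = scores.max(0) + ll[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [speakers[k] for k in reversed(path)]
```

Because the states are re-fitted from the clustering output and every frame is then realigned, isolated labelling errors tend to be absorbed by the stronger of the two competing speaker models.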

Speaker diarization

The most common metric for evaluating diarization performance is the diarization error rate (DER). This scoring method, originally proposed in the context of the National Institute of Standards and Technology (NIST) Rich Transcription (RT) evaluations [133], is the one used in the experimental results reported in this thesis. The DER accounts for errors derived from the VAD, segmentation and clustering stages of the SD pipeline, and is defined as:
DER = Espk + EFA + Emiss + EOV (2.13).
where Espk is the speaker confusion error, EFA the false alarm error (non-speech classified as speech), Emiss the missed speech error, and EOV the error due to overlapped speech.
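For illustration, a frame-level version of the DER restricted to single-speaker frames can be sketched as follows. This sketch ignores the overlap term EOV and omits the optimal reference-to-hypothesis speaker mapping and scoring collars applied by the official NIST tooling; the function name and the non-speech symbol are assumptions.

```python
def frame_der(ref, hyp, nospeech="-"):
    """Frame-level DER sketch: miss, false alarm and speaker-confusion
    errors, normalised by the total amount of reference speech."""
    assert len(ref) == len(hyp)
    miss = sum(r != nospeech and h == nospeech for r, h in zip(ref, hyp))
    fa = sum(r == nospeech and h != nospeech for r, h in zip(ref, hyp))
    conf = sum(r != nospeech and h != nospeech and r != h
               for r, h in zip(ref, hyp))
    speech = sum(r != nospeech for r in ref)
    return (miss + fa + conf) / speech
```

Note the normalisation: because the denominator is reference speech time only, false alarms over long non-speech stretches can push the DER above 100%.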

Table of contents:

List of Figures
List of Tables
1 Introduction 
1.1 Domain robust and efficient speaker diarization
1.2 Low-latency speaker spotting
1.3 Contributions and thesis outline
2 Literature review 
2.1 Acoustic feature extraction
2.2 Voice activity detection
2.3 Segmentation and speaker change detection
2.4 Speaker modelling
2.5 Clustering
2.6 Resegmentation
2.7 Evaluation and metrics
2.7.1 Speaker diarization
2.7.2 Speaker recognition
2.8 Summary
I Domain robust and efficient speaker diarization 
3 Binary key speaker modelling: a review 
3.1 Introduction
3.2 Binary key background model
3.2.1 Speaker recognition
3.2.2 Speaker diarization
3.3 Feature binarization
3.4 Segmental representations
3.5 Similarity metrics for binary key speaker modelling
3.6 Recent improvements and use cases
3.6.1 Recent improvements for speaker recognition
3.6.2 Recent improvements for speaker diarization
3.6.3 Other applications
3.7 Baseline system for speaker diarization
3.8 Summary
4 Multi-resolution feature extraction for speaker diarization 
4.1 Introduction
4.2 Spectral analysis
4.2.1 Short-time Fourier transform
4.2.2 Multi-resolution time-frequency spectral analysis
4.3 Proposed analysis
4.4 Experimental setup
4.4.1 Database
4.4.2 Feature extraction
4.4.3 BK speaker modelling configuration
4.4.4 In-session speaker recognition
4.4.5 Speaker diarization experiment
4.5 Results
4.5.1 Speaker recognition
4.5.2 Speaker diarization
4.6 Summary
5 Speaker change detection with contextual information 
5.1 Introduction and related work
5.2 The KBM as a context model
5.2.1 KBM composition methods
5.3 BK-based speaker change detection
5.4 Experimental setup
5.4.1 Database
5.4.2 Baseline SCD system
5.4.3 Binary key SCD system
5.4.4 Evaluation metrics
5.5 Results
5.5.1 SCD using cumulative vectors
5.5.2 SCD using binary keys
5.5.3 Comparison between BK-based SCD systems
5.5.4 Speaker diarization using a BK-based SCD
5.6 Summary
6 Leveraging spectral clustering for training-independent speaker diarization
6.1 Context and motivation
6.2 The first DIHARD challenge
6.3 An analysis of our baseline
6.3.1 The baseline system
6.3.2 Experiments and results
6.3.3 Identifying the baseline strengths & weaknesses
6.4 Spectral clustering
6.4.1 Introduction and motivation
6.4.2 Spectral clustering and BK speaker modelling
6.4.3 Single-speaker detection
6.5 Experimental setup
6.5.1 Dataset
6.5.2 Feature extraction
6.5.3 KBM and cumulative vector parameters
6.5.4 Clustering parameters
6.5.5 Evaluation
6.6 Results
6.6.1 Spectral clustering upon CVs
6.6.2 Spectral clustering as a number-of-speakers estimator
6.6.3 Evaluation of the single-speaker detector
6.6.4 Domain-based performance
6.6.5 Results in the official DIHARD classification
6.7 Summary
7 System combination 
7.1 Motivation and context
7.2 Baseline system modules
7.2.1 Feature extraction
7.2.2 Voice activity detection and segmentation
7.2.3 Segment/cluster representation
7.2.4 Clustering
7.2.5 Resegmentation
7.3 Fusion
7.3.1 Fusion at similarity-matrix level
7.3.2 Fusion at the hypothesis level
7.4 Experimental setup
7.4.1 Training data
7.4.2 Development data
7.4.3 Modules configuration
7.5 Results
7.5.1 Closed-set condition
7.5.2 Open-set condition
7.5.3 Conclusions and results in the challenge
7.6 Summary
II Low-latency speaker spotting 
8 Speaker diarization: integration within a real application 
8.1 Introduction
8.2 Related work
8.3 Low-latency speaker spotting
8.3.1 Task definition
8.3.2 Absolute vs. speaker latency
8.3.3 Detection under variable or fixed latency
8.4 LLSS solutions
8.4.1 Online speaker diarization
8.4.2 Speaker detection
8.5 LLSS assessment
8.5.1 Database
8.5.2 Protocols
8.6 Experimental results
8.6.1 LLSS performance: fixed latency
8.6.2 LLSS performance: variable latency
8.6.3 Diarization influences
8.7 Summary
9 Selective cluster enrichment 
9.1 Introduction
9.2 Selective cluster enrichment
9.3 Experimental work
9.3.1 General setup
9.3.2 Results
9.4 Summary
10 Conclusions 
10.1 Summary
10.2 Directions for future research

