Bregman divergences and exponential families

Get Complete Project Material File(s) Now! »

Bregman divergences and exponential families

Although K-means provides a simple and eﬀective clustering algorithm, it lacks the flexibility of a probabilistic model and the ability to have soft-assignments of a data point to multiple clusters. Mixture models such as Gaussian mixture models are widely used to address these issues, and Banerjee et al. (2005b) show how one can use such models with Bregman divergences by establishing a bijection between Bregman divergences and exponential families (with some restrictions) and using the corresponding exponential family model as the emission distribution.
An exponential family distribution with (natural) parameter θ ∈ Θ ⊂ Rp and suﬃcient statistic φ(x) ∈ Rp is given by the density pθ (x) = exp( φ(x), θ − a(θ)), (2.10) with respect to some measure ν on the input space X (see, e.g., Wainwright and Jordan (2008); Barndorﬀ-Nielsen (1978)). The log-partition (or cumulant) function a ensures normalization and is given by a(θ) = log X exp( φ(x), θ )ν(dx). The base measure ν is usually of the form ν (dx) = h(x)λ(dx), where λ is either the Lebesgue measure (for continuous distributions) or the counting measure (for discrete distributions) and h(x) is sometimes called ancillary statistic. Most commonly used distributions, including normal, exponential, gamma, multinomial (with fixed number of trials), and many others, belong to this family. The suﬃcient statistic φ(x) is said to be minimal if there exists no constant c such that φ(x), θ = c ν-almost everywhere. An important property that we recall is that the derivatives of the log-partition function and the moments of the distribution are linked, and we have in particular ∇a(θ) = EX ∼pθ [φ(X )] (2.11)
In order to establish a bijection with Bregman divergence, we follow Banerjee et al. (2005b) and consider a restricted set of such distributions.
Definition 2.9 (regular exponential family). A regular exponential family distribution is an exponential family distribution for which the parameter space Θ is open and the suﬃcient statistic x ∈ X = Rp gives a minimal representation. Its density w.r.t. the Lebesgue or counting measure takes the form
pψ,θ (x) = h(x) exp( x, θ − ψ(θ)), (2.12) for x ∈ Rp, where we denote by ψ the log-partition function.
The log-partition function is always convex, and with the additional properties of this defini-tion, one can show that ψ and its conjugate ψ∗ satisfy Legendre-duality properties as in §2.2.2 (see Banerjee et al. (2005b); Barndorﬀ-Nielsen (1978) for details). If we write µ = ∇ψ(θ) – which corresponds to the mean of the distribution from (2.11) –, then θ = ∇ψ∗ (µ), and using (2.5), we have
pψ,θ (x) = h(x) exp( x, θ − ψ(θ))
= h(x) exp( x, θ − µ, θ + ψ∗ (µ))
= h(x) exp(ψ(x) − ψ(x) + ψ∗ (µ) + x − µ, ∇ψ∗ (µ) )
= h′ (x) exp(−Dψ∗ (x, µ)),
with h′ (x) = h(x) exp(ψ(x)). Therefore, a regular exponential familiy can be expressed in terms of the Bregman divergence associated to the conjugate of the log-partition function. Banerjee et al. (2005b) show that the divergences obtained this way belong to a special family which they call regular Bregman divergences, and establish a bijection between regular exponential families and regular Bregman divergences.
In practice, this means that when Dψ is a regular Bregman divergence and we are considering distortions Dψ (•, µ) to a centroid µ, we can rely on the corresponding regular exponential family distribution pψ∗ ,θ with θ = ∇ψ(µ) and with log-partition function ψ∗ , given by the density pψ∗ ,θ (x) = h(x) exp(−Dψ (x, µ)) (2.13)
in a probabilistic context. Note that the mean of this distribution is µ, and the maximum likelihood estimate of µ given observations x1 , , x n is given by µˆ = n−1 i xi, which is the same as computing the right-type Bregman centroid. It is hence a natural choice of distribution to be used as an emission/observation distribution in a mixture model.
We now give two examples of one-to-one correspondences given by this bijection, follow-ing Banerjee et al. (2005b).
Example 2.10. If we consider a Gaussian distribution with mean µ and fixed covariance σ2 I , taking the conjugate of the log-partition function gives ψ(x) = x 2 /2σ2 and thus the corre-sponding Bregman divergence is Dψ (x, µ) = x − µ 2 /2σ2 , as we might have expected.
Example 2.11. In the case of a multinomial distribution with fixed number of trials N , we need to make sure to use a minimal representation in order to establish the correspondence.
We can use x = [xj ]j∈{1,…,p−1} as the minimal suﬃcient statistic, where p j=1 xj = N , and θ = [log q j /q ] j∈{1,…,p−1} as the natural parameter (which lies in the open set Rp−1 ), where qj ≥ 0 are the event probabilities and are such that j=1 qj = 1. The log-partition function is then ψ∗ (θ) = N log(1 + form of KL-divergence
p−1 pxj xj
j=1 eθj ), and its conjugate is ψ(µ) = N log , leading to a
j=1 N N
p xj xj /N
Dψ (x, µ) = N log
j=1 N µj /N
Remark 2.12. Note that in a multinomial distribution, x is almost surely discrete since the base measure is discrete. The bijection implies that if Dψ is the KL divergence, the corresponding regular exponential family is a multinomial, hence also discrete, even though Dψ can potentially take any normalized vector x ∈ Rp+ . In practice, and in our experiments with the audio represen-tation described in § 2.1, one can fix a “number of trials” integer N , normalize x ∈ Rp+ such that j =1 xj = N and discretize each coordinate xj to a close integer while keeping the sum equal to N . Larger values of N can give a better, more granular approximation, but it is best to choose N empirically according to experimental results on data. Since our algorithms don’t require computing the ancillary statistic h(x) thanks to normalization, the non-integer representation can also be used directly without discretization as an approximation.

Mixture models and the EM algorithm

A mixture model with K mixture components is a probabilistic latent variable model which can be described by the following generative process
zi ∼ π, i = 1, , n
xi|zi ∼ pµz , i = 1, , n,
where zi ∈ {1, , K } is a latent random variable indicating which cluster observation xi belongs to. z ∼ π indicates that z follows a categorical distribution (or multinomial with 1 trial) with parameter π ∈ RK , k πk = 1. pµ k denotes the observation/emission distribution of mixture component k, with parameter µk . In the case of a Gaussian mixture model, the emission parame-ters are the means and covariance matrices, µk = (mk , Σk ). If we take our emission distributions to be in the regular exponential family corresponding to a Bregman divergence Dψ given by the density p(ψ∗ ,θ) , we can take a mean parameterization pµk = p(ψ∗ ,∇ψ(µk )) which takes the form pµk (x) = h(x) exp(−Dψ (x, µk )) (2.14)
We will call this a Bregman distribution associated with ψ.
A graphical model representation of the mixture model is given in figure 2.1. The joint distribution of the observed variables x = (xi)i=1..n and hidden variables z = (zi)i=1..n factorizes as follows p(x, z; π, µ) = p(zi; π)p(xi|zi; µ) i=1 n K K = ✶(zi =k) p(xi|zi = k; µk ) ✶(zi =k) , (2.15) πk
i=1 k=1 k=1
where p(xi|zi = k; µk ) = pµk (xi).
In order to estimate parameters π and µ = (µk )k=1..K from the observed data, we can use a maximum likelihood estimator and try to maximize the (log-)likelihood of the observed data, given by ℓ(π, µ) = log p(x; π, µ) = log p(xi; π, µ) = log p(xi, zi = k; π, µ) (2.16) i=1 i=1 k=1
Note that like minimizing the K-means objective, maximizing this log-likelihood is a non-convex problem and thus makes it hard to find a global maximum. It is possible nonetheless to find local maxima using ascent methods, which try increase the value of the log-likelihood after each iteration. A standard technique used for maximum likelihood estimation in latent variable models is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), which iteratively maximizes lower bounds on the likelihood.
Derivation of EM If we denote by x the set of observed variables, by z the hidden variables and by θ the set of parameters, we can use Jensen’s inequality to find a lower bound on the log-likelihood in terms of any probability distribution q on z ℓ(θ) = log p(x, z; θ) = logq(z) p(x, z; θ) q(z) ≥ q(z) log p(x, z; θ) (2.17)
The EM algorithm proceeds by maximizing this lower bound with respect to q and θ, in a coordinate-ascent fasion. Maximizing it with respect to the distribution q can be done by having an equality in Jensen’s inequality, thus having a tight bound. The equality case corresponds to having a constant in the expectation, i.e. q(z) ∝ p(x, z; θ), that is q(z) = p(z|x; θ), (2.18) which is known as the E-step, since it usually corresponds to computing expected suﬃcient statistics. If we now fix q in (2.17), we can now maximize this new lower bound as a function of θ, which, up to a constant term (the entropy of q), is equal to the expected complete-data likelihood under q EZ∼q [log p(x, z; θ)] (2.19)
This is known as the M-step. This optimization problem can often be solved in closed form, as it resembles a complete-data maximum-likelihood problem. Given an initial parameter θ(0) , the EM algorithm thus repeats the following two steps until convergence
• (E-step) Compute q using current parameter θ q(z) = p(z|x; θ(t) )
• (M-step) Update parameter θ θ(t+1) = arg max EZ∼q [log p(x, z; θ)]
Since the lower bound 2.17 increases after each step and is tight after each E-step, the log-likelihood increases after each iteration, that is, ℓ(θ(t+1) ) ≥ ℓ(θ(t) ), and thus converges, assuming it is bounded above3 . A good stopping criterion is then to see if the increase in likelihood is small. The rate of convergence of EM is linear4 , but is often preﬀered in practice to second-order methods because of its simplicity and because it generally suﬃces to quickly get a good fit to the sample data in typical machine learning applications (Xu and Jordan, 1996).
EM for mixture models In order to derive an EM algorithm for our mixture model, we can start by computing the expected complete-data log-likelihood used in the M-step, following Eq. (2.15)
EZ∼q [log p(x, z; π, µ)] = Eq [✶{zi = k}] log πk + Eq [✶{zi = k}] log p(xi|zi = k; µk ), where the τik := Eq [✶{zi = k}] = p(zi = k|xi) are computed using the previous parameter estimates in the E-step as follows τik = p(zi = k|xi; π, µ) = Z p(zi = k; π)p(xi|zi = k; µ) = Z1 πk e−Dψ (xi ,µk ) , where Z is a normalizing constant which ensures case of Bregman emission distributions (2.14).
Maximizing (2.20) with respect to π is straightforward, while maximizing with respect to µk in the case of a Bregman distribution is equivalent to minimizing
i τik Dψ (xi, µk ) ,
i τik
which, by Proposition 2.2, gives
µk = i τik xi
i τik
Note that this can also be seen as computing expected suﬃcient statistics and doing moment-matching (see, e.g., Wainwright and Jordan (2008)). The entire algorithm for mixtures of Breg-man distributions, i.e. Bregman soft-clustering, is given in Algorithm 2.2.
Algorithm 2.2: EM algorithm for soft clustering with Bregman divergences.
Data: x1 , , x n, initial parameters π, µ, Bregman divergence Dψ .
Result: final parameters π, µ, soft-assignments τik

READ Service Traffic Shaping for Resource Protection in Enterprise Data Centers

Hidden Markov Models (HMMs)

Mixture models assume that the observations are i.i.d., and therefore using such a model in the case of our audio representation presented in Section 2.1 would ignore the temporal dependencies in the data. Hidden Markov models (HMMs) (Cappé et al., 2005; Rabiner, 1989) are latent variable models in which the latent state variables evolve with Markovian dynamics, and thus are well-suited for modeling data with sequential structure, as is the case for audio.

Model

Let (xt)t=1..T be our sequence of observations, with xt ∈ Rp, and (zt)t=1..T be our hidden state sequence, where each state is one of K states, that is zt ∈ {1, , K }. An HMM is a generative latent variable model which can be described by the following generative process, for which a graphical model representation is given in Figure 2.2
z1 ∼ π
zt|zt−1 = i ∼ Ai, t = 2, , T
xt|zt = i ∼ pµi , t = 1, , T
where π gives distribution of z1 , A ∈ RK+ ×K is a transition matrix such that Aij = p(zt = j|zt−1 = i), A1 = 1 and Ai = (Aij )j , where 1 = (1, , 1)⊤ . µk is the parameter of the k-th emission distribution, which we will consider to be the distribution associated to a Bregman divergence Dψ , although one can easily extend what follows to other emission distributions such as Gaussians. The joint probability of a sequence z1:T = (z1 , , z T ) and observations x1:T = (x1 , , x T ) is p(x1:T , z1:T ; π, A, µ) = p(z1 ; π)p(zt|zt−1 ; A) p(xt|zt; µ) (2.21) t=2 t=1

Inference

The goal of probabilistic inference is to infer hidden variables from observed variables, when the parameters θ of the model are fixed. This generally means computing the posterior distribution p(zQ|x; θ) over a set of query variables zQ among hidden variables. A diﬀerent form of inference is maximum a posteriori (MAP) inference, which aims at finding an assignment over hidden variables with maximum posterior probability, zM AP = arg maxZ p(z|x; θ) = arg maxZ p(x, z; θ).
In HMMs, there are a few inference tasks of interest:
• Smoothing: compute p(zt|x1:T ) for t < T , uses all available observations to compute the marginal of a particular hidden state.
• Filtering: compute p(zt|x1:t), which is particularly useful when inference is done online, as is the case for the score-following system Antescofo.
• Prediction: compute p(zt|x1:T ) for t > T , useful for predicting future states.
• M AP = arg maxz1:T p(z1:T |x1:T ).
MAP inference: compute the most likely sequence z1:T
We now describe the forward-backward and Viterbi algorithms, which are the standard algo-rithms for posterior and MAP inference, respectively.
Forward-Backward algorithm The main diﬃculty of posterior inference comes from the complicated summations over many latent variables that appear for marginalization and for computing the normalization constants in the target posterior distribution. However, the fac-torized form of the joint probability in Eq. (2.21) makes it possible to propagate these sums in smaller factors and then propagate back, as is common in standard belief propation or message passing algorithms (also known as sum-product algorithms) in tree-structured graphical models, making the algorithm much more eﬃcient. In the specific case of HMMs, it is usual to define
αt(i) = p(zt = i, x1 , , x t)
βt(i) = p(xt+1 , , x T |zt = i)
The forward-backward algorithm computes these quantities recursively as follows. If we set α1 (i) = πip(x1 |z1 = i; µi), the other αt can be computed using the forward recursion αt+1 (j) = p(zt+1 = j, x1 , , x t+1 ) = p(zt = i, zt+1 = j, x1 , , x t+1 ) = p(zt = i, x1 , , x t)p(zt+1 = j|zt = i)p(xt+1 |zt+1 = j) = αt(i)Aij p(xt+1 |zt+1 = j; µj )
Similarly, if we let βT (i) = 1, the βt are computed using the backward recursion βt(i) = p(xt+1 , , x T |zt = i, zt+1 = j)p(zt+1 = j|zt = i) = Aij p(xt+1 |zt+1 = j; µj )βt+1 (j)
The time complexity of the algorithm is O(T K2 ). Note that these quantities can be very small, leading to numerical problems in the implementation. The usual way to deal with this problem is to use the logarithms of all the quantities involved, and use the log-sum-exp operation when a summation is involved.
Once the α and β quantities are computed, one can easily obtain the following useful quantities
p(zt = i|x1:T ) = p(zt = i, x1:t)p(xt+1:T |zt = i) = 1 αt(i)βt(i)
p(x1:T ) Z
p(zt = i, zt+1 = j|x1:T ) = 1 αt(i)Aij p(xt+1 |zt+1 = j; µj )βt+1 (j)
p(zt = i|x1:t) = 1 αt(i)
p(x1:T ) = αT (i),
where Z is the appropriate normalization constant in each equation.
M AP = arg maxz1:T p(z1:T |x1:T ),
Viterbi algorithm In order to compute the MAP estimate z1:T we use a similar recursion to the forward recursion, but where we replace i by maxi. This is known as the max-product or Viterbi algorithm. Let’s define γt(i) = max p(z1 , , z t−1 , zt = i, x1 , , x t) z1 ,…,zt−1
If we let γ1 (i) = πip(x1 |z1 = i; µi), we have the recursion γt+1 (j) = max γt(i)Aij p(xt+1 |zt+1 = j; µj ) (2.22)
By storing backpointers bt+1 (j) = arg maxi γt(i)Aij p(xt+1 |zt+1 = j; µj ), we can then recover the MAP sequence z1:T by taking zT = arg maxi γT (i) and zt = bt+1 (zt+1 ) for t = T − 1, , 1.

EM algorithm

We now derive an EM algorithm for HMMs, allowing us to estimate good parameters θ = (π, A, µ)
from data. We start by computing the complete log-likelihood from Eq. (2.21) ℓ(θ) = log p(x1:T , z1:T ; θ) = ✶{z1 = i} log πi + ✶{zt−1 = i, zt = j} log Aij + ✶{zt = i} log p(xt|zt = i; µi) i t≥2 ij t≥1 i
The E-step thus consists in computing the expected suﬃcient statistics p(zt = i|x1:T ; θ(k) ) and p(zt−1 = i, zt = j|x1:T ; θ(k) ) given current parameter θ(k) , which is an inference task and can be solved with the forward-backward algorithm. The M-step then uses these expected suﬃcient statistics to maximize the expected complete log-likelihood and obtain a new parameter θ(k+1) = arg maxθ′ Ez [ℓ(θ′ )|x1:T ; θ(k ) ]. The resulting algorithm is outlined in Algorithm 2.3. Note that because the likelihood is non-convex, initialization plays a big role, and emissions are often initialized using the results of K-means or a mixture model.
Algorithm 2.3: EM algorithm for HMMs.
Data: x1:T , initial parameters θ = (π, A, µ), Bregman divergence Dψ .
Result: final parameters θ = (π, A, µ).
repeat
// E-step
α, β ← ForwardBackward(x1:T , θ)
τt(i) ← p(zt = i|x1:T ; θ) = Z1 αt(i)βt(i)
τt(i, j) ← p(zt−1 = i, zt = j|x1:T ; θ) = Z1 αt−1 (i)Aij p(xt|zt = j; µj )βt(j)
until convergence;

Hidden Semi-Markov Models (HSMMs)

One drawback of HMMs is their limited capability for expressing segment durations, that is the number of time steps during which the hidden state stays the same. In fact, if we consider the prior probability – ignoring observations – of staying in a given state i for exactly d time steps, it is equal to Adii−1 (1 − Aii), that is, the duration distribution for each state is geometric. Although the parameters Aii can be estimated to give a good fit to the data in many applications, the choice of a geometric distribution can be quite restrictive in some cases.
Hidden Semi-Markov Models (HSMMs) allow us to overcome this problem by explicitely modeling the segment durations with any distribution (Murphy, 2002; Guédon, 2003).

Table of contents :

1 Introduction
2 Representations, models and offline algorithms
2.1 Audio signal representation
2.2 Bregman divergences
2.2.1 Definition and examples
2.2.2 Duality
2.3 Clustering with Bregman divergences
2.3.1 Bregman centroids
2.3.2 K-means
2.3.3 Bregman divergences and exponential families
2.3.4 Mixture models and the EM algorithm
2.4 Hidden Markov Models (HMMs)
2.4.1 Model
2.4.2 Inference
2.4.3 EM algorithm
2.5 Hidden Semi-Markov Models (HSMMs)
2.5.1 Model
2.5.2 Inference
2.5.3 EM algorithm
2.6 Summary and discussion
3 Online algorithms
3.1 Online EM
3.1.1 Online EM for HMMs
3.1.2 Online EM for HSMMs
3.2 Non-probabilistic models
3.2.1 Online algorithm
3.3 Incremental EM
3.3.1 Incremental EM for HMMs
3.3.2 Semi-Markov extension
3.4 Including prior knowledge with Bayesian priors
3.5 Experiments on synthetic data
3.6 Summary and discussion
4 Audio segmentation experiments
4.1 Offline segmentation results
4.2 Online EM for HMMs and HSMMs
4.3 Online vs incremental EM for HMMs
4.4 Segmentation of acoustic scenes
4.5 Summary and discussion