Tensors and Estimation in Latent Linear Models 


Latent Linear Models for Single-View Data

Gaussian Mixture Models

One of the simplest latent linear models for continuous data is the Gaussian mixture model (GMM) [see, e.g., Bishop, 2006, Murphy, 2012, and references therein]. The model assumes $K$ hidden states and a Gaussian distribution associated with each state. The generative process consists of two steps: (a) sampling the state from a discrete distribution and (b) sampling an observation from the Gaussian distribution associated with the sampled state (see a graphical representation of such a model in Figure 1-1a).
To formalize this model, one introduces a latent variable which can take one of $K$ discrete states $\{1, 2, \dots, K\}$. It is convenient to model it using one-hot encoding, i.e. as a $K$-vector $\mathbf{z}$ with only the $k$-th element equal to one and the rest equal to zero (the $k$-th canonical basis vector $\mathbf{e}_k$), which corresponds to the state $k$. The discrete prior (see (A.4) in Appendix A.1 for the definition) is then used for the state $\mathbf{z}$:
$$p(\mathbf{z}) = \operatorname{Mult}(1, \boldsymbol{\pi}) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad (1.1)$$
where the parameter $\boldsymbol{\pi}$ is constrained to the $(K-1)$-simplex, i.e. $\boldsymbol{\pi} \in \Delta_K$. For every state $k$, a base distribution of the $\mathbb{R}^M$-valued observed variable $\mathbf{x}$ is modeled as a Gaussian distribution (A.1), i.e. $p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, which gives the conditional distribution
$$p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}. \qquad (1.2)$$
Therefore, the marginal distribution of the observation variable $\mathbf{x}$ is given by
$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad (1.3)$$
which is a convex combination of the base Gaussian distributions $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, i.e. a Gaussian mixture, hence the name. The fact that the expectation is $\mathbb{E}(\mathbf{x}) = \mathbf{D}\boldsymbol{\pi}$, where the matrix $\mathbf{D}$ is formed by stacking the centers $\boldsymbol{\mu}_k$, i.e. $\mathbf{D} = [\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K]$, explains why the GMM belongs to latent linear models. The GMM is illustrated in Figure 1-1b using the plate notation standard in the graphical modeling literature [Buntine, 1994, Comon and Jutten, 2010, see also Notation Section]. By choosing different base distributions, one can obtain mixtures of other distributions, with topic models as an example (see Section 1.2).
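To make the two-step generative process and the relation $\mathbb{E}(\mathbf{x}) = \mathbf{D}\boldsymbol{\pi}$ concrete, here is a minimal NumPy sketch; the dimensions, mixing weights, centers, and covariances are illustrative values chosen for the example, not parameters from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, n_samples = 3, 2, 50_000

pi = np.array([0.5, 0.3, 0.2])                  # mixing weights pi in the simplex
D = rng.normal(size=(M, K))                     # centers mu_k stacked as columns of D
Sigmas = [0.1 * np.eye(M) for _ in range(K)]    # one covariance per state

# (a) sample the state z, then (b) sample x from the Gaussian of that state
states = rng.choice(K, size=n_samples, p=pi)
X = np.stack([rng.multivariate_normal(D[:, k], Sigmas[k]) for k in states])

# the sample mean approaches D @ pi, which is why the GMM is a latent *linear* model
print(np.allclose(X.mean(axis=0), D @ pi, atol=2e-2))
```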
The estimation for Gaussian mixture models is a difficult task [see, e.g., Dasgupta, 1999, Arora and Kannan, 2001, Anandkumar et al., 2012b].

Factor Analysis

One problem with mixture models is that they use only a single state to generate each observation, i.e. each observation can come from only one of the $K$ base distributions. Indeed, the latent variable in mixture models is represented using one-hot encoding and only one state is sampled at a time. An alternative is to use a real-valued vector $\boldsymbol{\alpha} \in \mathbb{R}^K$ to represent the latent variable. The simplest choice of the prior is again a Gaussian:
$$\boldsymbol{\alpha} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \qquad (1.4)$$
It is also natural to choose a Gaussian for the conditional distribution of the continuous observation vector $\mathbf{x} \in \mathbb{R}^M$:
$$\mathbf{x} \mid \boldsymbol{\alpha} \sim \mathcal{N}(\boldsymbol{\mu} + \mathbf{D}\boldsymbol{\alpha}, \boldsymbol{\Psi}), \qquad (1.5)$$
where $\boldsymbol{\mu} \in \mathbb{R}^M$ is a shift vector, the matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ is called the factor loading matrix, and $\boldsymbol{\Psi} \in \mathbb{R}^{M \times M}$ is the covariance matrix. The elements of the latent variable $\boldsymbol{\alpha}$ are also called factors, while the columns of $\mathbf{D}$ are called factor loadings. This model is known under the name of factor analysis [Bartholomew, 1987, Basilevsky, 1994, Bartholomew et al., 2011] and it makes the conditional independence assumption that the elements $x_1, x_2, \dots, x_M$ of the observed variable $\mathbf{x}$ are conditionally independent given the latent variable $\boldsymbol{\alpha}$. Therefore, the covariance matrix $\boldsymbol{\Psi}$ is diagonal.
It is not difficult to show that the marginal distribution of the observed variable is also a Gaussian:
$$p(\mathbf{x}) = \int p(\mathbf{x} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha})\, d\boldsymbol{\alpha} = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{D}\mathbf{D}^\top + \boldsymbol{\Psi}). \qquad (1.6)$$
Intuitively, this means that the factor analysis model explains the covariance of the observed data as a combination of two terms: (a) the independent variance associated with each coordinate (in the matrix $\boldsymbol{\Psi}$) and (b) the covariance between coordinates (captured in the matrix $\mathbf{D}$). Moreover, this representation of the covariance uses a low-rank decomposition (if $K < M$) and only $O(MK)$ parameters instead of the $O(M^2)$ parameters of a full-covariance Gaussian. Note, however, that if $\boldsymbol{\Psi}$ is not restricted to be diagonal, it can be trivially set to a full matrix and $\mathbf{D}$ to zero, in which case the latent factors would not be required. The factor analysis model is illustrated using the plate notation in Figure 1-2a.
One can also view the factor analysis model from the generative point of view. In this case, the observed variable $\mathbf{x}$ is sampled by (a) first sampling the latent factors $\boldsymbol{\alpha}$, then (b) applying the linear transformation $\mathbf{D}$ to the sampled latent factors and adding the linear shift $\boldsymbol{\mu}$, and finally (c) adding the Gaussian noise $\boldsymbol{\varepsilon}$:
$$\mathbf{x} = \boldsymbol{\mu} + \mathbf{D}\boldsymbol{\alpha} + \boldsymbol{\varepsilon}, \qquad (1.7)$$
where the $\mathbb{R}^M$-valued additive Gaussian noise is $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi})$ (see an illustration using the plate notation in Figure 1-2b). This point of view explains why factor analysis is a latent linear model: it is essentially a linear transformation of latent factors.
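As a quick sanity check of (1.6) and (1.7), the following sketch samples from the generative process and compares the empirical covariance to $\mathbf{D}\mathbf{D}^\top + \boldsymbol{\Psi}$; all dimensions and parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, n = 5, 2, 200_000

mu = rng.normal(size=M)
D = rng.normal(size=(M, K))                    # factor loading matrix
psi = rng.uniform(0.1, 0.5, size=M)            # diagonal of the noise covariance Psi

alpha = rng.normal(size=(n, K))                # latent factors  alpha ~ N(0, I)
eps = rng.normal(size=(n, M)) * np.sqrt(psi)   # noise  eps ~ N(0, Psi), Psi diagonal
X = mu + alpha @ D.T + eps                     # x = mu + D alpha + eps

emp_cov = np.cov(X, rowvar=False)
print(np.abs(emp_cov - (D @ D.T + np.diag(psi))).max())   # small for large n
```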
Although inference in the factor analysis model is an easy task, the model is unidentifiable. Indeed, the covariance of the observed variable under the factor analysis model in (1.6) contains the term $\mathbf{D}\mathbf{D}^\top$. Let $\mathbf{Q}$ be an arbitrary $K \times K$ orthogonal matrix. Then right-multiplying $\mathbf{D}$ by this orthogonal matrix, i.e. $\tilde{\mathbf{D}} = \mathbf{D}\mathbf{Q}$, does not change the distribution: $\tilde{\mathbf{D}}\tilde{\mathbf{D}}^\top = \mathbf{D}\mathbf{Q}\mathbf{Q}^\top\mathbf{D}^\top = \mathbf{D}\mathbf{D}^\top$. Thus a whole family of matrices $\tilde{\mathbf{D}}$ gives rise to the same likelihood (1.6). Geometrically, multiplying $\mathbf{D}$ by an orthogonal matrix can be seen as a rotation of the latent factors before generating the observations $\mathbf{x}$. However, since $\boldsymbol{\alpha}$ is drawn from an isotropic Gaussian, this does not influence the likelihood. Consequently, one cannot uniquely identify the parameter $\mathbf{D}$, nor can one identify the latent factors $\boldsymbol{\alpha}$, independently of the type of estimation and inference methods used.
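The rotation unidentifiability is easy to verify numerically: for any orthogonal $\mathbf{Q}$, the loadings $\mathbf{D}$ and $\mathbf{D}\mathbf{Q}$ induce the same marginal covariance. A minimal sketch with illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
M, K = 6, 3
D = rng.normal(size=(M, K))

# a random orthogonal Q from a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(K, K)))
D_tilde = D @ Q

# same covariance term DD^T, hence the same likelihood (1.6)
print(np.allclose(D_tilde @ D_tilde.T, D @ D.T))
```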
This unidentifiability does not affect the predictive performance of the factor analysis model, since the likelihood does not change. However, it does affect the factor loading matrix and, therefore, the interpretation of the latent factors. Since factor analysis is often used to uncover the latent structure in the data, this issue causes serious problems. Numerous attempts were made to address this problem by adding further assumptions to the model. This includes some heuristic methods for choosing a “meaningful” rotation of the latent factors, e.g., the varimax approach [Kaiser, 1958], which maximizes the variance of the squared loadings of a factor on all the variables. More rigorous approaches are based on adding supplementary constraints on the factor loading matrix; the most notable one is perhaps sparse principal component analysis [Zou et al., 2006], which is a separate field of research on its own [see, e.g., Archambeau and Bach, 2008, d’Aspremont et al., 2008, Journée et al., 2010, d’Aspremont et al., 2014]. An alternative approach is to use non-Gaussian priors for the latent factors, which is well known under the name of independent component analysis (see Section 1.1.4).
Factor analysis was also extended to multiway data as parallel factor analysis (Parafac) [Harshman and Lundy, 1994], or three-mode principal component analysis [Kroonenberg, 1983]. Interestingly, Parafac is also the tensor decomposition which is used in the algorithmic framework of this thesis (see Section 2.1.2).

Probabilistic Principal Component Analysis

Standard principal component analysis (PCA) [Pearson, 1901, Jolliffe, 2002] is an algebraic tool that finds a low-dimensional subspace such that, if the original data is projected onto this subspace, the variance of the projected data is maximized. It is well known that this subspace can be defined by the empirical mean of the data sample and the eigenvectors of the empirical covariance matrix. The eigenvectors of this covariance matrix, sorted in decreasing order of the eigenvalues, are called principal directions. Although this PCA solution is uniquely defined (given that all eigenvalues are distinct), the principal components form only one possible basis of the “best” low-dimensional subspace; any other basis, e.g., obtained with any orthogonal transformation of the principal components, would be a solution as well. As we shall see shortly, this solution is directly related to a special case of the factor analysis model. Therefore, standard PCA partially resolves the unidentifiability of factor analysis. However, since each principal component is a linear combination of the original variables, the PCA solution is still difficult to interpret.
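A compact sketch of the algebraic recipe just described: center the data, eigendecompose the empirical covariance, and sort the eigenvectors by decreasing eigenvalue to obtain the principal directions. The data here is random placeholder data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # some correlated data

Xc = X - X.mean(axis=0)                       # subtract the empirical mean
cov = Xc.T @ Xc / (len(X) - 1)                # empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # sort in decreasing order
principal_directions = eigvecs[:, order]      # columns are the principal directions
projected = Xc @ principal_directions[:, :2]  # projection onto the top-2 subspace
```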
Although PCA is not necessarily considered to be a method based on Gaussian distributions, it can be justified using Gaussians. Indeed, the particular case of the factor analysis model with isotropic covariance, i.e. $\boldsymbol{\Psi} = \sigma^2 \mathbf{I}$:
$$\boldsymbol{\alpha} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{x} \mid \boldsymbol{\alpha} \sim \mathcal{N}(\boldsymbol{\mu} + \mathbf{D}\boldsymbol{\alpha}, \sigma^2 \mathbf{I}), \qquad (1.8)$$
is known under the name of probabilistic principal component analysis (see an illustration using the plate notation in Figure 1-3a).
Roweis [1998] and Tipping and Bishop [1999] give a probabilistic interpretation of PCA: the PCA solution can be expressed as the maximum likelihood solution of the probabilistic principal component analysis model when $\sigma^2 \to 0$. In particular, the factor loading matrix of probabilistic PCA is equal to $\mathbf{D} = \mathbf{V}(\boldsymbol{\Lambda} - \sigma^2 \mathbf{I})^{1/2}\mathbf{Q}$, where $\mathbf{V}$ is the matrix with the principal directions in its columns, $\boldsymbol{\Lambda}$ is the diagonal matrix with the respective eigenvalues of the empirical covariance matrix on the diagonal, and $\mathbf{Q}$ is an arbitrary orthogonal matrix. This unidentifiability of probabilistic PCA is inherited from factor analysis. Therefore, PCA is unidentifiable as well: despite the fact that the standard PCA solution is unique, PCA is defined by a subspace and the principal directions are only one basis of this subspace; an arbitrary rotation of this basis does not change the subspace.
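The relation $\mathbf{D} = \mathbf{V}(\boldsymbol{\Lambda} - \sigma^2 \mathbf{I})^{1/2}\mathbf{Q}$ can be checked numerically: whichever orthogonal $\mathbf{Q}$ is chosen, the implied model covariance $\mathbf{D}\mathbf{D}^\top + \sigma^2\mathbf{I}$ is the same. A sketch under illustrative data, keeping the top $K$ eigenpairs and taking $\sigma^2$ as the average of the discarded eigenvalues (the maximum likelihood estimate of Tipping and Bishop [1999]):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 6)) @ rng.normal(size=(6, 6))
S = np.cov(X, rowvar=False)                      # empirical covariance

K = 2
eigvals, V = np.linalg.eigh(S)
eigvals, V = eigvals[::-1], V[:, ::-1]           # sort in decreasing order
sigma2 = eigvals[K:].mean()                      # ML noise level (average discarded eigenvalue)

Lam = np.diag(eigvals[:K])
D1 = V[:, :K] @ np.sqrt(Lam - sigma2 * np.eye(K))   # the choice Q = I
Q, _ = np.linalg.qr(rng.normal(size=(K, K)))
D2 = D1 @ Q                                         # an arbitrary rotation of the loadings

# both loadings induce the same model covariance DD^T + sigma^2 I
print(np.allclose(D1 @ D1.T, D2 @ D2.T))
```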

Independent Component Analysis

Independent component analysis (ICA) [Jutten, 1987, Jutten and Hérault, 1991, Hyvärinen et al., 2001, Comon, 1994, Comon and Jutten, 2010] was originally developed in the blind source separation (BSS) context. A typical BSS problem is the so-called cocktail party problem: we are given several speakers (sources) and several microphones (sensors), each detecting a noisy linear combination of the source signals. The task is to separate the individual sources from the mixed signal.
Noisy ICA. ICA models this problem in a natural way as follows:
$$\mathbf{x} = \mathbf{D}\boldsymbol{\alpha} + \boldsymbol{\varepsilon}, \qquad (1.9)$$
where the vector $\mathbf{x} \in \mathbb{R}^M$ represents the observed signal, the vector $\boldsymbol{\alpha} \in \mathbb{R}^K$ with mutually independent components stands for the latent sources, the vector $\boldsymbol{\varepsilon} \in \mathbb{R}^M$ is the additive noise, and the matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ is the mixing matrix.
Noiseless ICA. To simplify estimation and inference in the ICA model, it is common to assume that the noise level is zero, in which case one rewrites the ICA model (1.9) as:
$$\mathbf{x} = \mathbf{D}\boldsymbol{\alpha}. \qquad (1.10)$$
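As an illustration of the noiseless model (1.10), the sketch below mixes two independent non-Gaussian sources and recovers them, up to permutation and scaling, with FastICA from scikit-learn; the use of FastICA, the dimensions, and the mixing matrix are choices made for this example only, not the algorithms developed in the thesis.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
n, K = 5000, 2

# independent non-Gaussian sources (uniform and Laplace), mixed by D
alpha = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])
D = np.array([[1.0, 0.5], [0.4, 1.0]])        # mixing matrix
X = alpha @ D.T                               # noiseless ICA: x = D alpha

ica = FastICA(n_components=K, random_state=0)
alpha_hat = ica.fit_transform(X)              # estimated sources
D_hat = ica.mixing_                           # estimated mixing matrix, recovered only
                                              # up to permutation and scaling of columns
```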
Alternatively, a different simplifying assumption for the noisy ICA model in (1.9) is that the noise is Gaussian (see, e.g., Section 2.4).
Identifiability. It is straightforward to see the connection between the factor analysis formulation in (1.7) and the ICA model in (1.9). In fact, factor analysis is a special case of the ICA model where the sources and the additive noise are constrained to be independent Gaussians (one can ignore the shift vector since observations can be centered to have zero mean). However, ICA generally relaxes the Gaussianity assumption, preserving only the independence of the sources, although assumptions on the additive noise may vary. The Gaussianity assumption on the sources can be too restrictive and considering other priors can lead to models with higher expressive power. Moreover, as we mentioned in Section 1.1.2, the Gaussian latent factors (sources) are actually the reason for the unidentifiability of factor analysis. Indeed, a well known result says that the mixing matrix and the latent sources of ICA are essentially identifiable (see below) if at most one source is Gaussian [Comon, 1994]. Hence, one can see ICA as an identifiable version of factor analysis.
In any case, the permutation and scaling of the mixing matrix and sources in the ICA model (as in all other latent linear models) can never be identified. Indeed, the product $\alpha_k \mathbf{d}_k$ does not change if one simultaneously rescales (including a sign change) both terms by some non-zero constant $c \neq 0$: $(c\,\mathbf{d}_k)(c^{-1}\alpha_k) = \mathbf{d}_k \alpha_k$; nor does the product $\mathbf{D}\boldsymbol{\alpha}$ change if one consistently permutes both the columns of $\mathbf{D}$ and the elements of $\boldsymbol{\alpha}$. Therefore, it only makes sense to talk about identifiability up to permutation and scaling, which is sometimes referred to as essential identifiability [see, e.g., Comon and Jutten, 2010]. One can also define a canonical form where, e.g., the columns of the mixing matrix are constrained to have unit $\ell_1$-norm.
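A small sketch of such a canonical form: normalizing the columns of two mixing matrices (say, a reference and an equivalent reparametrization) to unit $\ell_1$-norm and a fixed sign, after which the columns agree up to a permutation. The matrices and the helper `canonical` are illustrative, not part of the thesis.

```python
import numpy as np

def canonical(D):
    """Scale each column to unit l1-norm with a non-negative first entry."""
    D = D / np.abs(D).sum(axis=0)             # fix the scale
    return D * np.sign(D[0, :])               # fix the sign

rng = np.random.default_rng(6)
D = rng.normal(size=(4, 3))
perm = rng.permutation(3)
scales = rng.uniform(0.5, 2.0, size=3) * rng.choice([-1, 1], size=3)
D_equiv = D[:, perm] * scales                 # same model, different representation

A, B = canonical(D), canonical(D_equiv)
# after fixing scale and sign, the columns agree up to a permutation
match = np.array([np.argmin(np.abs(A - B[:, [j]]).sum(axis=0)) for j in range(3)])
print(sorted(match) == [0, 1, 2])
```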
Independent Subspace Analysis (ISA). An interesting geometric interpretation of the permutation and scaling unidentifiability was provided by Cardoso [1998], where ICA is equivalently interpreted as the sum of vectors $\mathbf{w}_k \in \mathcal{S}_k$:
$$\mathbf{x} = \sum_{k=1}^{K} \mathbf{w}_k + \boldsymbol{\varepsilon}, \qquad (1.11)$$
from one-dimensional subspaces $\mathcal{S}_k := \{\mathbf{w} \in \mathbb{R}^M : \mathbf{w} = \alpha\,\mathbf{d}_k,\ \alpha \in \mathbb{R}\}$ determined by the vectors $\mathbf{d}_k$. Each such subspace can actually be identified, given that the vectors $\mathbf{d}_k$ are linearly independent, but the representation of every such subspace is clearly not unique. This gives rise to multidimensional independent component analysis (MICA), which looks for orthogonal projections onto (not necessarily one-dimensional) subspaces $\mathcal{S}_j := \{\mathbf{w} \in \mathbb{R}^M : \mathbf{w} = \sum_{k=1}^{K_j} \alpha^{(j)}_k \mathbf{d}^{(j)}_k,\ \alpha^{(j)}_k \in \mathbb{R}\}$, where $K_j$ is the dimension of the $j$-th subspace, rather than looking for the linear transformation $\mathbf{D}$. In such a model, the source vector consists of $J$ blocks, $\boldsymbol{\alpha} = (\boldsymbol{\alpha}^{(1)}, \boldsymbol{\alpha}^{(2)}, \dots, \boldsymbol{\alpha}^{(J)})$, where each block is $\boldsymbol{\alpha}^{(j)} = (\alpha^{(j)}_1, \alpha^{(j)}_2, \dots, \alpha^{(j)}_{K_j})$, and the total number of sources is preserved, $K = \sum_{j=1}^{J} K_j$. For such sources, the independence assumption is replaced with the following: the components inside one block $\alpha^{(j)}_1, \alpha^{(j)}_2, \dots, \alpha^{(j)}_{K_j}$ can be dependent; however, the blocks $\boldsymbol{\alpha}^{(1)}, \boldsymbol{\alpha}^{(2)}, \dots, \boldsymbol{\alpha}^{(J)}$ are mutually independent.
This model is also known under the name of independent subspace analysis (ISA) [Hyvärinen and Hoyer, 2000]. Cardoso [1998] conjectured that the ISA problem can be solved by first solving the ICA task and then clustering the ICA elements into statistically independent groups. Szabó et al. [2007] prove that this is indeed the case: under some additional conditions, the solution of the ISA task reduces to a permutation of the ICA task [see also Szabó et al., 2012].
A special class of ICA estimation and inference algorithms, known as algebraic cumulant-based algorithms, is of central importance in this thesis. We describe these algorithms in Sections 2.2.2 and 2.4.2 and use them to develop fast and efficient algorithms for topic models through a close connection to ICA (see Chapter 3).


Dictionary Learning

Another class of latent linear models is the signal processing tool called dictionary learning [Olshausen and Field, 1996, 1997], which targets the approximation of the observed signal $\mathbf{x} \in \mathbb{R}^M$ with a linear combination of dictionary atoms, which are the columns of the matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$. A special case of dictionary learning is the $\ell_1$-sparse coding problem. It aims at minimizing $\|\mathbf{x} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 + \lambda \|\boldsymbol{\alpha}\|_1$, which is well known to enforce sparsity on $\boldsymbol{\alpha}$ given that the regularization parameter $\lambda$ is chosen appropriately [Tibshirani, 1996, Chen et al., 1999]. This minimization problem is equivalent to the maximum a posteriori estimator of the noisy ICA model (1.9) where the additive noise is Gaussian and the sources are independent Laplace variables. The Laplace distribution is often considered a sparsity inducing prior since it has (slightly) heavier tails than the Gaussian. Another way to see the connection between the two models is to replace the $\ell_2$-distance with the KL-divergence and look for the so-called demixing matrix which minimizes the mutual information between the demixed signals [this is one of the approaches to ICA; see, e.g., Comon and Jutten, 2010]. However, these topics are outside of the scope of this thesis.
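A minimal sketch of the $\ell_1$-sparse coding step, using scikit-learn's Lasso to compute a sparse code of one signal in a fixed, made-up dictionary; full dictionary learning would alternate this step with updates of $\mathbf{D}$, which is omitted here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
M, K = 20, 50
D = rng.normal(size=(M, K))
D /= np.linalg.norm(D, axis=0)                # unit-norm dictionary atoms

alpha_true = np.zeros(K)
alpha_true[rng.choice(K, size=3, replace=False)] = rng.normal(size=3)
x = D @ alpha_true + 0.01 * rng.normal(size=M)

# sklearn's Lasso minimizes (1/(2M)) ||x - D a||_2^2 + lam ||a||_1,
# i.e. the l1-sparse coding objective up to a rescaling of lambda
lasso = Lasso(alpha=0.01, fit_intercept=False)
lasso.fit(D, x)
alpha_hat = lasso.coef_                       # sparse code of x in the dictionary D
print(np.count_nonzero(np.abs(alpha_hat) > 1e-6))
```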

Latent Linear Models for Count Data

Admixture and Topic Models

The models described in Section 1.1 are designed for continuous data. Similar techniques are often desirable for count data, i.e. non-negative and discrete data, which often appears when working, e.g., with text or images. Directly applying these models to count data does not work in practice (in the noiseless setting): the equality $\mathbf{x} = \mathbf{D}\boldsymbol{\alpha}$ is only possible when both $\mathbf{D}$ and $\boldsymbol{\alpha}$ are discrete and usually non-negative. Moreover, negative values, which can appear in the latent factor vectors or the factor loading matrix, create interpretation problems [Buntine and Jakulin, 2006]. To fix this, one could transform count data into continuous data, e.g., using the term frequency–inverse document frequency (tf-idf) values for text documents [Baeza-Yates and Ribeiro-Neto, 1999]. However, this does not solve the interpretability issues.
Topic Models. An algebraic adaptation of PCA to discrete data is well known as non-negative matrix factorization (NMF) [Lee and Seung, 1999, 2001]. NMF with the KL-divergence as the objective function is equivalent to probabilistic latent semantic indexing (pLSI) [see Section 1.2.3; Hofmann, 1999a,b], which is probably the simplest and historically one of the first probabilistic topic models. Topic models can be seen as probabilistic latent (linear) models adapted to count data. Latent Dirichlet allocation (LDA) [Blei et al., 2003] is probably the most widely used topic model and extends pLSI from a discrete mixture model to an admixture model (see below). In fact, it was shown that pLSI is a special case of LDA [Girolami and Kabán, 2003]. Buntine and Jakulin [2006] propose to use the umbrella term discrete component analysis for these and related models. Indeed, all these models enforce constraints on the latent variables and the linear transformation matrix to preserve the non-negativity and discreteness which are intrinsic to count data.
Admixture Model for Count Data. A natural extension of the models from Section 1.1 to count data is to replace the equality $\mathbf{x} = \mathbf{D}\boldsymbol{\alpha}$ with an equality in expectation, $\mathbb{E}(\mathbf{x} \mid \boldsymbol{\alpha}) = \mathbf{D}\boldsymbol{\alpha}$, which gives an admixture model [Pritchard et al., 2000]:
$$\boldsymbol{\alpha} \sim \mathrm{PD}_{\boldsymbol{\alpha}}(\mathbf{c}_1), \qquad \mathbf{x} \mid \boldsymbol{\alpha} \sim \mathrm{PD}_{\mathbf{x}}(\mathbf{D}\boldsymbol{\alpha}, \mathbf{c}_2), \qquad \text{such that } \mathbb{E}(\mathbf{x} \mid \boldsymbol{\alpha}) = \mathbf{D}\boldsymbol{\alpha}, \qquad (1.12)$$
where $\mathrm{PD}_{\boldsymbol{\alpha}}(\cdot)$ is a continuous vector-valued probability density of the latent vectors $\boldsymbol{\alpha}$ with a hyper-parameter vector $\mathbf{c}_1$, and $\mathrm{PD}_{\mathbf{x}}(\cdot)$ is a discrete-valued probability distribution of the observation vector $\mathbf{x}$ conditioned on the latent vector $\boldsymbol{\alpha}$ with a hyper-parameter vector $\mathbf{c}_2$ [Buntine and Jakulin, 2005]. Admixture models are latent linear models since the expectation of the observation vector is equal to a linear transformation of the latent vector.
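One concrete instance of (1.12) takes independent gamma latent factors and conditionally independent Poisson counts, a Gamma-Poisson construction of the kind revisited later in the thesis (Chapter 3); the sketch below uses illustrative hyper-parameters and a particular shape/rate parametrization chosen for the example, and checks the linearity of the expectation.

```python
import numpy as np

rng = np.random.default_rng(8)
M, K, n = 10, 3, 100_000

D = rng.dirichlet(np.ones(M), size=K).T          # columns d_k in the simplex (topic-like)
c, b = np.full(K, 0.5), 1.0                      # gamma shape and rate (illustrative)

alpha = rng.gamma(c, 1.0 / b, size=(n, K))       # latent factors alpha_k ~ Gamma(c_k, b)
X = rng.poisson(alpha @ D.T)                     # counts: x_m ~ Poisson((D alpha)_m)

# conditionally, E(x | alpha) = D alpha; marginally, E(x) = D E(alpha) = D (c / b)
print(np.abs(X.mean(axis=0) - D @ (c / b)).max())
```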

Topic Models Terminology

Topic models [Steyvers and Griffiths, 2007, Blei and Lafferty, 2009, Blei, 2012] are probabilistic models that allow one to discover thematic information in text corpora and to annotate the documents using this information.
Although it is common to describe topic models using text modeling terminology, applications of topic models go far beyond information retrieval. For example, topic models were successfully applied in computer vision using the notion of visual words and the computer vision bag-of-words model [Sivic and Zisserman, 2003, Wang and Grimson, 2008, Sivic and Zisserman, 2009]. However, we restrict ourselves to the standard text corpora terminology, which we summarize below.
The vocabulary is the set of all the words in the language. The number $M$ of words in the vocabulary is called the vocabulary size. Each word is represented using the one-hot encoding over the $M$ words. In the literature, the name term is also used to refer to a word.
A document is a set $\{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_L\}$ of tokens $\mathbf{w}_\ell \in \mathbb{R}^M$, for $\ell \in [L]$, where a token is (the one-hot encoding of) some word from the vocabulary and $L$ is the length of the document. Two tokens in a document can be equal to the same word from the vocabulary, but words are unique. The bag-of-words model [Baeza-Yates and Ribeiro-Neto, 1999] assumes that the order of tokens in a document does not matter. The count vector $\mathbf{x} \in \mathbb{R}^M$ of a document is the vector whose $m$-th element is equal to the number of times the $m$-th word from the vocabulary appears in this document, i.e. $\mathbf{x} = \sum_{\ell=1}^{L} \mathbf{w}_\ell$.
The corpus is a set of $N$ documents. The count matrix $\mathbf{X}$ of this corpus is the $M \times N$ matrix whose $n$-th column is equal to the count vector $\mathbf{x}_n$ of the $n$-th document. The matrix $\mathbf{X}$ is sometimes also called the (word-document) co-occurrence matrix.
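To make the terminology concrete, the following sketch builds the count vector of each document as the sum of its one-hot token vectors and stacks them into the $M \times N$ count matrix $\mathbf{X}$; the toy vocabulary and corpus are invented for the example.

```python
import numpy as np

vocabulary = ["cat", "dog", "fish", "bird"]                          # M = 4 words
corpus = [["cat", "dog", "cat"], ["fish", "bird", "fish", "dog"]]    # N = 2 documents

M, N = len(vocabulary), len(corpus)
word_index = {w: m for m, w in enumerate(vocabulary)}

X = np.zeros((M, N), dtype=int)                  # word-document count matrix
for n, doc in enumerate(corpus):
    for token in doc:                            # each token is a one-hot word vector
        X[word_index[token], n] += 1             # x_n = sum of the one-hot tokens

print(X)   # column n is the count vector of document n
```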
There are $K$ topics in a model, where the $k$-th topic $\mathbf{d}_k$ is the parameter vector of a discrete distribution over the words in the vocabulary, i.e. $\mathbf{d}_k \in \Delta_M$ (see also Figure 1.1 for an example of topics displayed as their most probable words). The $m$-th element of such a vector indicates the probability with which the $m$-th word from the vocabulary appears in the $k$-th topic. The matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ obtained by stacking the topics together, $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \dots, \mathbf{d}_K]$, is called the topic matrix. Note that, in our notation, the index order is reversed: the topic matrix is of size $M \times K$.
We will always use the index $k \in [K]$ to refer to topics, the index $n \in [N]$ to refer to documents, the index $m \in [M]$ to refer to words from the vocabulary, and the index $\ell \in [L_n]$ to refer to the tokens of the $n$-th document.

Probabilistic Latent Semantic Indexing

Latent Semantic Indexing. LSI [Deerwester et al., 1990] is a linear algebra tool for mapping documents to a vector space of reduced dimensionality, the latent semantic space. LSI is obtained as a low-rank-$K$ approximation (see Section 2.1.3) of the (word-document) co-occurrence matrix. LSI is nearly equivalent to standard PCA: the only difference is that in LSI the documents are not centered (the mean is not subtracted) prior to computing the SVD of the co-occurrence matrix, which is normally done to preserve sparsity. The hope behind LSI is that words with a common meaning are mapped to roughly the same direction in the latent space, which allows one to compute meaningful association values between pairs of documents, even if the documents do not have any terms in common. However, LSI does not guarantee non-negative values in the latent space, which is undesirable for the interpretation of non-negative count data.
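A minimal sketch of LSI as a rank-$K$ truncated SVD of the (uncentered) word-document co-occurrence matrix; the matrix $\mathbf{X}$ here is random placeholder data rather than real counts.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.poisson(1.0, size=(500, 100)).astype(float)   # toy word-document counts

K = 10
U, s, Vt = np.linalg.svd(X, full_matrices=False)      # no centering, unlike PCA
X_k = U[:, :K] * s[:K] @ Vt[:K, :]                    # best rank-K approximation of X
doc_embeddings = (s[:K, None] * Vt[:K, :]).T          # documents in the latent semantic space
```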
Probabilistic Latent Semantic Indexing. A direct probabilistic extension of LSI is probabilistic latent semantic indexing (pLSI) [Hofmann, 1999a,b]. The pLSI model is a discrete mixture model and, similarly to the Gaussian mixture model (see Section 1.1.1), the latent variable of the pLSI model can take one of $K$ states, modeled as before by a $K$-vector $\mathbf{z}$ with the one-hot encoding. The observed variables are documents, modeled as $N$-vectors $\mathbf{i}$ with the one-hot encoding, and tokens, modeled as $M$-vectors $\mathbf{w}$ with the one-hot encoding.
The generative pLSI model of a token, in the so-called symmetric parametrization, (a) first picks a topic $\mathbf{z}$ and then, given this topic, (b) picks a document $\mathbf{i}$ and (c) picks a token $\mathbf{w}$ for the picked document from the discrete distribution characterized by the $k$-th topic $\mathbf{d}_k$, where $k$ is such that $z_k = 1$. This gives the following joint probability model:
$$p(\mathbf{i}, \mathbf{w}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{i} \mid \mathbf{z})\, p(\mathbf{w} \mid \mathbf{z}). \qquad (1.13)$$
The respective graphical model is illustrated in Figure 1-4a. It is interesting to notice that in this formulation pLSI can be seen as a model for multi-view data (see Section 1.3 and Chapter 4) and directly admits an extension to more than two or three views (see the explanation under the probabilistic interpretation of the non-negative CP decomposition of tensors in Section 2.1.2). Therefore, pLSI easily extends to model the co-occurrence of three and more variables. It is also well known [Gaussier and Goutte, 2005, Ding et al., 2006, 2008] that pLSI can be seen as a probabilistic interpretation of non-negative matrix factorization (NMF) [Lee and Seung, 1999, 2001]. Therefore, the mentioned multi-view extension of pLSI can be seen as a probabilistic interpretation of the non-negative canonical polyadic (NCP) decomposition of tensors (see Section 2.1.2).
The symmetric pLSI model makes the following two independence assumptions: (a) the bag-of-words assumption, i.e. the observation pairs $(\mathbf{i}, \mathbf{w})$ are assumed to be generated independently, and (b) the tokens $\mathbf{w}$ are generated conditionally independently of the specific document identity $\mathbf{i}$ given the latent class (topic) $\mathbf{z}$.
Such a model allows one to handle (a) polysemous words, i.e. words that may have multiple senses and multiple types of usage in different contexts, and (b) synonyms, i.e. different words that may have a similar meaning or denote the same context.
It is not difficult to show, with the help of the Bayes rule applied to $p(\mathbf{i} \mid \mathbf{z})$, that the joint density in (1.13) can be equivalently represented as
$$p(\mathbf{i}, \mathbf{w}) = p(\mathbf{i})\, p(\mathbf{w} \mid \mathbf{i}), \qquad p(\mathbf{w} \mid \mathbf{i}) = \sum_{\mathbf{z}} p(\mathbf{w} \mid \mathbf{z})\, p(\mathbf{z} \mid \mathbf{i}), \qquad (1.14)$$
which is known under the name of the asymmetric parametrization (see an illustration using the plate notation in Figure 1-4b).
The conditional distribution $p(\mathbf{z} \mid \mathbf{i})$ is the discrete distribution with parameter $\boldsymbol{\theta}_n \in \Delta_K$, where $n$ is such that $i_n = 1$:
$$p(\mathbf{z} \mid \mathbf{i}) = \prod_{k=1}^{K} \theta_{nk}^{z_k}.$$
Note that for each document there is one such parameter $\boldsymbol{\theta}_n$; one can form a matrix $\boldsymbol{\Theta} = [\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_N]$ of these parameters. Substituting $p(\mathbf{z} \mid \mathbf{i})$ into the expression for $p(\mathbf{w} \mid \mathbf{i})$, we obtain
$$p(\mathbf{w} \mid \mathbf{i}) = \sum_{k=1}^{K} \theta_{nk}\, p(\mathbf{w} \mid \mathbf{z}_k), \qquad (1.15)$$
where we use $\mathbf{z}_k$ to emphasize that this is the vector $\mathbf{z}$ with the $k$-th element equal to 1. Therefore, the conditional distribution of the tokens in a document is a mixture of the discrete distributions $p(\mathbf{w} \mid \mathbf{z}_k)$ over the vocabulary of words, with latent topics. Substituting the discrete distributions $p(\mathbf{w} \mid \mathbf{z}_k)$, with parameters $\mathbf{d}_k$, into the conditional distribution (1.15) then gives $\mathbb{E}(\mathbf{w} \mid \mathbf{i}) = \mathbf{D}\boldsymbol{\theta}_n$, so that pLSI fits the latent linear (admixture) framework of Section 1.2.1.
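As a minimal sketch of the asymmetric parametrization, the code below samples the tokens of one document from the mixture (1.15) given made-up topics $\mathbf{D}$ and document proportions $\boldsymbol{\theta}_n$, and checks that the empirical word frequencies approach $\mathbf{D}\boldsymbol{\theta}_n$.

```python
import numpy as np

rng = np.random.default_rng(10)
M, K, L = 6, 2, 50_000

D = rng.dirichlet(np.ones(M), size=K).T        # topic matrix, columns d_k in the simplex
theta_n = np.array([0.7, 0.3])                 # p(z | i) for document n

# for each token: pick a topic z ~ Mult(1, theta_n), then a word w ~ Mult(1, d_z)
topics = rng.choice(K, size=L, p=theta_n)
words = np.array([rng.choice(M, p=D[:, k]) for k in topics])

# empirical word frequencies approach p(w | i) = D theta_n
freq = np.bincount(words, minlength=M) / L
print(np.abs(freq - D @ theta_n).max())
```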

Table of contents:

1 Latent Linear Models 
1.1 Latent Linear Models for Single-View Data
1.1.1 Gaussian Mixture Models
1.1.2 Factor Analysis
1.1.3 Probabilistic Principal Component Analysis
1.1.4 Independent Component Analysis
1.1.5 Dictionary Learning
1.2 Latent Linear Models for Count Data
1.2.1 Admixture and Topic Models
1.2.2 Topic Models Terminology
1.2.3 Probabilistic Latent Semantic Indexing
1.2.4 Latent Dirichlet Allocation
1.2.5 Other Topic Models
1.3 Latent Linear Models for Multi-View Data
1.3.1 Probabilistic Canonical Correlation Analysis
1.4 Overcomplete Latent Linear Models
2 Tensors and Estimation in Latent Linear Models 
2.1 Tensors, Higher Order Statistics, and CPD
2.1.1 Tensors
2.1.2 The Canonical Polyadic Decomposition
2.1.3 Tensor Rank and Low-Rank Approximation
2.1.4 CP Uniqueness and Identifiability
2.2 Higher Order Statistics
2.2.1 Moments, Cumulants, and Generating Functions
2.2.2 CPD of ICA Cumulants
2.2.3 CPD of LDA Moments
2.3 Algorithms for the CP Decomposition
2.3.1 Algorithms for Orthogonal Symmetric CPD
2.3.2 Algorithms for Non-Orthogonal Non-Symmetric CPD
2.4 Latent Linear Models : Estimation and Inference
2.4.1 The Expectation Maximization Algorithm
2.4.2 Moment Matching Techniques
3 Moment Matching-Based Estimation in Topic Models 
3.1 Contributions
3.2 Related Work
3.3 Discrete ICA
3.3.1 Topic Models are PCA for Count Data
3.3.2 GP and Discrete ICA Cumulants
3.3.3 Sample Complexity
3.4 Estimation in the GP and DICA Models
3.4.1 Analysis of the Whitening and Recovery Error
3.5 Experiments
3.5.1 Datasets
3.5.2 Code and Complexity
3.5.3 Comparison of the Diagonalization Algorithms
3.5.4 The GP/DICA Cumulants vs. the LDA Moments
3.5.5 Real Data Experiments
3.6 Conclusion
4 Moment Matching-Based Estimation in Multi-View Models 
4.1 Contributions
4.2 Related Work
4.3 Non-Gaussian CCA
4.3.1 Non-Gaussian, Discrete, and Mixed CCA
4.3.2 Identifiability of Non-Gaussian CCA
4.3.3 The Proof of Theorem 4.3.1
4.4 Cumulants and Generalized Covariance Matrices
4.4.1 Discrete CCA Cumulants
4.4.2 Generalized Covariance Matrices
4.5 Estimation in Non-Gaussian, Discrete, and Mixed CCA
4.6 Experiments
4.6.1 Synthetic Count Data
4.6.2 Synthetic Continuous Data
4.6.3 Real Data Experiment – Translation Topics
4.7 Conclusion
5 Conclusion and Future Work 
5.1 Algorithms for the CP Decomposition
5.2 Inference for Semiparametric Models
A Notation
A.1 The List of Probability Distributions
B Discrete ICA
B.1 The Order-Three DICA Cumulant
B.2 The Sketch of the Proof for Proposition 3.3.1
B.2.1 Expected Squared Error for the Sample Expectation
B.2.2 Expected Squared Error for the Sample Covariance
B.2.3 Expected Squared Error of the Estimator ̂︀ S for the GP/DICA Cumulants
B.2.4 Auxiliary Expressions
C Implementation
C.1 Implementation of Finite Sample Estimators
C.1.1 Expressions for Fast Implementation of the LDA Moments Finite Sample Estimators
C.1.2 Expressions for Fast Implementation of the DICA Cumulants Finite Sample Estimators
C.2 Multi-View Models
C.2.1 Finite Sample Estimators of the DCCA Cumulants
