
## Justification of Empirical Risk Minimization

We now present the formalism of learning theory that is used to give a formal justification of the ERM principle. In order to introduce the hypotheses made by learning theory, we first consider an example. Consider a training dataset D = {(x_i, y_i)}_{i=1,…,N} of elements in X × Y. We consider the case where X = ℝ^d, i.e. the case where the input space is a vector space, and where the hypothesis class H is the set of all functions X → Y. There exist infinitely many hypotheses h ∈ H that describe the data correctly. For instance, consider the function that outputs y_i if the input is equal to x_i for some i ≤ N, and y_1 otherwise. By construction, this hypothesis is an empirical risk minimizer for the dataset D. It can be used directly for rote learning, in other words when the goal of the learning process is to remember the correct labeling of the set of training inputs. It is intuitively clear, however, that this hypothesis will perform extremely poorly on a task that requires generalization, i.e. if the learner is asked to classify new points outside of the training set. The purpose of statistical learning theory is to provide evidence that, under given conditions, ERM can perform well on a generalization task.
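To make the argument concrete, here is a minimal sketch of the memorizing hypothesis described above, on an invented toy dataset: it achieves zero empirical risk on D by construction, but answers arbitrarily outside the training set.

```python
# Sketch of the "rote learning" hypothesis from the text: it memorizes the
# training set, so it is an empirical risk minimizer by construction, yet it
# generalizes poorly. The dataset and labels are invented for illustration.

def make_rote_hypothesis(X_train, y_train):
    """Return h with h(x) = y_i if x == x_i for some i <= N, else y_1."""
    lookup = dict(zip(X_train, y_train))
    default = y_train[0]  # the arbitrary fallback label y_1
    return lambda x: lookup.get(x, default)

# Toy dataset in X = R^2: the "true" label is 1 iff the first coordinate is positive.
X_train = [(1.0, 0.0), (-1.0, 0.0), (2.0, 1.0)]
y_train = [1, 0, 1]

h = make_rote_hypothesis(X_train, y_train)

# Zero empirical risk on the training set...
train_errors = sum(h(x) != y for x, y in zip(X_train, y_train))
print(train_errors)  # 0

# ...but an arbitrary answer outside it: this new point "should" be labeled 0,
# yet h falls back to y_1 = 1.
print(h((-5.0, 0.0)))  # 1
```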

Describing generalization requires that we rigorously define the expected limits of this generalization. A key assumption in this definition is that the data points are drawn independently and identically distributed (i.i.d.). We assume the existence of a generating distribution P on X × Y. Based on this distribution, we define the risk of a hypothesis h ∈ H as:

R(h) = E_{(X,Y)∼P}[L(Y, h(X))]   (2.2)

where (X, Y) ∼ P means that the random variables X and Y are drawn from the distribution P. Based on this definition, the main goal of learning would be to minimize R(h) over h ∈ H, but this minimization is impossible when the distribution P is unknown. In practice, the learner only has access to the dataset D, which is a sample of N points drawn i.i.d. from P. The empirical risk defined above is thus an estimator of R(h) and can be used as a proxy for risk minimization. A fundamental question, however, is to determine how close a solution of the ERM principle is to the solution of risk minimization.
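The relationship between the risk R(h) and its empirical estimator can be illustrated numerically. The sketch below assumes a simple synthetic setting of our own choosing (X uniform on [0, 1], Y = 1 iff X > 0.5, the 0-1 loss, and a fixed hypothesis h(x) = 1 iff x > 0.4), for which the true risk is exactly R(h) = 0.1: the empirical risk concentrates around this value as N grows.

```python
import random

# Assumed synthetic setting (not from the text): X ~ Uniform[0, 1],
# Y = 1 iff X > 0.5, hypothesis h(x) = 1 iff x > 0.4, 0-1 loss.
# h errs exactly when X falls in (0.4, 0.5], hence the true risk R(h) = 0.1.

random.seed(0)

def h(x):
    return 1 if x > 0.4 else 0

def zero_one_loss(y, y_pred):
    return 0.0 if y == y_pred else 1.0

def empirical_risk(h, sample):
    """Average loss of h over a finite i.i.d. sample: the estimator of R(h)."""
    return sum(zero_one_loss(y, h(x)) for x, y in sample) / len(sample)

def draw(n):
    """Draw n i.i.d. points (x, y) from the generating distribution P."""
    sample = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        sample.append((x, y))
    return sample

true_risk = 0.1

# The empirical risk concentrates around R(h) as N grows (law of large numbers).
for n in (10, 1000, 100000):
    print(n, empirical_risk(h, draw(n)))
```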

### General Presentation of Analogy and Cognition

The importance of analogy in human cognition has been studied for decades and is closely related to the fundamental ability to extract the similarities between two concepts or situations and to transfer the characteristics of one to the other. This skill appears at the lowest degree of cognition, for instance in the fundamental task of perception. Children learn to discriminate between thousands of categories in their environment, but they are also able to tell the similarities from the differences. For instance, they will find many similarities between a dog and a cat, or even between a dog and another dog, but will also be able to understand them as different. They are also able to apply a “transformation” from one object to the other. For instance, they will understand that the transformation from a standard dog to a red dog can be applied to any other object that has similar characteristics, for instance a cat. More impressively, the perception of an object can depend on the context: an animal in a picture is not the same as a real animal, but children are able to map these two radically different entities while still being able to tell which one is real. This ability is characteristic of analogical reasoning: an interpretation, given by Structure Mapping Theory (described below), suggests that the child extracts common descriptive information and builds a mapping between the real animal and the drawn animal. The case of drawings is even more interesting, in the sense that sensitivity to “style” is a basic ability. A drawn dog will differ from one cartoonist to another, but this does not affect its perception. The transfer from one style to another is possible, and is even considered a required competence in many artistic domains (for instance, being able to paint in the style of a famous painter, to compose in the style of a composer, or to write in the style of a writer).
Melanie Mitchell proposes the example of music in (Mitchell, 2001): any two pieces by Mozart are superficially very different, but at some level we perceive a common essence. Likewise, the Beatles' rendition of “Hey Jude” and the version you might hear in the supermarket have little in common in terms of instrumentation, tempo, vocals, and other readily apparent musical features, but we easily recognize them as the same song.

A domain where analogy is hidden everywhere (though most of the time unconsciously) is natural language. In order to communicate ideas or concepts, we often rely on comparisons and analogies. For instance, it is not rare to hear expressions like “He is the Mozart of painting”, even though this expression is a priori very unclear. Does it mean that this painter is also a musician? That he died at the same age as Mozart did? We intuitively understand what is meant here: Mozart, in the context of music, is often considered a prodigy; transferring this characterization to the new context (painting), we understand that the painter of interest is to be considered a prodigy as well. This example displays two of the most fundamental components of analogical reasoning: on the one hand, it shows that we are able to transfer unified descriptions from one domain to another; on the other hand, we are even able to determine the domain when it is not specified. No indication was given that Mozart had to be considered in the scope of music, but this reading comes naturally. Mimicking this ability will be one of the objectives of connectionist models such as ACME or LISA (Section 3.2.4).

#### Evidence of Syntactic Priming

As a complement to this general introduction on analogy in human cognition, we now propose a brief presentation of the problem of syntactic priming, a kind of unconscious analogy observed in language production and interpretation. Syntactic priming is a well-known phenomenon in cognitive science which leads speakers to reuse previously encountered structures in sentence generation or comprehension. The phenomenon is observed in several grammatical choices, such as active vs. passive voice (“Cats eat mice” vs. “Mice are eaten by cats”), dative formation (“He bought the girl a book” vs. “He bought a book for the girl”) or relative clause attachment ambiguity (“I like the friend of my sister who plays the violin”: who plays the violin?). In the first two cases, it is observed that encountering one of the forms favors the reuse of the same form; in the case of relative clause attachment, the disambiguation is primed by a former non-ambiguous relative clause. This transfer is typically unconscious.

The first studies of the phenomenon emerged in the 1980s with pioneering works on the repetition of similar syntactic forms across successive utterances (Bock, 1986). For instance, some studies observe that if a speaker used a passive in a recent sentence, they are more likely to use one in future sentences (Weiner and Labov, 1983). These observations were not sufficient to conclude on the importance of syntactic priming, since they relied on corpus data and did not reflect a preference between two alternatives. Other phenomena could be at stake here: ease of repetition in lexical, thematic, or metrical aspects. (Bock, 1986) provides the first real evidence of syntactic priming in language production. The existence of syntactic priming has since been shown in multiple languages (Hartsuiker and Kolk, 1998), and for both written (Branigan, Pickering, and Cleland, 1999) and oral productions (Potter and Lombardi, 1998).

**Table of contents :**

Abstract

Acknowledgements

**1 Introduction: Knowledge Transfer in Artificial Intelligence **

1.1 Scope

1.2 Position and Contributions

1.3 Outline

1.4 List of Publications

**2 Preamble: The Problem of Learning **

2.1 Reminder on Supervised Learning

2.1.1 Problem and Notations

2.1.2 The No-Free-Lunch Theorem

2.1.3 Justification of Empirical Risk Minimization

2.1.4 Conclusion on Supervised Learning

2.2 Minimum Description Length and Minimum Message Length Principles

2.2.1 Learning as Compression

2.2.2 Introducing Minimum Description Length Principle

2.2.3 Introducing Minimum Message Length Principle

2.2.4 Conclusion on MDL and MML

2.3 Drifting Away from Supervised Learning

2.3.1 Unsupervised Domain Adaptation

2.3.2 Unsupervised Learning

2.3.3 Analogies

2.4 Conclusion

**I A Fundamental Problem: Analogical Reasoning **

**3 Introduction to Analogical Reasoning **

3.1 Analogies in Human Cognition

3.1.1 General Presentation of Analogy and Cognition

3.1.2 Evidence of Syntactic Priming

3.2 Formal Models of Analogical Reasoning

3.2.1 Logic Description

3.2.2 Analogical Proportion

3.2.3 Structure Mapping Theory

3.2.4 Connectionist Models

3.2.5 Copycat and Metacat

3.2.6 Analogies and Kolmogorov Complexity

3.3 Applications of Analogical Reasoning

3.3.1 Word Embedding and Analogical Proportion

3.3.2 Linguistic Analogies

3.3.3 Machine Learning Applications

3.4 Conclusion

**4 Minimum Description Length Analogies on Character Strings **

4.1 Introduction: Hofstadter’s Micro-World

4.1.1 Hofstadter’s Micro-World: Presentation and Discussion

4.1.2 An Application: Linguistic Analogies

4.1.3 Method Overview

4.2 Representation Bias for Hofstadter’s Problem

4.2.1 A Generative Language

4.2.2 Basic Operators

4.2.3 Using Memory

4.2.4 Remarks on the Language

4.3 Relevance of a Solution

4.3.1 Relevance: Problems and Intuitions

4.3.2 From Language to Code

4.3.3 Relevance of a Description

4.3.4 Relevance of a Solution for Analogical Equations

4.3.5 Validation

4.4 Perspectives: Finding an Optimal Representation

4.4.1 Syntactic Scanning and Semantic Phase

4.4.2 Rule Generation

4.4.3 World Mapping and Rule Slipping

4.4.4 Rule Execution

4.4.5 Cognitive Interpretation

4.5 Conclusion

**5 Minimum Complexity Analogies **

5.1 A General Description Language for Analogies?

5.1.1 Analogies in Structured Domains

5.1.2 Description Length and Memory Factorization

5.2 Descriptive Graphical Models

5.2.1 Description Length and Kolmogorov Complexity

5.2.2 A Key Property: The Chain Rule

5.2.3 Defining Graphical Models

5.2.4 Machine Restriction

5.2.5 Discussion: DGM and PGM

5.2.6 Algorithmic independence

5.2.7 Inference

5.3 Minimum Complexity Analogies

5.3.1 A Graphical Model for Analogical Reasoning

5.3.2 Application: Priming Effect

5.4 Conclusion

**6 Geometrical Analogies **

6.1 Building Analogies in Concept Spaces

6.1.1 Interpretation of the Parallelogram Rule

6.1.2 General Construction of a Parallelogram

6.2 Non-Euclidean Analogies

6.2.1 Intuition: Analogies on the Sphere

6.2.2 Non-Euclidean Analogies

6.2.3 Reminder: Riemannian Geometry

6.2.4 Non-Euclidean Analogies on Differential Manifolds

6.2.5 Proportional Analogies on Manifolds

6.3 Applications

6.3.1 Non-Euclidean Analogies in Fisher Manifold

6.3.1.1 Fisher Manifold

6.3.1.2 Experimental Results

6.3.2 Non-Euclidean Analogies in Curved Concept Spaces

6.4 Conclusion

**II From Analogy to Transfer Learning **

**7 Transfer Learning: An Introduction **

7.1 What is Transfer?

7.1.1 Examples of Transfer Learning Problems

7.1.1.1 Transfer Learning for Computer Vision

7.1.1.2 The Problem of “Small Data »

7.1.2 Background and Notations

7.1.3 Historical References and Related Problems

7.1.4 A Taxonomy of Transfer Learning Settings

7.2 Trends in Transfer Learning

7.2.1 Importance Sampling and Reweighting

7.2.2 Optimal Transport

7.2.3 Mapping and Learning Representations

7.3 A Central Question: When to Transfer?

7.3.1 Introducing Negative Transfer

7.3.2 Guarantees with Small Drifts

7.3.3 Characterizing Task Relatedness

7.4 Conclusion

**8 Transfer Learning with Minimum Description Length Principle **

8.1 Transductive Transfer Learning with Minimum Description Length Principle

8.1.1 Transductive Transfer and Analogy: Two Related Tasks?

8.1.2 What Analogy Suggests

8.1.3 Interpretation: A General Principle?

8.2 Defining Models

8.2.1 Probabilistic models

8.2.2 A prototype-based model

8.2.2.1 Model Complexity

8.2.2.2 Data Complexity

8.3 Validation of the Framework: A Prototype-based Algorithm

8.3.1 Measuring Complexity

8.3.1.1 Complexity of real numbers

8.3.1.2 Complexity of vectors

8.3.1.3 Complexity of prototype model transfer

8.3.2 Algorithm

8.3.2.1 A Class of Functions

8.3.2.2 Unlabeled Data Description without Transfer

8.3.2.3 Labeled Data Description without Transfer

8.3.2.4 Prototype-based Transductive Transfer with Simple Transformation

8.3.3 Measuring the quality of transfer

8.3.4 Toy examples

8.3.5 Results and discussion

8.4 Conclusion

**9 Beyond Transfer: Learning with Concern for Future Questions **

9.1 Supervised and Semi-Supervised Problems with Transfer and without Transfer

9.1.1 Supervised and Semi-Supervised Domain Adaptation

9.1.2 Absence of Transfer

9.2 Impossibility of Transfer?

9.2.1 Two Notions of Transferability

9.2.1.1 Learnability from Source Model

9.2.1.2 Properties of Learnability

9.2.1.3 Transferable Problems

9.2.2 Non-Transferability and Negative Transfer

9.3 Learning with Concern for Future Questions

9.3.1 Transfer to Multiple Targets

9.3.2 Transfer, Transduction and Induction: Which Links?

9.3.3 Learning with No Future in Mind

9.3.4 Including Priors over the Future

9.3.5 Some Priors for Future Questions

9.3.6 Discussion: A general learning paradigm?

9.4 Conclusion

**III Incremental Learning **

**10 From Transfer Learning to Incremental Learning **

10.1 Introduction: Learning in Streaming Environments

10.1.1 A Recent Problem: Stream Mining

10.1.2 Introducing Concept Drift

10.1.3 Passive and Active Methods

10.2 Minimum Complexity Transfer for Incremental Learning

10.2.1 Notations for Online Learning

10.2.2 A Graphical Model for Incremental Learning

10.2.3 Remark: Estimating the Models Online

10.2.4 Classes of Models

10.2.4.1 Active Methods

10.2.4.2 Passive Methods

10.3 Algorithms

10.3.1 Dealing with Previous Models

10.3.2 An Algorithm for Continuous Adaptation

10.3.3 Experimental Results

10.4 Conclusion

**11 Incremental Topic Modeling and Hybrid Recommendation **

11.1 Online Topic Modeling

11.1.1 Topic Modeling

11.1.2 Adaptive windowing for Topic Drift Detection

11.1.2.1 Principle

11.1.2.2 Algorithm

11.1.2.3 Theoretical guarantees

11.1.2.4 Nature of the drift

11.1.3 Experimental Results

11.1.3.1 Datasets

11.1.3.2 Setting of AWILDA

11.1.3.3 Evaluation

11.1.3.4 Comparison of AWILDA and its variants on Sd4

11.1.3.5 Performance of AWILDA on controlled datasets

11.1.3.6 Comparing AWILDA with online LDA

11.1.4 Discussion

11.2 Incremental Hybrid Recommendation

11.2.1 Online Hybrid Recommendation

11.2.2 From Incremental Matrix Factorization to Adaptive Collaborative Topic Modeling

11.2.3 Experimental Results

11.2.3.1 Datasets

11.2.3.2 Evaluation protocol

11.2.3.3 Compared Methods

11.2.3.4 Results and Discussion

11.3 Perspective: Coping with Reoccurring Drifts

11.3.1 Reoccurring Drifts

11.3.2 Drift Adaptation seen as a CBR Problem

11.3.2.1 General Process

11.3.2.2 Case Representation

11.3.2.3 Case Retrieval

11.3.2.4 Case Reuse

11.3.2.5 Case Revision

11.3.2.6 Case Retainment

11.3.3 Application to AWILDA

11.4 Conclusion

**12 U-shaped phenomenon in Incremental Learning **

12.1 Context: Language Acquisition

12.2 A Modeling Framework

12.2.1 Assumptions

12.2.2 A Complexity-Based Framework

12.2.3 Computing Complexities

12.2.3.1 Encoding the Grammar

12.2.3.2 Grammar Transfer

12.2.3.3 Encoding the Observations

12.3 Experimental Results

12.3.1 Causes of U-shaped Phenomenon

12.3.2 Finiteness of Memory

12.3.3 Uncorrected Mistakes

12.3.4 Discussion

12.4 Conclusion

**IV Information Transfer in Unsupervised Learning **

**13 Introduction to Multi-Source Clustering **

13.1 Reminder on Clustering

13.1.1 Definition and Issues

13.1.2 Families of Algorithms

13.1.3 Performance Measures

13.2 Multi-Source Clustering: An Overview

13.2.1 Overcoming the Individual Biases

13.2.2 Clustering in Distributed Environments

13.2.3 Multi-view Data

13.2.4 The Solution of Multi-Source Clustering

13.3 Cooperative Clustering

13.3.1 Consensus Based on Objects Co-Occurrence

13.3.2 Consensus Based on Median Partition

13.3.3 Discussion

13.4 Collaborative Clustering

13.5 Conclusion

**14 Complexity-based Multisource Clustering**

14.1 Graphical Model for Unsupervised Collaboration

14.1.1 Notations

14.1.2 A Model for Collaboration

14.2 Complexity of Local Clustering

14.2.1 Complexity of Prototype-Based Models

14.2.2 Complexity of Probabilistic Models

14.2.3 Complexity of Density-Based Models

14.2.4 Complexity of Other Models

14.3 Algorithm for Collaborative Clustering

14.3.1 Forgetting Consensus

14.3.2 Global Approach

14.3.3 Solution Mapping

14.3.4 Mapping Optimization

14.3.5 Dealing with Sparsity

14.4 Experimental Validation

14.4.1 Datasets

14.4.2 Experimental Results

14.5 Conclusion

**15 Can clustering algorithms collaborate? **

15.1 Collaboration: A Difficult Concept in the Absence of Supervision

15.2 Selecting the Best Collaborators

15.2.1 Introducing the problem

15.2.2 Optimizing the Collaboration

15.2.3 Discussion

15.3 Stability of Collaborative Clustering

15.3.1 Reminder: Clustering Stability

15.3.2 Definitions: Collaborative Clustering

15.3.3 Stability of Collaborative Clustering

15.3.4 Perspectives

15.4 Conclusion

**V Conclusion **

**16 Conclusion **

16.1 Contributions

16.1.1 General Contributions

16.1.2 Local Contributions

16.1.2.1 Analogical Reasoning

16.1.2.2 Transfer Learning

16.1.2.3 Data Stream Mining

16.1.2.4 A Cognitive Model

16.1.2.5 Multi-Source Clustering

16.2 Perspectives and Future Works

**A Experiment on Hofstadter’s Analogies **

A.1 Experiment Protocol

A.2 Filtering Results

A.3 Detailed Results

A.3.1 Raw Results

A.3.2 Ages

A.3.3 Results by Question

**B Résumé en Français** (Summary in French)

B.1 A Fundamental Problem: Analogical Reasoning

B.1.1 Introduction to Analogical Reasoning

B.1.2 Minimum Description Length Analogies on Character Strings

B.1.3 Minimum Complexity Analogies

B.1.4 Geometrical Analogies

B.2 From Analogy to Transfer Learning

B.2.1 Introduction to Transfer Learning

B.2.2 Transfer Learning and the Minimum Description Length Principle

B.2.3 Beyond Transfer: Learning with Concern for Future Questions

B.3 Incremental Learning

B.3.1 From Transfer Learning to Incremental Learning

B.3.2 Incremental Hybrid Recommendation

B.3.3 U-shaped Incremental Learning

B.4 Information Transfer in Unsupervised Learning

B.4.1 Introduction to Multi-Source Clustering

B.4.2 Minimum Complexity Multi-Source Clustering

B.4.3 Can Clustering Algorithms Collaborate?

**Bibliography**