Minimum Description Length Analogies on Character Strings 


Justification of Empirical Risk Minimization

We now present the formalism of learning theory that is employed to give a formal justification of the ERM principle. In order to introduce the hypotheses made by learning theory, we first consider an example. Consider a training dataset $D = \{(x_i, y_i)\}_{i=1,\ldots,N}$ of elements in $\mathcal{X} \times \mathcal{Y}$. We consider the case where $\mathcal{X} = \mathbb{R}^d$, i.e., the case where the input space is a vector space. Suppose that the hypothesis class $\mathcal{H}$ is the set of all functions $\mathcal{X} \to \mathcal{Y}$. There exist infinitely many hypotheses $h \in \mathcal{H}$ that describe the data correctly, for instance the function that outputs $y_i$ if the input is equal to $x_i$ for some $i \leq N$, and $y_1$ otherwise. By construction, this hypothesis is an empirical risk minimizer for the dataset $D$. It can be used directly for rote learning, in other words when the goal of the learning process is to remember the correct labeling of the set of training inputs. It is intuitively clear that this hypothesis will perform extremely poorly on a task that requires generalization, i.e., if the learner is asked to classify new points outside of the training set. The purpose of statistical learning theory is to provide evidence that ERM can perform well on a generalization task under given conditions.
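To make this failure mode concrete, here is a minimal sketch in Python of the memorizing hypothesis described above (the helper name make_rote_hypothesis and the toy data are ours, for illustration only): by construction it attains zero empirical risk on $D$, yet behaves like a constant predictor on unseen points.

```python
import numpy as np

def make_rote_hypothesis(X_train, y_train):
    """The memorizing hypothesis: output y_i if the input equals
    x_i for some training point, and y_1 otherwise."""
    def h(x):
        for xi, yi in zip(X_train, y_train):
            if np.array_equal(x, xi):
                return yi
        return y_train[0]
    return h

# Toy data: the label is the sign of the first coordinate.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = (X_train[:, 0] > 0).astype(int)

h = make_rote_hypothesis(X_train, y_train)

# Zero empirical risk by construction...
train_err = np.mean([h(x) != y for x, y in zip(X_train, y_train)])
# ...but chance-level risk on fresh points from the same distribution.
X_test = rng.normal(size=(1000, 2))
y_test = (X_test[:, 0] > 0).astype(int)
test_err = np.mean([h(x) != y for x, y in zip(X_test, y_test)])
print(train_err, test_err)  # e.g. 0.0 vs. roughly 0.5
```

On this toy task the memorizer's empirical risk is exactly zero while its risk is essentially that of a constant classifier; this is the gap between empirical risk and risk that the rest of the section formalizes.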
Describing generalization requires that we rigorously define the expected limits of this generalization. A key assumption in this definition is that the data points are drawn independently and identically distributed (i.i.d.). We assume the existence of a generating distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. Based on this distribution, we define the risk of a hypothesis $h \in \mathcal{H}$ as:

$$R(h) = \mathbb{E}_{(X,Y) \sim P}\left[L(Y, h(X))\right] \qquad (2.2)$$

where $(X,Y) \sim P$ means that the random variables $X$ and $Y$ are drawn from distribution $P$. Based on this definition, the main goal of learning would be to minimize $R(h)$ over $h \in \mathcal{H}$, but this minimization is impossible when the distribution $P$ is unknown. In practice, the learner only has access to the dataset $D$, which is a sample of $N$ points drawn i.i.d. from distribution $P$. The empirical risk defined above is thus an estimator of $R(h)$ and can be used as a proxy for risk minimization. A fundamental question, however, is to determine how close a solution of the ERM principle is to the solution of risk minimization.
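As a short complement (a standard argument, not spelled out in the text above), write the empirical risk as $\hat{R}_D(h) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, h(x_i))$; the i.i.d. assumption then makes it an unbiased estimator of the risk:

```latex
% Unbiasedness of the empirical risk under the i.i.d. assumption:
% each pair (x_i, y_i) is distributed according to P, so by linearity
% of expectation
\[
  \mathbb{E}_{D \sim P^N}\!\left[\hat{R}_D(h)\right]
  = \frac{1}{N} \sum_{i=1}^{N}
    \mathbb{E}_{(x_i, y_i) \sim P}\!\left[ L\big(y_i, h(x_i)\big) \right]
  = R(h).
\]
% The law of large numbers then gives, for any fixed hypothesis h,
% \hat{R}_D(h) -> R(h) almost surely as N -> infinity, which is what
% licenses using the empirical risk as a proxy for the risk.
```

Note that this convergence holds for each fixed $h$; the difficulty is to control the deviation uniformly over $\mathcal{H}$, which is precisely where the rote-learning hypothesis of the previous example fails.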

General Presentation of Analogy and Cognition

The importance of analogy in human cognition has been studied for decades and is closely related to the fundamental ability to extract the similarities between two concepts or situations and to transfer the characteristics of one to the other. This skill appears at the lowest levels of cognition, for instance in the fundamental task of perception. Children learn to discriminate between thousands of categories in their environment, but are also able to tell the similarities from the differences. For instance, they will find many similarities between a dog and a cat, or even between a dog and another dog, but will also be able to understand them as different. They are also able to apply a “transformation” from one object to another: for instance, they will grasp that the transformation from a standard dog to a red dog can be applied to any other object that has similar characteristics, for instance a cat. More impressively, the perception of an object can depend on the context: an animal in a picture is not the same as a real animal, but children are able to map these two radically different entities while being able to tell which one is real. This ability is characteristic of analogical reasoning: an interpretation, given by Structure Mapping Theory (described below), suggests that the child extracts common descriptive information and builds a mapping between the real animal and the drawn animal. The case of drawings is even more interesting, in the sense that sensitivity to “style” is a basic ability. A drawn dog will differ depending on the cartoonist, but this does not affect its perception. The transfer from one style to another is possible, and is even considered a required competence in many artistic domains (for instance, being able to paint in the style of a famous painter, to compose in the style of a composer, or to write in the style of a writer). Melanie Mitchell proposes the example of music in (Mitchell, 2001): any two pieces by Mozart are superficially very different, but at some level we perceive a common essence. Likewise, the Beatles' rendition of “Hey Jude” and the version you might hear in the supermarket have little in common in terms of instrumentation, tempo, vocals, and other readily apparent musical features, but we can easily recognize it as the same song.
A domain where analogy is hidden everywhere (though most of the time unconsciously) is natural language. In order to communicate ideas or concepts, we often rely on comparisons and analogies. For instance, it is not rare to hear expressions like “He is the Mozart of painting,” even though this expression is a priori very unclear. Does it mean that this painter is also a musician? That he died at the same age as Mozart did? We intuitively understand what is meant here: Mozart, in the context of music, is often considered a prodigy; transferring this characterization to the new context (painting), we understand that the painter of interest is to be considered a prodigy as well. This example displays two of the most fundamental components of analogical reasoning: on the one hand, it shows that we are able to transfer unified descriptions from one domain to another; on the other hand, we are even able to determine the domain when it is not specified. No indication was given that Mozart had to be considered in the scope of music, yet this reading comes naturally. Mimicking this ability will be one of the objectives of connectionist models such as ACME or LISA (Section 3.2.4).

Evidence of Syntactic Priming

As a complement to the general introduction on analogy in human cognition, we now propose a brief presentation of the problem of syntactic priming, a kind of unconscious analogy observed in language production and interpretation. Syntactic priming is a well-known phenomenon in cognitive science, whereby previously encountered structures are reused in sentence generation or comprehension. This phenomenon is observed in several grammatical choices, such as active vs. passive voice (“Cats eat mice” vs. “Mice are eaten by cats”), dative formation (“He bought the girl a book” vs. “He bought a book for the girl”), or relative clause attachment ambiguity (“I like the friend of my sister who plays the violin”: who plays the violin?). In the first two cases, it is observed that encountering one of the forms favors the reuse of the same form; in the case of relative clause attachment, the disambiguation is primed by a former non-ambiguous relative clause. This transfer is typically unconscious.
The first studies of the phenomenon emerged in the 1980s with pioneering work on the repetition of similar syntactic forms across successive utterances (Bock, 1986). For instance, some studies observed that if a speaker used a passive in a recent sentence, they are more likely to use one in subsequent sentences (Weiner and Labov, 1983). These observations were not sufficient to establish the importance of syntactic priming, since they relied on corpus data and did not reflect a preference between two alternatives. Other phenomena could be at stake here: ease of repetition in lexical, thematic, or metrical aspects. (Bock, 1986) provides the first real evidence of syntactic priming in language production. The existence of syntactic priming has been shown in multiple languages (Hartsuiker and Kolk, 1998), and for both written (Branigan, Pickering, and Cleland, 1999) and oral production (Potter and Lombardi, 1998).

Table of Contents

Abstract
Acknowledgements
1 Introduction: Knowledge Transfer in Artificial Intelligence 
1.1 Scope
1.2 Position and Contributions
1.3 Outline
1.4 List of Publications
2 Preamble: The Problem of Learning 
2.1 Reminder on Supervised Learning
2.1.1 Problem and Notations
2.1.2 The No-Free-Lunch Theorem
2.1.3 Justification of Empirical Risk Minimization
2.1.4 Conclusion on Supervised Learning
2.2 Minimum Description Length and Minimum Message Length Principles
2.2.1 Learning as Compression
2.2.2 Introducing Minimum Description Length Principle
2.2.3 Introducing Minimum Message Length Principle
2.2.4 Conclusion on MDL and MML
2.3 Drifting Away from Supervised Learning
2.3.1 Unsupervised Domain Adaptation
2.3.2 Unsupervised Learning
2.3.3 Analogies
2.4 Conclusion
I A Fundamental Problem: Analogical Reasoning 
3 Introduction to Analogical Reasoning 
3.1 Analogies in Human Cognition
3.1.1 General Presentation of Analogy and Cognition
3.1.2 Evidence of Syntactic Priming
3.2 Formal Models of Analogical Reasoning
3.2.1 Logic Description
3.2.2 Analogical Proportion
3.2.3 Structure Mapping Theory
3.2.4 Connectionist Models
3.2.5 Copycat and Metacat
3.2.6 Analogies and Kolmogorov Complexity
3.3 Applications of Analogical Reasoning
3.3.1 Word Embedding and Analogical Proportion
3.3.2 Linguistic Analogies
3.3.3 Machine Learning Applications
3.4 Conclusion
4 Minimum Description Length Analogies on Character Strings 
4.1 Introduction: Hofstadter’s Micro-World
4.1.1 Hofstadter’s Micro-World: Presentation and Discussion
4.1.2 An Application: Linguistic Analogies
4.1.3 Method Overview
4.2 Representation Bias for Hofstadter’s Problem
4.2.1 A Generative Language
4.2.2 Basic Operators
4.2.3 Using Memory
4.2.4 Remarks on the Language
4.3 Relevance of a Solution
4.3.1 Relevance: Problems and Intuitions
4.3.2 From Language to Code
4.3.3 Relevance of a Description
4.3.4 Relevance of a Solution for Analogical Equations
4.3.5 Validation
4.4 Perspectives: Finding an Optimal Representation
4.4.1 Syntactic Scanning and Semantic Phase
4.4.2 Rule Generation
4.4.3 World Mapping and Rule Slipping
4.4.4 Rule Execution
4.4.5 Cognitive Interpretation
4.5 Conclusion
5 Minimum Complexity Analogies 
5.1 A General Description Language for Analogies?
5.1.1 Analogies in Structured Domains
5.1.2 Description Length and Memory Factorization
5.2 Descriptive Graphical Models
5.2.1 Description Length and Kolmogorov Complexity
5.2.2 A Key Property: The Chain Rule
5.2.3 Defining Graphical Models
5.2.4 Machine Restriction
5.2.5 Discussion: DGM and PGM
5.2.6 Algorithmic independence
5.2.7 Inference
5.3 Minimum Complexity Analogies
5.3.1 A Graphical Model for Analogical Reasoning
5.3.2 Application: Priming Effect
5.4 Conclusion
6 Geometrical analogies 
6.1 Building Analogies in Concept Spaces
6.1.1 Interpretation of the Parallelogram Rule
6.1.2 General Construction of a Parallelogram
6.2 Non-Euclidean Analogies
6.2.1 Intuition: Analogies on the Sphere
6.2.2 Non-Euclidean Analogies
6.2.3 Reminder: Riemannian Geometry
6.2.4 Non-Euclidean Analogies on Differential Manifolds
6.2.5 Proportional Analogies on Manifolds
6.3 Applications
6.3.1 Non-Euclidean Analogies in Fisher Manifold
6.3.1.1 Fisher Manifold
6.3.1.2 Experimental Results
6.3.2 Non-Euclidean Analogies in Curved Concept Spaces
6.4 Conclusion
II From Analogy to Transfer Learning 
7 Transfer Learning: An Introduction 
7.1 What is Transfer?
7.1.1 Examples of Transfer Learning Problems
7.1.1.1 Transfer Learning for Computer Vision
7.1.1.2 The Problem of “Small Data”
7.1.2 Background and Notations
7.1.3 Historical References and Related Problems
7.1.4 A Taxonomy of Transfer Learning Settings
7.2 Trends in Transfer Learning
7.2.1 Importance Sampling and Reweighting
7.2.2 Optimal Transport
7.2.3 Mapping and Learning Representations
7.3 A Central Question: When to Transfer?
7.3.1 Introducing Negative Transfer
7.3.2 Guarantees with Small Drifts
7.3.3 Characterizing Task Relatedness
7.4 Conclusion
8 Transfer Learning with Minimum Description Length Principle 
8.1 Transductive Transfer Learning with Minimum Description Length Principle
8.1.1 Transductive Transfer and Analogy: Two Related Tasks?
8.1.2 What Analogy Suggests
8.1.3 Interpretation: A General Principle?
8.2 Defining Models
8.2.1 Probabilistic models
8.2.2 A prototype-based model
8.2.2.1 Model Complexity
8.2.2.2 Data Complexity
8.3 Validation of the Framework: A Prototype-based Algorithm
8.3.1 Measuring Complexity
8.3.1.1 Complexity of real numbers
8.3.1.2 Complexity of vectors
8.3.1.3 Complexity of prototype model transfer
8.3.2 Algorithm
8.3.2.1 A Class of Functions
8.3.2.2 Unlabeled Data Description without Transfer
8.3.2.3 Labeled Data Description without Transfer
8.3.2.4 Prototype-based Transductive Transfer with Simple Transformation
8.3.3 Measuring the quality of transfer
8.3.4 Toy examples
8.3.5 Results and discussion
8.4 Conclusion
9 Beyond Transfer: Learning with Concern for Future Questions 
9.1 Supervised and Semi-Supervised Problems with Transfer and without Transfer
9.1.1 Supervised and Semi-Supervised Domain Adaptation
9.1.2 Absence of Transfer
9.2 Impossibility of Transfer?
9.2.1 Two Notions of Transferability
9.2.1.1 Learnability from Source Model
9.2.1.2 Properties of Learnability
9.2.1.3 Transferable Problems
9.2.2 Non-Transferability and Negative Transfer
9.3 Learning with Concern for Future Questions
9.3.1 Transfer to Multiple Targets
9.3.2 Transfer, Transduction and Induction: Which Links?
9.3.3 Learning with No Future in Mind
9.3.4 Including Priors over the Future
9.3.5 Some Priors for Future Questions
9.3.6 Discussion: A general learning paradigm?
9.4 Conclusion
III Incremental Learning 
10 From Transfer Learning to Incremental Learning 
10.1 Introduction: Learning in Streaming Environments
10.1.1 A Recent Problem: Stream Mining
10.1.2 Introducing Concept Drift
10.1.3 Passive and Active Methods
10.2 Minimum Complexity Transfer for Incremental Learning
10.2.1 Notations for Online Learning
10.2.2 A Graphical Model for Incremental Learning
10.2.3 Remark: Estimating the Models Online
10.2.4 Classes of Models
10.2.4.1 Active Methods
10.2.4.2 Passive Methods
10.3 Algorithms
10.3.1 Dealing with Previous Models
10.3.2 An Algorithm for Continuous Adaptation
10.3.3 Experimental Results
10.4 Conclusion
11 Incremental Topic Modeling and Hybrid Recommendation 
11.1 Online Topic Modeling
11.1.1 Topic Modeling
11.1.2 Adaptive windowing for Topic Drift Detection
11.1.2.1 Principle
11.1.2.2 Algorithm
11.1.2.3 Theoretical guarantees
11.1.2.4 Nature of the drift
11.1.3 Experimental Results
11.1.3.1 Datasets
11.1.3.2 Setting of AWILDA
11.1.3.3 Evaluation
11.1.3.4 Comparison of AWILDA and its variants on Sd4
11.1.3.5 Performance of AWILDA on controlled datasets
11.1.3.6 Comparing AWILDA with online LDA
11.1.4 Discussion
11.2 Incremental Hybrid Recommendation
11.2.1 Online Hybrid Recommendation
11.2.2 From Incremental Matrix Factorization to Adaptive Collaborative Topic Modeling
11.2.3 Experimental Results
11.2.3.1 Datasets
11.2.3.2 Evaluation protocol
11.2.3.3 Compared Methods
11.2.3.4 Results and Discussion
11.3 Perspective: Coping with Reoccurring Drifts
11.3.1 Reoccurring Drifts
11.3.2 Drift Adaptation seen as a CBR Problem
11.3.2.1 General Process
11.3.2.2 Case Representation
11.3.2.3 Case Retrieval
11.3.2.4 Case Reuse
11.3.2.5 Case Revision
11.3.2.6 Case Retainment
11.3.3 Application to AWILDA
11.4 Conclusion
12 U-shaped phenomenon in Incremental Learning 
12.1 Context: Language Acquisition
12.2 A Modeling Framework
12.2.1 Assumptions
12.2.2 A Complexity-Based Framework
12.2.3 Computing Complexities
12.2.3.1 Encoding the Grammar
12.2.3.2 Grammar Transfer
12.2.3.3 Encoding the Observations
12.3 Experimental Results
12.3.1 Causes of U-shaped Phenomenon
12.3.2 Finiteness of Memory
12.3.3 Uncorrected Mistakes
12.3.4 Discussion
12.4 Conclusion
IV Information Transfer in Unsupervised Learning 
13 Introduction to Multi-Source Clustering 
13.1 Reminder on Clustering
13.1.1 Definition and Issues
13.1.2 Families of Algorithms
13.1.3 Performance Measures
13.2 Multi-Source Clustering: An Overview
13.2.1 Overcoming the Individual Biases
13.2.2 Clustering in Distributed Environments
13.2.3 Multi-view Data
13.2.4 The Solution of Multi-Source Clustering
13.3 Cooperative Clustering
13.3.1 Consensus Based on Objects Co-Occurrence
13.3.2 Consensus Based on Median Partition
13.3.3 Discussion
13.4 Collaborative Clustering
13.5 Conclusion
14 Complexity-based Multisource Clustering
14.1 Graphical Model for Unsupervised Collaboration
14.1.1 Notations
14.1.2 A Model for Collaboration
14.2 Complexity of Local Clustering
14.2.1 Complexity of Prototype-Based Models
14.2.2 Complexity of Probabilistic Models
14.2.3 Complexity of Density-Based Models
14.2.4 Complexity of Other Models
14.3 Algorithm for Collaborative Clustering
14.3.1 Forgetting Consensus
14.3.2 Global Approach
14.3.3 Solution Mapping
14.3.4 Mapping Optimization
14.3.5 Dealing with Sparsity
14.4 Experimental Validation
14.4.1 Datasets
14.4.2 Experimental Results
14.5 Conclusion
15 Can clustering algorithms collaborate? 
15.1 Collaboration: A Difficult Concept in the Absence of Supervision
15.2 Selecting the Best Collaborators
15.2.1 Introducing the problem
15.2.2 Optimizing the Collaboration
15.2.3 Discussion
15.3 Stability of Collaborative Clustering
15.3.1 Reminder: Clustering Stability
15.3.2 Definitions: Collaborative Clustering
15.3.3 Stability of Collaborative Clustering
15.3.4 Perspectives
15.4 Conclusion
V Conclusion 
16 Conclusion 
16.1 Contributions
16.1.1 General Contributions
16.1.2 Local Contributions
16.1.2.1 Analogical Reasoning
16.1.2.2 Transfer Learning
16.1.2.3 Data Stream Mining
16.1.2.4 A Cognitive Model
16.1.2.5 Multi-Source Clustering
16.2 Perspectives and Future Works
A Experiment on Hofstadter’s Analogies 
A.1 Experiment Protocol
A.2 Filtering Results
A.3 Detailed Results
A.3.1 Raw Results
A.3.2 Ages
A.3.3 Results by Question
B Résumé en Français (Summary in French)
B.1 A Fundamental Problem: Analogical Reasoning
B.1.1 Introduction to Analogical Reasoning
B.1.2 Minimum Description Length Analogies on Character Strings
B.1.3 Minimum Complexity Analogies
B.1.4 Geometrical Analogies
B.2 From Analogy to Transfer Learning
B.2.1 Introduction to Transfer Learning
B.2.2 Transfer Learning with the Minimum Description Length Principle
B.2.3 Beyond Transfer: Learning with Concern for Future Questions
B.3 Incremental Learning
B.3.1 From Transfer Learning to Incremental Learning
B.3.2 Incremental Hybrid Recommendation
B.3.3 U-shaped Incremental Learning
B.4 Information Transfer in Unsupervised Learning
B.4.1 Introduction to Multi-Source Clustering
B.4.2 Complexity-based Multi-Source Clustering
B.4.3 Can Clustering Algorithms Collaborate?
Bibliography
