Models with analogical proportions
In this section, we will describe models that have made explicit use of analogical proportions.
Solving geometric problems
The ANALOGY program of Thomas Evans [Eva64] is one of the pioneer works in the design of programs capable of analogical reasoning. ANALOGY, written in LISP2, is able to solve analogical equations in the form of geometrical problems, such as that of Figure 1.1: given three geometrical patterns, choose the fourth among a list of candidates that leads to the best proportion.
Figure 1.1: A geometrical analogy problem: figure A is to figure B as figure C is to which candidate? The inputs to the program are rough, hand-built low-level descriptions of the figures. For example, simple geometric shape such as circles, rectangles or triangles are all described with one single primitive: SIMPLE_CLOSED_CURVE(…), where the arguments are the start and end of the (multiple) lines, and their curvatures. The program is then divided into two distinct parts.
The role of the first part is to build high level representations of the figures from these raw descriptions. After identifying coherent shapes (such as triangle, square, etc.) as independent objects, the program tries to find relations of the kind INSIDE(Obj1, Obj2) or ABOVE(Obj1, Obj3) between figures. Finally, similarities between every pair of objects are computed. The similarity measure is based on the transformability of the first object into the second one, using traditional geometrical transformations (rotation, scale, reflections, etc.). All the information computed by the first part are given as input (on punched cards!) to the second part.
The second part relates to the analogical solving process per se. Using the output of the first part, ANALOGY will try to find a set of rules that transform figure A into figure B. A rule would for example state REMOVE(Obj1), ROTATE(Obj2, 45◦), etc. Then, by generalizing each rule, it tries to establish a correspondence between these rules and those transforming C into one of the candidate solutions. The chosen solution is the one that maximizes the resemblance between the two sets of rules.
An interesting fact is that ANALOGY analyses the way to go from A to B and applies it to go from C to D, but it does not make use of central permutation3: it may just as well analyze the way to go from A to C and transpose it to go from B to D. Note also that in ANALOGY the semantics behind the relations and the geometrical objects are not taken into account. In this respect, ANALOGY is close to Gentner’s Structure Mapping Theory (but preexisting by far).
Analogical reasoning as distance minimization on a vector space
At a time where most models of analogy used to take the form of complex computer programs (such as that of Evans), the two cognitive scientists David Rumelhart and Adele Abrahamsen proposed a simple theoretical model of analogical reasoning [RA05]. Their model is of great interest for us because as it will become clear, their view of analogy is in total accordance with our use of analogy in machine learning (or should we say more humbly that our use of analogy is fully compliant with their earlier model).
For Rumelhart and Abrahamsen, a human reasoning process can be defined by two compo-nents: a memory structure in which the remembered objects are stored, and an algorithm that manipulates these data to produce a result. In their paper, they define these two components for the analogical reasoning.
With regards to the memory structure, their assumption is that it can be considered as an m-dimensional Euclidean space. This assumption is originally that of Henley [Hen69] who showed, with the help of social experiments, that a set of 30 mammals could be fairly well represented in the 3-dimensional space R3 with the three axes ferocity, humanness, and size. It is supposed that the semantic similarity that people associate with two concepts is inversely proportional to their distance in the Euclidean space R3. For example, cow and horse should be fairly close, but rat would likely be far from both.
Rumelhart and Abrahamsen are interested in the problem of solving an analogical equation (although they never state it in these terms): given three concepts A, B, C and a set of candidate solutions D1, D2, • • • , Dn, which Di is the best candidate for A : B :: C : Di? Their assumption is the following: for a human agent, the best solution is the closest Di to the (potentially hypothetical) perfect solution I, defined as I = C − A + B. This principle is illustrated in figure 1.2. To assess the soundness of this hypothesis, diverse social experiments are led by the authors. For example, students are asked to choose the best candidates for various randomly generated analogies, e.g. rat is to pig as goat is to [ chimpanzee, cow, rabbit, sheep].
Figure 1.2: Analogical equation solving process as in [RA05]. The solution is here D2.
As it will become clear, this idea that an analogy A : B :: C : D is better when D is close (as measured by a distance on a metric space) to the true solution I is entirely compatible with the notion of analogical dissimilarity that will be defined in Chapter 3.
More recently, various techniques have been proposed to generate a vector-space embedding for words from a large corpus of text documents, for example Word2Vect [MCCD13]. In these continuous spaces also, some interesting analogies between words can be found that give even more power to the works of Rumelhart and Abrahamsen. The most classical example of word-analogy in such a space is given by man is to woman as king is to queen, where the vector queen was found to be the one that is the closest to the vector king – man + woman, which is in perfect accordance to the idea of Rumelhart and Abrahamsen. Other meaningful analogies can be found, and some of them are illustrated in Figure 1.3. See [MYZ13] for more details on finding linguistic regularities in word space embeddings.
Figure 1.3: Analogies between words in the Word2Vect model. Image source: https://www. tensorflow.org/versions/master/tutorials/word2vec.
The Copycat program
Copycat [Mit93] (see also [HM+94]) is another famous program that performs analogical reason-ing tasks, introduced by Melanie Mitchell and Douglas Hofstadter. The problems considered by Copycat are the solving of analogical equations in a microworld of strings of letters. A typical Copycat problem looks as follows: abc : abd :: ijk : x.
What should the value of x be? Various answers may be relevant here, such as ijd (replace the right-most letter by d), but the most natural answer probably is ijl: replace the right-most letter by its successor. How about zrq? Well, not really convincing. The point of the authors, however, is that a priori every single option should be given equal chances of success.
This principle is strongly reflected in the Copycat program which is probabilistic in nature: for the same problem, various solutions can be found depending on the initial condition. For example the equation abc : abd :: mrrjjj : x leads to mrrkkk in 70% of the cases (over 1000 experiments), and to mrjjk 20% of the time. Other less plausible solutions make up the remaining 10%.
At the beginning of the program, each option is equally available. Then, on the basis of their validity, some hypotheses are given a stronger chance to survive till the end of the program, while others are discarded. Conceptually, this process emerges from the interoperability of three main components: the workspace, the codelets, and the temperature.
• The workspace, is the place where objects and relations live. In the workspace, diverse conceptual structures are built-in: successor, predecessor, left-most, right-most, orientation, etc. When the program starts, none of these structures are activated: this will be the role of the codelets.
• The codelets, are competing agents trying to explore and build perceptual structures and relations between the objects in the workspace. Their behaviour depends on the temperature.
• The temperature could be also defined as the entropy of the system. When relations in the workspace are strong and well established, the temperature is low. At the beginning of the program, the temperature is the highest, leading to a multitude of codelets being run in every single direction. This concept of temperature can be viewed as the trade-off between exploration (trying every possible hypothesis) and exploitation (the actual use of a well established hypothesis). When the temperature goes below a given threshold, the program stops and the current hypothesis is outputted.
Figure 1.4: Snapshot of Copycat during an equation solving process. Image taken from [Mit01].
Figure 1.4 illustrates the internal state of Copycat during the solving of abc : abd :: mrrjjj : x. 195 codelets have run so far, so the temperature is average and some structure have been properly established, such as the successor group abc and the sameness group jjj. The rr group is also being defined, though still weaker than the others at this stage. Also, a rule describing the change from abc to abd has been established: replace the right-most letter by its successor. Depending on the future behaviour of the codelets, this rule with either stick till the end of the program or change to lead to the most common prediction x = mrrkkk.
Copycat is clearly a complex adaptative system where a global behaviour emerges from small, independent parts. According to its authors, Copycat’s architecture is neither symbolic nor connectionist, nor a hybrid of the two; rather, the program has a novel type of architecture situated somewhere in between these extremes.
Analogy and the Minimum Description Length Principle
Ockham’s razor (also Occam), due to the Franciscan philosopher William of Ockham (1285 – 1347), is a well known principle in machine learning theory. The main and most useful interpretation of the original Latin version states that when trying to explain a situation, if two hypotheses give the same answer then the best one is probably the simplest one. In practice, what makes an hypothesis simple remains quite vague, at least from a computational point of view. Yet this principle has been formalized into Rissanen’s Minimum Description Length Principle (MDLP) [Ris78], which is based on Kolmogorov complexity. Despite the difficulty to build inferential models from this principle (Kolmogorov complexity is often intractable and impossible to com-pute, and can only be estimated), it has shown to be quite influential in the field of machine learning, at least from a theoretical point of view.
With this in mind, Antoine Cornuéjols proposed a framework for assessing the quality of an analogy [Cor96b] (see also [Cor96a]). In these papers, Cornuéjols hypothesizes that the best analogy between a source and a target model is the simplest one, meaning that its description length is minimal, in terms of Kolmogorov complexity.
Let us first recall some basic knowledge about Kolmogorov complexity, before diving into more technical details. The Kolmogorov complexity of a string of characters x, denoted K(x), is the length of the shortest computer program capable of outputting x on a universal Turing machine. K(x) is supposed to capture the intrinsic complexity of x. Intuitively, K(0aaaaabbbbb0) is supposed to be lower than K(0abaabbabab0), because a clear pattern emerges in the first string, leading to a simple program: first print a five times, then do the same for b. The second string seems more or less random, which makes it difficult to factorize into a concise program. In some sense, the Kolmogorov complexity captures how well a string x can be compressed. The conditional complexity K(x | y) is the size of the shortest program that outputs x when given y as an input.
Now, let’s get back to our analogical concerns. As illustrated in figure 1.5, Cornuéjols considers an analogy as a process involving an object xS in a source domain, an object xT in a target domain, and two functions fS and fT transforming xS and xT into yS and yT respectively: yS = fS (xS ) and yT = fT (xT ). Each domain S and T lives inside a theory or model (namely MS and MT ), that can describe their corresponding objects.
For Cornuéjols, the best analogy is the one that minimizes the following sum of Kolmogorov complexities: K(MS ) + K(xS | MS ) + K(fS | MS ) + K(MT | MS ) + K(xT | MT ) + K(fT | MT ),
• K(MS ) is the cost associated with the source theory,
• K(xS | MS ) is the cost of xS as described in the source theory,
• K(fS | MS ) is the cost of fS as described in the source theory,
• K(MT | MS ) is the cost of describing the target theory from the source theory,
• K(xT | MT ) is the cost of xT as described in the target theory,
• and finally K(fT | MT ) is the cost of fT as described in the target theory.
Notice that the terms yi are not considered in this cost function, because they are entirely defined by their corresponding xi and fi.
Cornuéjols illustrates the plausibility of his model using experiments in the microworld of Copycat. After defining the Kolmogorov complexity of the built-in relations (direction, length, etc.), and those of the representations of the inputs, the solutions of the analogical equations are set as those minimizing the above criterion. The empirical results show that this model, in addition to its theoretical appeal due to the proximity with MDLP, is at the very least an interesting option that deserves further investigations. In a recent work, this model of analogy has been applied to a task of transfer learning [MC16].
Adopting a different (but close) approach, the association between Kolmogorov complexity and the evaluation of the quality of an analogy has also been recently advocated in [BPR12]. In this work, the authors propose that if four objects a, b, c, d are in proportion, then it should be as complicated to go from a to b as to go from c to d, and it should also be as complicated to go from b to a as to go from d to c. More formally: a : b :: c : d =⇒ K(b | a) = K(d | c) and K(a | b) = K(c | d).
This criterion was used to automatically discriminate good analogies from bad analogies, where objects were concepts represented as words of the natural language, and where their Kolmogorov complexity was roughly approximated using online search-engine queries. The experiments showed that the system was able to differentiate credible analogies from fishy ones in about 75% of the time.
We will here describe the field of Case-Based Reasoning (CBR). First of all, let us note that CBR is not literally a computational model of analogy, but is rather a general problem solving paradigm that makes use of some particular form of analogical reasoning. As discussed in [AP94], CBR and analogical reasoning are often mixed up in the literature. We choose to briefly describe CBR now, because we need to clarify how our approach to problem solving will differ from that of CBR.
CBR is a very general paradigm for problem solving, where a new problem is solved using the knowledge of some similar past experiences. This model can be motivated by its psychological appeal: it is indeed well acknowledged that, when confronted with an unknown situation, humans tend to make use of their past knowledge about some other similar situation to solve the problem at hand. As an example, the recipe of a strawberry pie could be easily inferred using the past knowledge on the recipe of other fruit pies (e.g. apple pie, pear pie, etc.).
In its most general form, CBR is a way of reaching the unknown solution S to a problem P using the known solutions Si to problems Pi that are similar to P . Each pair (Si, Pi) is called a case, hence the name of the paradigm. The term problem is here used in its most general sense. In [AP94], a CBR process is described by the succession of the following four steps:
1. Retrieve: this first step is to retrieve all cases (Si, Pi) where Pi is similar to the initial problem P . The way these cases are chosen is entirely dependent of the current setting, and greatly differ depending on the application. Sometimes, only a single case is selected.
2. Reuse: the solutions Si associated to the most relevant Pi are aggregated in some way, and adapted if needed to build up the final solution S.
3. Revise: the relevance and the adequacy of the chosen solution S are evaluated, and S is modified if needed.
4. Retain: if the revised solution S ended up being a good choice, then the solution is retained and the pair (P, S) is bound to be used in the solving of some future problems.
Before moving further to the next chapter, we would like to point out a significant difference between the CBR approach and the one we will address in this document. In a way, the CBR process can be viewed as building the following analogical proportions: Pi : Si :: P : S, where S is unknown and inferred from all the Si. We can indeed consider that problem Pi is to solution Si as problem P is to solution S. Each of these analogical proportions is built on the basis that Pi and P are similar, so S should be similar to Si.
But as we will see in the next chapter, an analogical proportion a : b :: c : d is not restricted to having a close to c and b close to d! It is quite the contrary actually: all elements are allowed to be extremely distinct. And indeed, if we are able to capture how a given Pi (that is not close to P ) differs from P , then applying the same transformation to Si should give us a relevant potential solution S. Acknowledging this fact, our approach to analogical learning will rather follow another conceptual scheme: we will try to retrieve the proportions of the form Pi1 : Pi2 :: Pi3 : P , and the solution S will be inferred by extrapolation4 from the associated equations: Si1 : Si2 :: Si3 : S. This allows us to consider problems Pi that are both similar and different from P , which may allow to find more interesting solutions S.
This first chapter was devoted to the description of past works on the modeling of analogical reasoning. We chose here to distinguish between two groups of models: those that make explicit use of analogical proportions (either as a key component of the model or as a simple use-case example), and those that do not. Among the models that use analogical proportions, we will see that the model of Rumelhart and Abrahamsen (derived from a psychological perspective) will be in perfect accordance to our view of analogical classification, and analogical learning in general.
It is fairly clear that these past works are, with a few exceptions, mostly motivated by cognitive and psychological investigations. It is only recently that some authors [FPY95, LS96] have provided a formal framework to deal with analogical proportions in general algebraic settings, which opened the door to many computational applications. These formal analogical proportions are a key concept for our work. We will now review this topic in the next chapter.
Table of contents :
1 Computational models of analogical reasoning
1.1 Models without analogical proportions
1.1.1 First attempts by Pólya and Hesse
1.1.2 Analogy as a structural mapping
1.1.3 Logical view of analogical reasoning
1.2 Models with analogical proportions
1.2.1 Solving geometric problems
1.2.2 Analogical reasoning as distance minimization on a vector space
1.2.3 The Copycat program
1.2.4 Analogy and the Minimum Description Length Principle
1.2.5 Case-Based Reasoning
2 Formal analogical proportions
2.1 Formal definitions
2.1.1 The geometric and arithmetic proportions
2.1.2 Analogical proportions between sets
2.1.3 The Boolean proportion
2.2 Geometrical insights of the Boolean and arithmetic proportions
2.2.1 Equivalence classes of analogical proportions
2.2.2 The Boolean and arithmetic proportions as parallelograms
2.3 Analogical inference principle and equation solving
3 A functional definition of analogical classifiers
3.1 Analogical classification
3.1.1 Conservative classifier
3.1.2 Extended classifier
3.1.3 Analogical classifier: a functional definition
3.2 Theoretical properties of analogical classifiers
3.2.1 Study of convergence
3.2.2 Vapnik Chervonenkis (VC) dimension of analogical classifiers
3.2.3 Accuracy analysis
3.3 Experiments and empirical validation
3.3.1 Evaluation protocol
3.3.2 Comments and discussion
4 Background on recommender systems
4.1 Taxonomy of recommender systems
4.1.1 Content-based techniques
4.1.2 Collaborative filtering
4.1.3 Knowledge-based systems
4.2 Recommendation by rating prediction
4.2.1 Problem formalization and notation
4.2.2 Recommender system evaluation
4.3 Two collaborative filtering techniques: neighborhood-based, and matrix factorization 87
4.3.1 The neighborhood approach
4.3.2 Matrix factorization techniques
5 Analogical recommendation
5.1 An algorithm using arithmetic proportion
5.1.2 Experiments and results
5.2 A “clone”-based view of analogical recommendation
5.2.1 Two new analogical algorithms
5.2.2 Current advances in neighborhood-based techniques
5.2.3 Experiments and discussion
5.2.4 Towards an ordinal view of ratings
5.3 Mining analogical proportions
5.3.1 Association rules
5.3.2 Looking for analogies in an incomplete database
5.3.3 Experiments and discussion
6 Analogy-preserving functions
6.1 Extending a training set
6.1.1 Previous works on training set extension
6.1.2 Analogy preserving functions: a safe way to extend a training set
6.2 A complete characterization of analogy preserving functions
6.2.1 Preliminary background on Boolean functions
6.2.2 The class of Analogy Preserving functions
6.2.3 The class of Strongly Analogy Preserving functions
6.3 Beyond Boolean Analogy Preserving functions
6.3.1 Approximately AP functions and experiments
6.3.2 Extension to nominal attributes
6.3.3 Extension to real-valued functions
List of Tables
List of Figures
List of Definitions