Pattern structures and interval pattern structures
FCA provides a mathematically sound framework for dealing with binary data. However, to deal with more complex data within FCA one needs to binarize the data. The choice of an appropriate binarization can be very difficult and is often accompanied by a loss of information. Pattern structures were proposed to tackle this issue and to deal directly with complex data in compliance with the FCA theory [GK01].
Definition 6. Let $G$ be some set, $(D, \sqcap)$ be a meet-semilattice, and $\delta : G \to D$ be a mapping. Then a pattern structure [GK01] is a triple $P = (G, (D, \sqcap), \delta)$, provided that the set $\delta(G) = \{\delta(g) \mid g \in G\}$ generates a complete subsemilattice $(D_\delta, \sqcap)$ of $(D, \sqcap)$.
To ensure that the subsemilattice $(D_\delta, \sqcap)$ is complete, it is sufficient that one of the following conditions be satisfied: (i) $(D, \sqcap)$ is complete, or (ii) $G$ is finite.
For a pattern structure $(G, (D, \sqcap), \delta)$ the derivation operators, denoted $(\cdot)^\square$, that form a Galois connection are defined as follows: $A^\square := \bigsqcap_{g \in A} \delta(g)$ for $A \subseteq G$, and $d^\square := \{g \in G \mid d \sqsubseteq \delta(g)\}$ for $d \in D$.
$A^\square$ is the common description of all objects in $A$, while $d^\square$ is the set of objects whose descriptions subsume $d$. The (subsumption) order on the descriptions is given by $c \sqsubseteq d \Leftrightarrow c \sqcap d = c$.
Definition 7. A pattern concept of a pattern structure $(G, (D, \sqcap), \delta)$ is a pair $(A, d)$, where $A \subseteq G$ and $d \in D$, such that $A^\square = d$ and $d^\square = A$. $A$ is called the pattern extent and $d$ is called the pattern intent.
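The two derivation operators can be sketched in a few lines of code. This is our own minimal illustration, assuming descriptions are sets with set intersection as the meet; the objects and the mapping `delta` below are illustrative, not taken from the text.

```python
from functools import reduce

# Sketch of the derivation operators of a pattern structure (G, (D, ⊓), δ).
# Here descriptions are frozensets and the meet ⊓ is set intersection;
# the objects and the mapping `delta` are illustrative.

def meet(c, d):
    return c & d  # intersection: the meet on the powerset semilattice

def subsumes(c, d):
    return meet(c, d) == c  # c ⊑ d iff c ⊓ d = c

def extent_to_intent(A, delta):
    """A□: the meet of the descriptions of all objects in A."""
    return reduce(meet, (delta[g] for g in A))

def intent_to_extent(d, G, delta):
    """d□: the set of objects whose descriptions subsume d."""
    return {g for g in G if subsumes(d, delta[g])}

G = {"g1", "g2", "g3"}
delta = {"g1": frozenset("abc"), "g2": frozenset("ab"), "g3": frozenset("bc")}

d = extent_to_intent({"g1", "g3"}, delta)  # frozenset({'b', 'c'})
A = intent_to_extent(d, G, delta)          # {'g1', 'g3'}: (A, d) is a pattern concept
```

Applying the two operators in sequence closes the pair: here $(A, d)$ satisfies $A^\square = d$ and $d^\square = A$, so it is a pattern concept.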
Pattern structures can be defined on a wide variety of data types, e.g., sequences [CBK+17], partitions [BKN14a, CN14], convex polygons [BKRK17], graphs [KS05], parse thickets (i.e., syntactic trees augmented with arcs reflecting inter-sentence relations) [SGIK14], etc.
In this study we use a special type of pattern structures called interval pattern structures [KKN11b], which were introduced to deal with numerical data. The pattern concepts of interval pattern structures are called interval pattern concepts. Simply speaking, interval pattern intents are hyper-rectangles in a multidimensional space. In Chapter 5 we propose a pattern mining approach where we use these hyper-rectangles as patterns.
Within the formalism of pattern structures, we may represent numerical data as a many-valued context $K_{num} = (G, S, W, I_{num})$, where $(g, s, w) \in I_{num}$ means that object $g$ has value $w$ for attribute $s$.
Considering a dataset over $m$ numerical attributes, a pattern intent is a tuple of $m = |S|$ intervals, i.e., ranges of numerical values defined by a lower bound and an upper bound, denoted $[l, u]$. Let $c = \langle [a_i, b_i] \rangle_{i \in \{1, \dots, m\}}$ and $d = \langle [e_i, f_i] \rangle_{i \in \{1, \dots, m\}}$ be two interval pattern intents. For $c$ and $d$ the similarity (meet) operator is defined as follows: $c \sqcap d = \langle [\min(a_i, e_i), \max(b_i, f_i)] \rangle_{i \in \{1, \dots, m\}}$. The subsumption order is therefore defined by $c \sqsubseteq d \Leftrightarrow [e_i, f_i] \subseteq [a_i, b_i]$ for all $i \in \{1, \dots, m\}$.
Example. An example of a numerical context is given in Fig. 2.2 on the left; the corresponding pattern structure is given on the right. Broadly speaking, given a set of objects that constitutes the extent of a pattern, the pattern intent is the smallest rectangle that envelops the pattern extent. For example, the smallest rectangle that envelops $\{g_1, g_3\}$ is $[1, 2] \times [3, 4]$. This rectangle includes object $g_2$ as well, thus the pattern extent is $\{g_1, g_2, g_3\}$, and the interval pattern concept is given by $(\{g_1, g_2, g_3\}, \langle [1, 2], [3, 4] \rangle)$.
The meet $\sqcap$ of two interval pattern intents is the smallest rectangle that envelops both of them. Let us consider two interval pattern concepts $(\{g_1, g_2, g_3\}, \langle [1, 2], [3, 4] \rangle)$ and $(\{g_1, g_4\}, \langle [1, 4], [3, 3] \rangle)$. Then, the meet of their intents is $\langle [1, 2], [3, 4] \rangle \sqcap \langle [1, 4], [3, 3] \rangle = \langle [1, 4], [3, 4] \rangle$.
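The interval meet and the subsumption order can be sketched in a few lines; the function names are our own, and intents are represented as tuples of (low, high) pairs.

```python
# Interval pattern intents as tuples of (low, high) pairs, one per attribute.
# The function names are ours; the values reproduce the example above.

def meet(c, d):
    """The smallest hyper-rectangle enclosing both intents."""
    return tuple((min(a, e), max(b, f)) for (a, b), (e, f) in zip(c, d))

def subsumed(c, d):
    """c ⊑ d iff every interval of d is contained in the matching interval of c."""
    return all(a <= e and f <= b for (a, b), (e, f) in zip(c, d))

c = ((1, 2), (3, 4))  # ⟨[1,2], [3,4]⟩, the intent of {g1, g2, g3}
d = ((1, 4), (3, 3))  # ⟨[1,4], [3,3]⟩, the intent of {g1, g4}

m = meet(c, d)        # ((1, 4), (3, 4)), i.e. ⟨[1,4], [3,4]⟩
subsumed(m, c)        # True: the meet is subsumed by each of its arguments
```

Note that under the subsumption order the larger rectangle is the smaller element: the meet of two intents is subsumed by both of them.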
A many-valued context can be transformed into a binary context in several ways, some of which do not entail any loss of information (at the cost of an explosion of the number of attributes). A lossless data transformation is interordinal scaling (IS) [GW99]. Interordinal scaling consists in replacing each value $v$ of attribute $m$ by two binary attributes: “$m \leq v$” and “$m \geq v$”. An interordinally scaled numerical context from Fig. 2.2 is given in Fig. 2.3. This formal context contains a lot of redundant attributes, e.g., “$m_1 \geq 1$”, “$m_1 \leq 4$”, “$m_2 \geq 3$”, and “$m_2 \leq 4$”, which hold for all objects. In [KKN11b] the authors point out not only the local redundancy of IS-itemsets (i.e., there may exist several arbitrary IS-itemsets that correspond to a single interval pattern intent) but also the global one (i.e., the number of IS-generators is not less than the number of generators of interval patterns).
The reduced context is given in Fig. 2.3 below the IS formal context. We removed the attributes with repeated extents (i.e., the duplicate columns of the formal context). However, this context is not minimal, meaning that removing the attributes “b” and “e” does not change the lattice structure. In practice, especially for real-valued data, one usually scales the data and applies one-hot encoding. Scaling consists in defining a partition, i.e., choosing a limited number of values as cut-points for each attribute rather than using all values that occur in the data, while one-hot encoding consists in replacing a particular value with the interval containing this value. We give an example of scaling and one-hot encoding in Fig. 2.4. There are many techniques to define the partition (i.e., to discretize data); we discuss them in detail in Section 5.2.1.
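Interordinal scaling itself is mechanical and can be sketched as follows. The attribute values in the table below are hypothetical, chosen only to be consistent with the intervals of the running example.

```python
# A sketch of interordinal scaling: each value w of attribute m yields two
# binary attributes "m<=w" and "m>=w". The table below is hypothetical,
# chosen only to be consistent with the intervals of the running example.

def interordinal_scale(table, attrs):
    """table: {object: {attr: value}} -> binary context {object: set of IS-attributes}."""
    values = {a: sorted({row[a] for row in table.values()}) for a in attrs}
    context = {}
    for g, row in table.items():
        cell = set()
        for a in attrs:
            for w in values[a]:
                if row[a] <= w:
                    cell.add(f"{a}<={w}")
                if row[a] >= w:
                    cell.add(f"{a}>={w}")
        context[g] = cell
    return context

table = {"g1": {"m1": 1, "m2": 3}, "g2": {"m1": 2, "m2": 3},
         "g3": {"m1": 2, "m2": 4}, "g4": {"m1": 4, "m2": 3}}
ctx = interordinal_scale(table, ["m1", "m2"])

# Attributes shared by all objects (e.g. "m1>=1", "m2<=4") are redundant.
shared = set.intersection(*ctx.values())
```

This makes the blow-up of attributes visible: each of the $|W_a|$ distinct values of an attribute produces two columns, many of which coincide or hold for every object.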
Minimum description length principle
The minimum description length (MDL) principle is widely used in data mining and knowledge discovery for model selection [Gal20]. Often, the methods based on MDL are referred to as “learning by compression”, since “learning” in these approaches is equated to “finding regularities”, and the more regularities are found, the more the data can be compressed.
MDL originates from the notion of Kolmogorov complexity [VL97]. The Kolmogorov complexity of a data sequence is defined as the length of the shortest program that prints this sequence and then halts. The basic concept related to Kolmogorov complexity is a computing machine (or, in more practical terms, a “programming language”). In [Sol64] Solomonoff described the concept of a “universal machine” $M_U$ as a subclass of the universal Turing machine such that for any computing machine $M$ and any string $x$ there exists a string $c$, depending only on $M_U$ and $M$, such that $M_U(cx) = M(x)$, where $M(x)$ is the output of $M$ given input $x$, and $cx$ is the concatenation of the strings $c$ and $x$. Thereby $c$ can be viewed as a translation instruction.
In the same work, Solomonoff proved that, asymptotically, for any two universal machines $M_1$ and $M_2$ the difference between the shortest code lengths for the same input sequence is finite and depends only on $M_1$ and $M_2$. The latter means that any program of one universal machine can be translated into an equivalent program for another universal machine by a program of finite length. This result was obtained independently by Solomonoff [Sol64], Kolmogorov [Kol65], and Chaitin [Cha69], and is now known as the invariance theorem.
However, Kolmogorov complexity is known to be uncomputable in general, and in practice the length of the translating program (mentioned in the invariance theorem above) can be too large w.r.t. the length of the data sequence $x$. To circumvent this issue, in practice one usually resorts to a simplified version of the problem: one uses a restricted class of description methods (or “machines”) such that the length of the shortest description (or “program”) is computable. For the sake of uniformity, in what follows we use the term “model” for a description method.
Hence, in practical MDL we deal with a restricted class of models $\mathcal{M}$. One distinguishes the crude and refined versions of MDL.
The crude version is formulated as follows: given a class of models $\mathcal{M}$, select the model $M \in \mathcal{M}$ that minimizes the total length, i.e., $M^* = \arg\min_{M \in \mathcal{M}} \big( L(M) + L(D \mid M) \big)$, where $L(M)$ is the length, in bits, of the description of the model, and $L(D \mid M)$ is the length, in bits, of the description of the data when encoded with this model.
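As a toy illustration of the crude two-part scheme (our own example, not one of the pattern encodings from the literature), consider selecting the number of equal-width bins for a one-dimensional sample. The encoding choices, e.g. $\log_2 n$ bits per cut-point and a fixed data precision `eps`, are our own assumptions.

```python
import math

# Toy crude MDL: choose the number k of equal-width bins for a 1-D sample
# by minimizing L(M) + L(D|M). The encoding choices (log2(n) bits per
# cut-point, fixed within-bin precision eps) are illustrative assumptions.

def two_part_length(data, k, lo=0.0, hi=1.0, eps=0.01):
    """Total length L(M) + L(D|M), in bits, of an equal-width k-bin model."""
    n = len(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    l_model = (k - 1) * math.log2(n)              # L(M): name the k-1 cut points
    l_data = sum(c * (-math.log2(c / n) + math.log2(width / eps))
                 for c in counts if c > 0)        # L(D|M): bin code + position in bin
    return l_model + l_data

# Two clusters of points: MDL should prefer a binning that separates them
data = [0.05 + 0.005 * i for i in range(20)] + [0.85 + 0.005 * i for i in range(20)]
best_k = min(range(1, 11), key=lambda k: two_part_length(data, k))
```

The trade-off is visible in the two terms: more bins increase $L(M)$ but let the data code exploit the clustered structure, so the minimizer lands between the one-bin and the finest model.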
Multiple pattern mining methods are based on the crude version of MDL since it allows for selecting a specific model, i.e., a pattern set. The crude version of MDL is used to encode various pattern types, e.g., sequences [CAP18], graphs [BCF20], rules [PvL20], etc.
In contrast to the crude version, refined MDL encodes the data with the entire model class $\mathcal{M}$ at once. Refined MDL is based on a one-part code of length $\bar{L}(D \mid \mathcal{M})$. This code is designed in such a way that whenever some $M \in \mathcal{M}$ ensures a small $L(D \mid M)$, the code length $\bar{L}(D \mid \mathcal{M})$ will also be small [Grü07]. The class of models $\mathcal{M}$ is usually parametric and, in the literature, $\mathcal{M}$ may be referred to as a single full model, where $M$ corresponds to some values of the parameters of this model. In this work, for the sake of uniformity, we call $\mathcal{M}$ a class of models both for crude and refined MDL. In refined MDL, $\bar{L}(D \mid \mathcal{M})$ is called the stochastic complexity of the data given the class of models. Apart from the stochastic complexity, another fundamental notion of refined MDL is the parametric complexity $COMP(\mathcal{M})$, which characterizes the “richness” of the model class $\mathcal{M}$. The two complexities are related as follows: $\bar{L}(D \mid \mathcal{M}) = L(D \mid \hat{M}) + COMP(\mathcal{M})$, where $\hat{M}$ denotes the distribution in $\mathcal{M}$ that minimizes the length of the data $L(D \mid \hat{M})$. Thus, the stochastic complexity decomposes into a goodness-of-fit term and a term characterizing the complexity of the model class.
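These notions are computable exactly for small model classes. As our own illustration (using the standard Bernoulli model class rather than any model from this work), the parametric complexity is the log-normalizer of the normalized maximum likelihood (NML) distribution:

```python
import math

# Parametric complexity COMP(M) of the Bernoulli model class on n binary
# outcomes, computed exactly via the NML normalizer. This is a standard
# refined-MDL calculation, used here only as an illustration.

def ml_prob(k, n):
    """Maximized likelihood (k/n)^k ((n-k)/n)^(n-k) of a sequence with k ones."""
    if k in (0, n):
        return 1.0
    p = k / n
    return p ** k * (1 - p) ** (n - k)

def comp_bernoulli(n):
    """COMP(M) = log2 of the sum of maximized likelihoods over all sequences."""
    return math.log2(sum(math.comb(n, k) * ml_prob(k, n) for k in range(n + 1)))

def stochastic_complexity(ones, n):
    """L(D|M-hat) + COMP(M) for a binary sequence of length n with `ones` ones."""
    return -math.log2(ml_prob(ones, n)) + comp_bernoulli(n)
```

For instance, `comp_bernoulli(1)` is exactly 1 bit: both one-bit sequences receive maximized likelihood 1, so the NML normalizer is 2. The decomposition above is visible in `stochastic_complexity`: a goodness-of-fit term plus a data-independent complexity term.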
Quality measures in pattern mining
As mentioned above, the main difficulty in the evaluation of the results of pattern mining is the lack of ground truth. Put differently, the actual patterns are unknown. In some cases, when several miners use the same objective, the evaluation can be performed by comparing the values of the chosen interestingness measure. For example, in MDL-based methods, the compression ratio may be chosen as a quality measure. However, an interestingness measure does not always fully reflect the common intuition about the quality of the results.
A straightforward method to mitigate this issue is to apply quality measures that reflect the desired properties. We consider the quality measures within the following categorizations:
– measures for a set of patterns and for a pattern set;
– measures for evaluating patterns in unsupervised and supervised settings.
The key difference between measures for a set of patterns and for a pattern set is that the former are applied to each pattern individually and then aggregated across the set, without taking into account overlaps and interactions between the patterns. Thus, the quality of a set of patterns is usually an aggregated value of a certain quality measure, while measures for pattern sets are applied to the whole pattern set, so that patterns are evaluated w.r.t. the other patterns.
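The contrast can be made concrete with two toy measures of our own: a per-pattern support averaged over the set, versus a joint coverage that accounts for overlaps between patterns.

```python
# Toy contrast between the two families of measures (both measures are ours).
# Patterns are represented by their extents (sets of object ids).

def mean_support(extents, n_objects):
    """Set-of-patterns style: average each pattern's support independently."""
    return sum(len(e) for e in extents) / (len(extents) * n_objects)

def joint_coverage(extents, n_objects):
    """Pattern-set style: fraction of objects covered by at least one pattern."""
    return len(set().union(*extents)) / n_objects

extents = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}]
mean_support(extents, 10)    # 0.3: each extent is counted independently
joint_coverage(extents, 10)  # 0.5: overlapping objects are counted once
```

The first measure would not change if all three patterns covered the same three objects, whereas the joint coverage would drop, which is exactly the interaction that pattern-set measures are meant to capture.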
The difference between the measures of the second group is that the “supervised” measures are applied in the case where the ground truth is known, while the “unsupervised” measures require no additional information.
Usually, the quality measures formalize some intuition of the analysts about interestingness, e.g., such characteristics as diversity, coverage, meaningfulness, etc. In [KH06b] the authors propose the following criteria of interestingness:
– no two patterns should cover (approximately) the same set of examples;
– no pattern should cover (approximately) the complement of another pattern;
– no pattern should cover (approximately) a logical combination of two or more other patterns;
– patterns should be (approximately) mutually exclusive;
– when using patterns as input to a classifier, the pattern set should lead to the best performing classifier;
– patterns should lie on the convex hull of all patterns in ROC-space.
In this section we do not attempt to provide an exhaustive list of possible formalizations but rather provide quite intuitive and commonly used quality measures.
Quality measures in supervised settings
The quality measures considered in the previous section evaluate patterns w.r.t. the dataset and other patterns in the pattern set. To assess how meaningful patterns are, some background knowledge is required. Among possible background knowledge, the class labels are widely available information to perform this task. To evaluate patterns using the class labels, one may adopt quality measures for classifiers or even build a classifier and assess its quality. Here we distinguish the following categories of quality measures applicable in supervised settings:
– measures for a set of patterns (i.e., applied to each pattern individually) and a pattern set (i.e., applied to the whole set).
– measures to assess how well patterns describe the data (i.e., assessment of how patterns “fit” the data) and how well patterns predict the classes of new instances (i.e., ability of patterns to generalize the knowledge extracted from the training set).
– measures to evaluate the quality of patterns themselves or classifiers built based on these patterns.
Regarding the evaluation of the descriptive and predictive ability of patterns, we emphasize that the class labels are not available during pattern mining and are used only for assessing pattern quality. Thus, evaluating patterns with the class labels on the same dataset on which they were computed (i.e., the “training set”) allows us to estimate how well a particular pattern fits the data w.r.t. the class labels, while comparing the quality of a pattern on the training and test sets allows for assessing the generalization ability of patterns.
We also emphasize the difference between approaches to evaluating patterns themselves and classifiers built on these patterns. In the first case, we minimize the impact of a classification model and assess patterns “as they are”. In the second case, we assess how useful patterns may be for building a classification model. However, several details should be taken into account: (i) a particular classification model may be more suitable for patterns computed by a particular pattern mining method, or even for a particular dataset; (ii) the quality of the classifiers may be improved by tuning their parameters, or by slightly changing the classification model. Hence, in the second case, we do not evaluate patterns directly but rather evaluate them w.r.t. the chosen classification model.
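A minimal sketch of this kind of evaluation, with our own toy labels, extents, and choice of precision as the quality measure:

```python
# Toy sketch (our own): evaluating a pattern with class labels it was not
# mined with. Precision on the training set measures fit; the train/test
# gap hints at how well the pattern generalizes.

def precision(extent, labels, target):
    """Fraction of covered, labeled objects that carry the target label."""
    covered = [g for g in extent if g in labels]
    if not covered:
        return 0.0
    return sum(labels[g] == target for g in covered) / len(covered)

train_labels = {0: "+", 1: "+", 2: "-", 3: "+"}
test_labels = {4: "+", 5: "-", 6: "-"}

extent_train = {0, 1, 3}  # training objects covered by the pattern
extent_test = {4, 5}      # test objects covered by the same pattern

precision(extent_train, train_labels, "+")  # 1.0: perfect fit on the training set
precision(extent_test, test_labels, "+")    # 0.5: the pattern generalizes less well
```

Note that the labels play no role in how the extent was obtained; they are consulted only afterwards, which is what separates this evaluation from supervised pattern mining itself.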
Let us consider the most common approaches. We begin with the simplest one — quality measures applied to patterns themselves.
Table of contents :
Chapter 1 Introduction
1.1 Outline of pattern mining
1.2 Thesis structure and contribution
Chapter 2 Background
2.1 Formal concept analysis
2.2 Pattern structures and interval pattern structures
2.3 Minimum description length principle
2.4 Quality measures in pattern mining
2.4.1 Pattern set quality measures in unsupervised settings
2.4.2 Quality measures in supervised settings
Chapter 3 Closure structure
3.2 Related work
3.3 Basic notions
3.4 Closure structure and its properties
3.4.1 Keys and passkeys
3.4.2 Closure structure
3.4.3 Passkey-based order ideal
3.4.4 Assessing data complexity
3.4.5 Closeness under sampling
3.5 The GDPM algorithm
3.5.1 Computing the closure structure with GDPM
3.5.2 The extent-based version of GDPM
3.5.3 Complexity of GDPM
3.5.4 Related approaches to key-based enumeration
3.5.5 Computing passkeys. Towards polynomial complexity
3.6.1 Characteristics of the datasets
3.6.2 Computational performance
3.6.3 Data topology or frequency distribution within levels
3.6.4 Coverage and overlaps
3.6.5 Usefulness of concepts
3.6.6 Case study
3.7 Discussion and conclusion
Chapter 4 Pattern mining in binary data
4.1.1 Pattern types
4.1.2 Exploring the pattern search space
4.1.3 Interestingness measures
4.2 Minimum description length principle in itemset mining
4.3 Greedy strategy to reduce pattern search space
4.3.1 Likely-occurring itemsets
4.3.2 Likely-occurring itemsets and related notions
4.4 Adapting the best practices of supervised learning to itemset mining
4.4.1 KeepItSimple: an algorithm for discovering useful pattern sets
4.5 Discussion and conclusion
Chapter 5 Pattern mining in numerical data
5.2 Related work
5.2.1 Numerical data preprocessing
5.2.2 MDL-based approaches to pattern mining
5.3 Basic notions
5.3.1 Formalization of data and patterns
5.3.2 Information theory and MDL
5.3.3 Derivation of the plug-in codes
5.3.4 ML-estimates of the parameters of the multinomial distribution
5.4.1 The model encoding
5.4.2 The Mint algorithm
5.6 Discussion and conclusion
Chapter 6 Conclusion and future work