Learning Top-Down Tree Transducers with Regular Inspectio

Get Complete Project Material File(s) Now! »

Learning DTops with Regular Inspection

We show that DTops with regular domain inspection (DTopIreg) can be learned in a Gold-style model with polynomial resources. The regular domain of the target transformation is a parameter of the learning problem. The polynomial complexity, however, do not directly depend on the size of the target DTopIreg. Rather, it is based on the number of semantic alignments, a notion we introduce to get Myhill-Nerode–like equivalence classes. That number may be exponential in the size of the target DTop in the worst case.
Our learning result extends the one for DTops with top-down inspection (DTopItd) from Lemay, Niehren, and Maneth [Lemay et al., 2010], which in turn is based on a normal form for these transducers from Engelfriet, Maneth, and Seidl [Engelfriet et al., 2009]. In this case, the normal form of a DTopItd ensures that the DTop works in synchronization with the top-down deter-ministic tree automaton recognizing its domain. This is no longer possible for regular domains, as these cannot always be defined by top-down deterministic tree automata (while bottom-up determinism would be suﬃcient). The extension to regular schemas is relevant for two main reasons. First, it allows to consider DTops with restrictions by RelaxNG schemas, since these can express all regular tree languages (in contrast to Xml Schema). This higher expressiveness is appreciated in applications such as Sbml, whose earlier versions were defined by an Xml Schema, and whose current version is defined by an RelaxNG schema. Furthermore; the ranges of Web publishing transformations are often regular, but not top-down. As an example, let us consider a Web publication task that sends Xml files following a top-down schema to Html files belonging to a schema that has a data constraint (see Figure 1.7), because it copies an information from the input in two diﬀerent leaves of the output.

Learning Rational Functions

We propose a learning algorithm for deterministic subsequential word trans-ducers with lookahead (Dwt‘). This class corresponds to all rational functions on words [Elgot and Mezei, 1965]. Not only was learning rational functions an open problem, it is also to our knowledge a rare occurrence to learn a class of transducers with lookahead. The main diﬃculty we need to circumvent to learn the normal form of [Reutenauer and Schützenberger, 1991] is that its equivalence class is hard to learn with a finite number of examples: two suﬃxes are equivalent, and can be sent to the came state of the lookahead, if replacing one with the other at the end of a word only creates a bounded diﬀerence in the output. To palliate with this problem we say that there is a bound m such that if two suﬃxes creates more than m diﬀerences in the output, then they create arbitrarily big diﬀerences in the output. While this bound exists for all rational functions, it is not easily known or learned. If we are to underestimate this bound, then we would learn a lookahead finer than the one of the normal form of [Reutenauer and Schützenberger, 1991]. The solution we propose is a to learn a normal form that does not necessarily have a minimal lookahead, but that can be learned by guessing the bound m, the number of states n the Dwt‘ would need, then trying to build a Dwt‘ under n states while assuming any suﬃxes creating more than m output diﬀerences are not equivalent. If it fails, the algorithm tries again with a bigger bounds m and n.
The result of this algorithm might not have a minimal lookahead, if it manages to find a Dwt with a reasonable number of states to complete its lookahead. When compared to the normal form of [Reutenauer and Schützenberger, 1991], it strikes a bargain between the roles of the lookahead and of the Dwt. Whether the normal form of [Reutenauer and Schützenberger, 1991] can actually be learned or not is unknown: from our normal form, finding a Dwt‘ with minimal lookahead is tantamount to lookahead minimization. As previ-ously mentioned, even deciding if a lookahead can be completely removed is a challenging problem. We also note that the problem of finding a normal form for DTops with lookahead remains open: the techniques of this paper do not translate easily into trees.

Linear Tree to Word Transducers

We present Linear tree-to-word transducers (Ltws), whose only constraint is to not create copies of an input subtree. On this class, we provide a Myhill-Nerode algorithm, leading to an earliest minimal normal form. Furthermore, the equivalence problem on these linear transducers remains decidable in poly-nomial time, through a non-trivial reduction to the Stw case. The lift of the restriction forbidding reorderings in Stws is a necessary step to study tree-to-word transducers, and hope to find a normal form on this class. The possibility of reordering subtrees lead to a complication for normal-ization of Ltws, as two transducers can be equivalent but use diﬀerent re-orderings. The notion of earliest transducers can be directly imported from Stws, but it is no longer enough to define a normal form. To fix this issue we show that two earliest Ltws can be equivalent while using diﬀerent reorderings if and only if they exclusively reorder periodic pro-ductions of same period. By then imposing an order on subtrees whenever a periodic production leaves the choice open, we provide an ordered earliest normal form for Ltws.
The adaptation of the polynomial equivalence test proves to be more chal-lenging. In the Stw case, the strategy is to reduce the problem to the mor-phism equivalence problem on a context-free grammar [Plandowski, 1995]. This means that the fact that both transducers use the same reordering is central to the reduction. The characterization of possible reorderings we established is only made for earliest Ltws, and making a Ltw earliest can only be made in exponential time. Hence the computation of the normal form or even of a fully earliest Ltw is out of the question to preserve a Ptime algorithm. The solution we propose is to only compute the earliest form for the few states that can possibly be reordered, i.e. those whose production can become periodic in an earliest Ltw. These states are both easier to make earliest, using word combinatorics results, and the only relevant one to reduce Ltw equivalence to Stw equivalence. We finally prove this reduction to be in Ptime, which leads to a Ptime algorithm to decide Ltw equivalence.
We do not present the learning algorithm for the normal form we provide on Ltws. It would likely involve an adaptation of the learning algorithm on Stws [Laurence et al., 2014], but requires additionally to learn in polynomial time in what order the subtrees’ images should be sorted. The extension on the normal form to the general tree-to-word transducers case remains open, and is likely to be complicated. The main challenge is likely to reside in identifying which fragment of the output comes from which subtree in the input.

Transducers for Data Trees

Most classes of transducers we presented model Xslt’s ability to read, test and transform a tree’s structure, but very few concern themselves on data-centric question. An Xslt program can do some tests on the data contained in the input tree, and copy or modify them when producing its output. These abilities are both diﬃcult to model, as it forces to forsake the ad-vantages of working on finite input and output signatures, but are also known to be diﬃcult to deal with for static analysis questions (see for example [Bojanczyk et al., 2011]).
Symbolic automata and transducers [Veanes et al., 2012, Veanes and Bjørner, 2011] work on data words and trees, where each letter or node is labeled by some data. These classes are defined modulo a background theory that defines what tests and operations can be performed on the labeling data. For example, if our nodes are annotated by integers, the Presburger theory defines Presburger symbolic automata and transducers.
One of the major draws of this class is that they work on a possi-bly infinite signature while still oﬀering the possibility of using classical formal languages methods, as long as the background theory. Notably, [D’Antoni and Veanes, 2014] presents a minimal normal form for symbolic word automata. On word transducers, [Veanes et al., 2012] proves that under the condition of background theory decidability, there exists algorithms to decide the equivalence of symbolic word transducers. While this result does not translate entirely into tree transducers, [Veanes et al., 2012] provides a decision algorithm for the equivalence of symbolic tree transducers under the condition that they are linear (no rule can use the same subtree twice).

Table of contents :

1 Introduction
1.1 Motivation
1.1.1 Semi-Structured Data
1.1.2 Xml Document Processing
1.1.3 Transformation Learning
1.1.4 Tree Transducers
1.2 Limitations of Transducer Learning
1.2.1 Schema Restrictions
1.2.2 Lookahead
1.2.3 Output Concatenation
1.3 Contributions
1.3.1 Learning DTops with Regular Inspection
1.3.2 Learning Rational Functions
1.3.3 Linear Tree to Word Transducers
2 Related Work on Transducers
2.1 Mso Transformations
2.1.1 Words
2.1.2 Trees
2.2 Bottom-Up Transducers
2.3 Transducers for Data Trees
3 Preliminaries
3.1 Words
3.1.1 Word Automata
3.1.2 Determinism
3.1.3 Myhill-Nerode Theorem
3.2 Trees
3.2.1 Tree Automata
3.2.2 Determinism
3.2.3 Myhill-Nerode Theorem
3.3 Automata Learning
3.3.1 Learning Model
3.3.2 Characteristic Samples
3.3.3 Learning Algorithm Rpni
4 A Learning Algorithm for Top-Down Tree Transducers
4.1 Introduction
4.2 Illustration of Ideas
4.2.1 Myhill-Nerode Theorem for DTops
4.2.2 Example Transformation
4.3 Top-Down Tree Transducers
4.3.1 Top-Down Tree Transducers
4.3.2 Top-Down Domains
4.3.3 Domain Inspection
4.4 Syntactic Equivalence
4.4.1 Syntactic Alignment
4.4.2 Trimmed Transducers
4.4.3 Origins of Output Constructors
4.4.4 Syntactic Equivalence
4.5 Compatible Transducers
4.6 Earliest Transducers
4.7 Semantic Equivalence
4.7.1 Semantic Alignments
4.7.2 Semantic Equivalence
4.7.3 Semantic Successors
4.8 Unique Normal Forms
4.9 Learning from Examples
4.9.1 Learning Model
4.9.2 Characteristic Samples
4.9.3 Learning Algorithm
4.10 Remaining Proofs
4.10.1 Trimming a DTopI
4.10.2 Origins of Output Constructors
4.11 Conclusions
5 Learning Top-Down Tree Transducers with Regular Inspection
5.1 Introduction
5.2 Running Example
5.3 Semantic Equivalence
5.4 Unique Normal Forms
5.5 Learning from Examples
5.5.1 Characteristic Samples
5.5.2 Learning Algorithm
6 Learning Rational Functions
6.1 Introduction
6.2 Rational Functions
6.3 Transducers with Look-Ahead
6.4 Building the Look-Ahead Automaton
6.5 The Learning Algorithm
6.6 Characteristic Samples
7 Normal Forms of Linear Tree-to-Word Transducers
7.1 Introduction
7.2 Linear Tree-to-Word Transducers
7.3 Earliest Linear Transducers
7.4 Reordering in Earliest Transducers
7.4.1 Reordering Erasing States
7.4.2 Reordering Producing States
7.5 Myhill-Nerode Theorem and Normal Form
7.6 Conclusion and Future Work
8 Equivalence of Linear Tree-to-Word Transducers is in Polynomial Time
8.1 Introduction
8.2 Preliminaries
8.3 Linear Tree-to-Word Transducers
8.4 Partial Normal Form
8.4.1 Eliminating Non-Earliest Quasi-Periodic States
8.4.2 Switching Periodic States
8.4.3 Testing Equivalence in Polynomial Time
9 Conclusion
Bibliography