A mixed node-graph tree-based semantic patch inference

Get Complete Project Material File(s) Now! »

A mixed node-graph tree-based semantic patch inference

The title of this section sums up the characteristics of Spinfer’s semantic patch inference: it works with the CFG and considers the CFG nodes as ASTs, and the main inference is tree-based since it is performed on the CFG nodes. During the reading of this section, one can refer to Figure 1 for an overview of the different phases of the workflow of Spinfer.

Merged CFG

Identifying and aligning the differences between two source code file is non-trivial. We preferred to leave this task to the traditional UNIX diff, which indicates the line numbers which have been modified in the code “before” and “after”. Using the information gathered during the parsing of both versions of the code and the result of the diff, we are then able to tag the nodes of both CFG with the following rules:
• if a node in the “before” CFG is associated with a line that diff reports as being removed, we tag the node with -;
• if a node in the “after” CFG is associated with a line that diff reports as being added, we tag the node with +;
• if a node is not associated with any line that diff reports as being removed or added, we tag the node with 0.
Next, we merge the two CFGs into a single one, based on this tag information. The underlying assumption of this algorithm is that since nodes tagged with 0 are not modified, they appear in both CFGs but they will appear only once in the merged CFG. On the other hand, if a line of code is modified, it will be tagged by – in the “before” CFG and tagged by + in the “after” CFG. Then, the two versions of the node will appear in the merged CFG, each version linked to its correct predecessors and successor who also appear in the merged CFG, whether they are modified or not. Figure 2b illustrates the behavior of this algorithm. The algorithm is implemented in a single parallel traversal of both the “before” and “after” CFGs (see Algorithm 1). The general principle is the following: as long as we traverse nodes that are not modified, the parallel traversal is synchronized between the two graphs. But when for instance we encounter a removed node on the “before” CFG that is not present (by definition) in the “after” CFG, we stop our traversal of the “after” CFG while continuing in the “before” CFG until we find in the “before” CFG the node at which we stopped at in the “after” CFG. We are sure to reach this point because if the node we stopped at in the “after” CFG is not modified, then we are sure it will also appear in the “before” CFG. Then, we resume our synchronized parallel traversal. Another solution for aligning the differences of the CFGs could have been to use a tree-differencing algorithm such as GumTree by Falleri et al. [3] to align the differences of the AST of the code “before” and “after” and then transformed this merged AST into a merged CFG.

Extracting common patterns

The semantic patches we want to produce are made of a sequence of lines of code separated by …, similar to Spdiff’s output illustrated in Listing 4. These lines of codes, which possibly contain metavariables, are called patterns. We say that a pattern matches a CFG node if the AST of the pattern and the AST of the CFG node are the same; a metavariable is the same as any subtree. For instance, X0=foo(X1->X2) is a pattern, as well as X0=X1(X2). Both may match x = foo(y->z). The second pattern is said to be more general than the first one because all the nodes matched by the first one are matched by the second one. We can also say that the second pattern matches the first pattern.

Abstraction of patterns

We want Spinfer to produce safe semantic patches: it means that if we apply the semantic patch output by Spinfer on the same modified code it has been inferred from, it should not perform any modifications that are not part of the original set of modifications. The difficulty when building patterns is to get to the right level of abstraction to represent multiple modifications in a single pattern while ensuring the safety property. Andersen et al. [1] addresses this issue by computing all the different possibilities of abstraction of a semantic patch and choosing the safe pattern that is the most concise. Even with the pruning strategies implemented by Andersen et al., we found empirically that this algorithm does not terminate in many cases in a reasonable amount of time. But in this section, we deal with patterns and not whole semantic patches. Since the safety of the semantic patch depends on the whole semantic patch and not on each of its patterns individually, the pattern mining algorithm presented in the next section to generate patterns does not consider the safety of the patterns it produces. For instance, the pattern – X0=X1(X2) alone may be unsafe because it would match a too large number of nodes in the original code; but if we add the pattern foo(X0) as a context, then the whole semantic patch may become safe.
The abstraction strategy we take targets conciseness; the pattern mining algorithm presented in the next section can be seen as a provider of raw pattern material to the next phase of semantic patch inference which will actually produce semantic patches that are safe.

Pattern mining

We have a forest of ASTs, each corresponding to a single CFG node, and want to extract patterns from it. The details are in Algorithm 2. We first run the same clone detector as in Section 2.1.3 on the forest with all identifiers and constants abstracted, to determine the groups of similar ASTs. Then we compare each pair of ASTs in each group to find the Maximum Common Embedded Subtree (MCES) using an algorithm by Lozanno et al. [9]. A tree t1 is said to be embedded in another tree t2 if we can transform t2 in t1 by a series of edge contractions. An edge contraction is the action of merging together two nodes of the tree separated by one edge. An example of edge contraction is provided in Lozanno et al.’s MCES algorithm is based on edge contractions in the tree: the algorithm performs a parallel pre-order traversal of the AST and contracts edges that are different until only the common edges remain.1 In the implementation of the algorithm, the labels are carried by the edge whose child is the node rather than the nodes themselves. That is why contracting an edge means destroying the child node of this edge.

Table of contents :

1 Context and related work
1.1 The Linux kernel: large code base and collateral evolution
1.2 Related work: semantic patch inference
1.2.1 Safety and conciseness
1.2.2 Spdiff
1.2.3 LASE
1.2.4 Review
2 A mixed node-graph tree-based semantic patch inference
2.1 Data structures
2.1.1 Motivations
2.1.2 Merged CFG
2.1.3 Skimming
2.2 Extracting common patterns
2.2.1 Abstraction of patterns
2.2.2 Pattern mining
2.2.3 Abstraction rules
2.2.4 Usage
2.3 Building sequences of patterns
2.3.1 Ordering on CFG nodes
2.3.2 Growing pattern sequences
2.4 Producing semantic patches
3 The Spinfer program
3.1 Implementation
3.2 Current results
3.3 Performance analysis
4 Conclusion