Related work: semantic patch inference
Many attempts at semantic patch inference have been made; they are summarized in Kim's work. The two most relevant tools are LASE and Spdiff, which are discussed in Sections 1.2.2 and 1.2.3. To better understand semantic patch inference, Section 1.2.1 first defines two important notions.
Safety and conciseness
Andersen's theory of atomic term patching defines two very important notions for evaluating the results of a semantic patch: safety and conciseness. We consider a simplified language that has only functions and integers. Here is a set of "before" and "after" code fragments.
1. f(1, g(1)) → f(7, g(1))
2. f(1, 2) → f(7, 2)
3. g(f(3, 2)) → g(f(7, 2))
4. g(3) → g(3) (this term is not modified)
5. g(2) → g(9)
If we consider the semantic patch f(x, y) ⇒ f(7, y), where x and y are metavariables, it applies to situations 1, 2 and 3 and produces correct results, and it does not apply to situations 4 and 5: this patch is safe. On the other hand, the semantic patch g(x) ⇒ g(9) is not safe because it would produce a wrong result on situation 4. Safety is thus always defined relative to a set of changed (or unchanged) code. The precise definition proposed by Andersen is more complicated, but the overall idea is that if a patch is safe, it will not perform unwanted modifications.
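To make the notion concrete, here is a small Python sketch (our own illustration, not Andersen's formal machinery): terms of the simplified language are nested tuples such as ("f", 1, ("g", 1)), metavariables are strings starting with "?", and a rule is checked for safety against the five situations above.

```python
def match(pattern, term, env):
    """Try to bind pattern against term; return the extended environment or None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in env:                      # metavariable seen before:
            return env if env[pattern] == term else None
        return {**env, pattern: term}           # first occurrence: bind it
    if isinstance(pattern, tuple) and isinstance(term, tuple) \
            and len(pattern) == len(term):
        for p, t in zip(pattern, term):
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return env if pattern == term else None     # literals must be equal

def subst(pattern, env):
    """Instantiate a right-hand side using the bindings collected by match."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        return env[pattern]
    if isinstance(pattern, tuple):
        return tuple(subst(p, env) for p in pattern)
    return pattern

def rewrite(term, lhs, rhs):
    """Apply lhs => rhs once, outermost-first; return term unchanged if it
    matches nowhere."""
    env = match(lhs, term, {})
    if env is not None:
        return subst(rhs, env)
    if isinstance(term, tuple):
        return tuple(rewrite(t, lhs, rhs) for t in term)
    return term

def is_safe(rule, examples):
    """A rule is safe w.r.t. the examples if, on every 'before' term, it
    either produces the recorded 'after' term or does not apply at all."""
    lhs, rhs = rule
    return all(rewrite(before, lhs, rhs) in (before, after)
               for before, after in examples)

examples = [
    (("f", 1, ("g", 1)), ("f", 7, ("g", 1))),   # situation 1
    (("f", 1, 2),        ("f", 7, 2)),          # situation 2
    (("g", ("f", 3, 2)), ("g", ("f", 7, 2))),   # situation 3
    (("g", 3),           ("g", 3)),             # situation 4 (unchanged)
    (("g", 2),           ("g", 9)),             # situation 5
]
safe_rule   = (("f", "?x", "?y"), ("f", 7, "?y"))   # f(x, y) => f(7, y)
unsafe_rule = (("g", "?x"),       ("g", 9))         # g(x) => g(9)

assert is_safe(safe_rule, examples)
# g(x) => g(9) would wrongly rewrite g(3) in situation 4:
assert not is_safe(unsafe_rule, examples)
```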
The semantic patches f(x, y) ⇒ f(7, y) and g(2) ⇒ g(9) are concise because together they implement all the modifications of the modified code base. Conciseness is thus related to the level of abstraction of a patch and to the proportion of changes covered by the patches. For instance, f(1, y) ⇒ f(7, y) is not concise since it does not cover situation 3; to implement all the modifications, we would need another patch f(3, y) ⇒ f(7, y). Having two rules instead of one to describe the same modifications is less concise. The conciseness of a patch is also relative to the modified code base.
Safety and conciseness are distinct properties with no logical implication between them. The ideal goal of semantic patch inference is to infer patches that are both safe and concise, but there is a tradeoff between the two. Indeed, the more metavariables a semantic rule contains, the more concise it typically becomes, but also the greater the risk that it becomes unsafe. On the other hand, a trivial solution to semantic patch inference is to list all the modifications without using any metavariables, but that would not help to understand the transformation at a higher level and would not be robust, in the sense that the inferred semantic patch could not be applied to code different from the examples it was inferred from.
Andersen and Lawall have already developed a semantic patch inference tool called Spdiff. The semantic patches it produces feature both … and metavariables. Spdiff's workflow is designed to satisfy safety and conciseness, using algorithms derived from the theory of atomic term patching described above. It is suggested that Spdiff be used with small amounts of code chosen by the programmer: Spdiff is good at finding the context of modifications (context that diff does not provide) from a small set of code, but it does not scale up well when confronted with thousands of lines of code, because the algorithm that determines the right level of abstraction (to be as concise as possible while remaining safe) enumerates all the possible ways of creating metavariables in order to choose a good one.
Spdiff works by first identifying maximal semantic patterns, i.e. semantic patches with no added or removed nodes, which match the code before the modifications, then trying to pair pieces of modifications with this maximal semantic pattern. A side effect of this technique is the output of irrelevant, overly abstract context in the semantic patch (for instance X6 *X5), consisting only of metavariables unrelated to those in the modified code. We can see this effect in Listing 4, as well as the appearance of devm_ioremap_resource in the semantic patch: it was added as context of the modification, but it is too specific and might undermine the robustness of the semantic patch.
Listing 4 – Output of Spdiff on the same commit 323de9ef of the Linux kernel described by the semantic patch of Listing 1.
2 expression X21; expression X26; X21; struct pinctrl_desc X22; identifier X12;
3 X24 *X25; expression X23; expression X14; X7 X8; expression X9;
4 expression X5; X6;
6 X6 *X5;
10 X25->X12 = devm_ioremap_resource(X23, X14);
12 X21 = pinctrl_register(&X22, X23, X25);
14 - return -[X26];
15 + return PTR_ERR(X21);
Nevertheless, SmPL has many strong points as a result language for inference: metavariables can stand for expressions of any type (expression X14) or of a specific type (struct pinctrl_desc X22). Metavariables can also occur multiple times in the semantic patch, which is very important for added pieces of code, since Spdiff uses what was in the code before the modifications to describe the added code: the X21 in X21 = pinctrl_register(&X22, X23, X25); is the same X21 as in + return PTR_ERR(X21);.
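The way a metavariable bound in the context flows into an added line can be sketched in a few lines of Python. This is a hypothetical illustration, not Spdiff's implementation: metavariables X&lt;n&gt; are matched as single identifiers (so each must occur only once in the context pattern), whereas real SmPL metavariables can bind arbitrary expressions.

```python
import re

def match_context(pattern, line):
    """Bind X<n> metavariables (each occurring once) against a code line."""
    # Escape the pattern, then turn each X<n> into a named capture group.
    regex = re.sub(r"X(\d+)", r"(?P<X\1>\\w+)", re.escape(pattern))
    m = re.fullmatch(regex, line)
    return m.groupdict() if m else None

def instantiate(template, env):
    """Replace each X<n> in an added-code template with its binding."""
    return re.sub(r"X(\d+)", lambda m: env["X" + m.group(1)], template)

# Bind X21, X22, X23, X25 against a concrete call site...
env = match_context("X21 = pinctrl_register(&X22, X23, X25);",
                    "ret = pinctrl_register(&desc, dev, data);")
assert env["X21"] == "ret"
# ...then reuse X21 to instantiate the added line.
assert instantiate("return PTR_ERR(X21);", env) == "return PTR_ERR(ret);"
```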
While the underlying theory of Spdiff is correct, the implementation chosen for the tool leads to some drawbacks in practical usage. Indeed, some code structures, such as if(!X1), cannot currently be inferred by Spdiff. Moreover, because of its internal workflow of first identifying the patterns in the "before" code and then adding the modifications, we have observed empirically that in the semantic patches produced by Spdiff, the context tends to include irrelevant patterns of context code and the modified patterns tend to cover only part of the modifications.
Another interesting approach to semantic patch inference is LASE (Kim et al.). LASE operates on Java code, and its authors have also chosen to design their own dedicated rule application tool to apply the semantic patches that they infer. Listing 5 is an example of a semantic patch produced by LASE. The metavariables in this semantic patch are u$0:FieldAccessOrMethodInvocation (meaning that this metavariable is either a field access or a method invocation) and v$0. There is no … in LASE semantic patches; all the statements in a semantic patch have to be contiguous.
Listing 5 – Example of an edit script inferred by LASE, written using a SmPL-like syntax.
Iterator v$0 = u$0:FieldAccessOrMethodInvocation.values().iterator();
while (v$0.hasNext())
-   MVAction action = (MVAction)v$0.next();
-   if(action.m$0())
+   Object next = v$0.next();
+   if(next instanceof MVAction)
+       MVAction action = (MVAction) next;
The major phases of LASE’s patch inference are:
• perform tree-differencing on the ASTs of the code before and after to find tree edit scripts, i.e. lists of atomic tree operations, such as "remove a node", that transform the AST of the code "before" into the AST of the code "after";
• determine the longest common edit script while abstracting some identifiers (names of variables, methods, etc.);
• determine the context of the modification by first using code clone detection on the code of the methods impacted by the modifications, then filtering the results of the clone detection by extracting the largest subtree common to the ASTs of these methods.
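The second phase above can be sketched as follows. This is a toy illustration under simplifying assumptions of our own: edit operations are (kind, code) string pairs rather than tree edits, and identifiers are renumbered per operation, whereas LASE keeps abstractions consistent across the whole script.

```python
import re
from difflib import SequenceMatcher

def abstract(op):
    """Replace identifiers in the code part by numbered placeholders, so
    that edits differing only in variable or method names can match."""
    kind, code = op
    names = {}
    def repl(m):
        name = m.group()
        if name not in names:
            names[name] = f"${len(names)}"
        return names[name]
    return (kind, re.sub(r"[A-Za-z_]\w*", repl, code))

def common_edit_script(script_a, script_b):
    """Longest common subsequence of the two abstracted edit scripts."""
    a = [abstract(op) for op in script_a]
    b = [abstract(op) for op in script_b]
    sm = SequenceMatcher(None, a, b, autojunk=False)
    return [a[i] for blk in sm.get_matching_blocks()
            for i in range(blk.a, blk.a + blk.size)]

# Two edit scripts whose first operations differ only in identifier names:
script_a = [("insert", "Iterator i = c.iterator();"), ("insert", "i.remove();")]
script_b = [("insert", "Iterator it = list.iterator();"), ("remove", "x.close();")]
assert common_edit_script(script_a, script_b) == [("insert", "$0 $1 = $2.$3();")]
```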
Code clone detectors use a large variety of techniques based on tokens, Abstract Syntax Trees (ASTs) or Control Flow Graphs (CFGs) to identify contiguous blocks of code that are repeated across a code base, where "repeated" can mean either textual equality or a more abstract semantic equality. Code clone detection is widely used in software engineering and has been extensively studied; a general overview of the topic is provided by Roy et al.
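The simplest of these techniques, exact matching of fixed-size token windows, fits in a few lines; it is a minimal sketch only, since real detectors also normalize identifiers, work on ASTs or CFGs, and merge overlapping windows into maximal clones.

```python
from collections import defaultdict

def token_clones(tokens, k=4):
    """Map each k-token window that occurs more than once to the list of
    its starting positions in the token stream."""
    seen = defaultdict(list)
    for i in range(len(tokens) - k + 1):
        seen[tuple(tokens[i:i + k])].append(i)
    return {win: pos for win, pos in seen.items() if len(pos) > 1}

# "a = f(x);" appears twice (positions 0 and 14 in the token stream):
tokens = "a = f ( x ) ; b = f ( y ) ; a = f ( x ) ;".split()
clones = token_clones(tokens)
assert clones[("a", "=", "f", "(")] == [0, 14]
```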
The common subtree extraction of LASE operates on an AST whose nodes are incomplete statements of Java code; in fact, this AST can be seen as an acyclic CFG: for example, an if node contains the condition of the if and has two children (one for each truth value of the condition), while a variable assignment in a sequence is a node containing the assignment and has one child.
The method used in LASE considers the changes first, and then tries to add the context. Thus, in the output, the modified patterns tend to all be taken into account, but the context tends to be minimized. In a sense, LASE's inference is less sophisticated than Spdiff's:
• LASE only abstracts names of variables (u$0, v$0), whereas Spdiff can abstract expressions;
• LASE only produces patches that capture contiguous pieces of code and that apply at most once per method declaration.
Kim et al. do not provide any theoretical insight comparable to Spdiff's preliminary work, but they include an empirical study of LASE's performance based on 24 example semantic patches, which suggests that this workflow is effective. In particular, LASE's semantic patches are not safe, because the precision of the tool is not 100%. Nevertheless, conciseness seems to be quite good, as the recall scores are close to 100%.
Any attempt at producing an inference tool depends on a higher-level transformation specification language, and the inference features are limited by the features of that language. SmPL is therefore a good choice as a transformation specification language, since it already offers a vast choice of constructs for fine-tuning a transformation specification. Indeed, SmPL and Coccinelle have already been extensively used on the Linux kernel to perform collateral evolutions, as shown by Lawall et al. and by the significant collection of semantic patches available on the Coccinellery website. Using SmPL as the output language of our tool Spinfer is thus a logical choice.
The strong points of Spdiff are its context detection and the sequentiality of patterns enabled by the … in the semantic patches it produces, which has no equivalent in LASE. The strong point of LASE is its efficiency, since it uses sequence- and token-based matching to identify the common patterns. The goal of the new tool we propose, Spinfer, is to combine the strengths of these two approaches: using an efficient workflow inspired by LASE to produce semantic patches in SmPL with the same expressiveness as Spdiff's. Rather than focusing only on safety or conciseness, Spinfer's approach is result-oriented, in the sense that its primary objective is to always produce a semantic patch that accounts for the modifications in its input. If it cannot produce a safe semantic patch, it outputs an unsafe patch along with information on why it is unsafe.
A mixed node-graph tree-based semantic patch inference
The title of this section sums up the characteristics of Spinfer's semantic patch inference: it works on the CFG and considers the CFG nodes as ASTs, and the main inference is tree-based since it is performed on the CFG nodes. While reading this section, one can refer to Figure 1 for an overview of the different phases of Spinfer's workflow.
When analyzing the input code, we have to choose whether to represent it as an Abstract Syntax Tree (AST), a Control Flow Graph (CFG) or another data structure. This choice has a great impact on what can be analysed. Indeed, because LASE works exclusively with the AST, it cannot offer the … feature of Spdiff, which requires the use of a CFG, since the AST does not capture the semantics of execution paths. For this reason, Spinfer uses the CFG as its primary data structure to represent the code it analyses. The nodes of the CFG roughly correspond to the instructions of the C code; for instance, an if instruction has its own node (the condition of the if is stored in the node) and two successors in the graph, one for the case where the condition is true and the other for when it is false. A for instruction has one node for itself, one successor which is the first instruction of the body of the loop, and two predecessors: the instruction before the for in the code and the last instruction of the body of the loop.
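The CFG shapes just described can be sketched with hand-built nodes (a real front end would of course derive them from the parsed C source):

```python
class Node:
    """A CFG node: a code label plus its successor edges."""
    def __init__(self, label):
        self.label = label
        self.succs = []            # successor nodes in the CFG

    def link(self, *nodes):
        self.succs.extend(nodes)

# if (c) A; else B;  -- the condition node has two successors
cond, a, b, join = Node("if (c)"), Node("A;"), Node("B;"), Node("after if")
cond.link(a, b)                    # true branch, false branch
a.link(join)
b.link(join)
assert [n.label for n in cond.succs] == ["A;", "B;"]

# for (...) body;  -- the body loops back to the for node, which therefore
# has two predecessors: the preceding instruction and the last body node
before, head, body = Node("x = 0;"), Node("for (...)"), Node("body;")
before.link(head)
head.link(body)
body.link(head)                    # back edge closing the loop
assert head.succs == [body] and body.succs == [head]
```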
Because we analyse modified code, we actually need to build two CFGs: one for the code "before" and one for the code "after". The problem is that by keeping two CFGs to represent the modified code, the modifications are not aligned: we do not know which nodes are present in both CFGs and which nodes have been modified. This is the same problem as with a modified source code file: given the two versions of the file, we want to know which lines have been modified. Our solution, described in Section 2.1.2, is to use the output of diff executed on the two source files from which the two CFGs were built to merge the "before" and "after" CFGs.
Furthermore, we want to make our data structure more compact to reduce as much as possible the cost of the later pattern identification algorithm, and to avoid the pitfall of Spdiff not terminating in a reasonable amount of time. Indeed, the Linux driver source files passed as input to the program are typically around 1,000 LOC long. Our solution, described in Section 2.1.3, is to use a code clone detector to remove all the nodes of the CFG that we expect will not be part of the context of any semantic patch.
The input of our program is a set of pairs of source code files. Each pair consists of the two versions, "before" and "after" modification, of the same source code file. In the next sections, we describe how one of these pairs is handled; the same operations are repeated for every pair of files given as input to Spinfer.
Identifying and aligning the differences between two source code files is non-trivial. We prefer to leave this task to the traditional UNIX diff, which indicates the line numbers that have been modified in the code "before" and "after". Using the information gathered during the parsing of both versions of the code and the result of diff, we can then tag the nodes of both CFGs with the following rules:
• if a node in the “before” CFG is associated with a line that diff reports as being removed, we tag the node with -;
• if a node in the “after” CFG is associated with a line that diff reports as being added, we tag the node with +;
• if a node is not associated with any line that diff reports as being removed or added, we tag the node with 0.
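These tagging rules can be sketched as follows, under the assumption that each CFG node records the source line it was built from and that diff gives us the sets of removed and added line numbers:

```python
def tag_nodes(nodes, changed_lines, tag):
    """Tag each node (a (line, label) pair) with '-' or '+' when diff
    reports its source line as removed/added, and with '0' otherwise."""
    return [(line, label, tag if line in changed_lines else "0")
            for line, label in nodes]

# "Before" CFG nodes, with diff reporting line 2 as removed:
before_nodes = [(1, "a = f(x);"), (2, "return -ENOMEM;"), (3, "g(a);")]
tagged = tag_nodes(before_nodes, {2}, "-")
assert tagged == [(1, "a = f(x);", "0"),
                  (2, "return -ENOMEM;", "-"),
                  (3, "g(a);", "0")]
```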
Next, we merge the two CFGs into a single one, based on this tag information. The underlying assumption of this algorithm is that since nodes tagged with 0 are not modified, they appear in both CFGs but should appear only once in the merged CFG. On the other hand, if a line of code is modified, the corresponding node is tagged with - in the "before" CFG and with + in the "after" CFG. Both versions of the node then appear in the merged CFG, each version linked to its correct predecessors and successors, which also appear in the merged CFG, whether they are modified or not. Figure 2b illustrates the behavior of this algorithm.
The algorithm is implemented as a single parallel traversal of both the "before" and "after" CFGs (see Algorithm 1). The general principle is the following: as long as we traverse nodes that are not modified, the parallel traversal is synchronized between the two graphs. When, for instance, we encounter a removed node in the "before" CFG that is (by definition) not present in the "after" CFG, we pause our traversal of the "after" CFG while continuing in the "before" CFG until we reach, in the "before" CFG, the node at which we stopped in the "after" CFG. We are sure to reach this point because if the node we stopped at in the "after" CFG is not modified, then it also appears in the "before" CFG. Then, we resume the synchronized parallel traversal.
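The general principle can be sketched for the simplified case of straight-line code, where each CFG degenerates into a sequence of (label, tag) nodes; the real algorithm (Algorithm 1) performs the same synchronized walk over two graphs rather than two lists.

```python
def merge(before, after):
    """Merge two tagged node sequences; tags are '-', '+' or '0'.
    Assumes every '0' node appears, in order, in both sequences."""
    merged, i, j = [], 0, 0
    while i < len(before) or j < len(after):
        if i < len(before) and before[i][1] == "-":
            merged.append(before[i]); i += 1          # only in "before"
        elif j < len(after) and after[j][1] == "+":
            merged.append(after[j]); j += 1           # only in "after"
        else:
            merged.append(before[i]); i += 1; j += 1  # unmodified: emit once
    return merged

before = [("a;", "0"), ("old;", "-"), ("b;", "0")]
after  = [("a;", "0"), ("new;", "+"), ("b;", "0")]
assert merge(before, after) == [("a;", "0"), ("old;", "-"),
                                ("new;", "+"), ("b;", "0")]
```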
Another solution for aligning the differences between the CFGs would have been to use a tree-differencing algorithm such as GumTree by Falleri et al. to align the differences between the ASTs of the code "before" and "after", and then to transform this merged AST into a merged CFG.
We now have a merged CFG for every modified source code file. But in the worst-case scenario, where each file features only one occurrence of the modification (for instance two modified LOC out of a thousand), we would pass on to our later analysis a significant amount of code that is not related at all to the semantic patch we want to produce. Our goal is to get rid of this irrelevant code while keeping, in addition to the modified code, all the code that might be part of the context of a semantic patch. We assume that the semantic patch we want to produce matches several occurrences in the code we have; consequently, the relevant context code should be repeated several times throughout our code base, and will therefore appear in the output of a code clone detector. The same idea is used by Kim et al. to produce an over-approximation of the context of a modification.