Automating Assembly Code Matching


Error Report Debugging

An error report is a text message generated by software at the time of a failure. The message contains some predefined strings and, more importantly, the values of parameters describing the running status of the software. An error report serves two purposes: first, it notifies users about the status of the running software; second, it provides an important piece of information for developers to use in debugging. The first purpose makes the capability of generating error reports necessary for all software, and it is practical for both users and developers to have this capability. Not surprisingly, the practice of generating, collecting and debugging error reports is universally adopted in all major operating systems, e.g. Windows, Macintosh and Linux.
Windows Error Reporting (WER) [44, 97] is a post-mortem error debugging system that collects and processes error reports generated by the Windows operating system, as well as by over 7000 third-party applications, from a billion machines. WER consists of a client service and a server service. The client service runs on users’ machines, monitoring the status of the Windows operating system as well as of the applications running on top of it. Once an error occurs, the client service intercepts the error message and reports it to the server service. Before sending, the error message is hashed and compressed, to avoid duplicate reports and to reduce the traffic load.
Due to the wide adoption of the Windows system, WER receives millions of reports daily, far more than any engineering force could fully process. Thus, one of the primary goals of WER is to help programmers prioritize and debug the most urgent issues reported by users. To this end, WER groups error reports into buckets based on a series of criteria, such as the crashing program name, the exception code, and the trapping instruction offset. Ideally, the error reports within a bucket correspond to a specific bug, or at least to correlated bugs. The bucketing strategy is essential to WER, as it helps developers concentrate their debugging effort. Later, a tool named ReBucket [35] proposed another bucketing method for WER, which clusters crash reports based on their call traces.
Similarly, on Linux platforms, there is a kind of error report, called a kernel oops, that is generated when the Linux kernel crashes. Unlike WER, there is no systematic way on Linux platforms to gather and analyze kernel oopses. Kernel oopses are kept in log files on the client machine. Each Linux distribution (e.g. Debian, Fedora) has its own tool set to extract and submit kernel oopses, and its own repository to collect them. It is thus not straightforward to collect a large number of error reports from the Linux kernel for study. Another reason is that the Linux kernel is relatively stable: the quality of the Linux kernel code has improved significantly, e.g., the number of faults per line has dropped over the past 20 years [32, 103], so it is relatively rare to observe a crash of the Linux kernel. Some researchers have therefore applied a technique called fault injection to artificially inject faults into the software system. For instance, Gu et al. [46] performed a series of fault injection experiments on version 2.4.20 of the Linux kernel, to simulate kernel crashes. Their goal was to analyze and quantify the response of the Linux kernel in various crashing scenarios. More recently, Yoshimura et al. [132] also conducted an empirical study on the Linux kernel with fault injection. Their goal was to quantify the damage that could be incurred when the Linux kernel crashes.
Other than the above work, to our knowledge, there is little work that studies kernel oopses from the real world, rather than ones generated through fault injection. In September 2013, the kernel oops repository [12], which can receive kernel oops reports from all Linux deployments, was revived by Red Hat. The repository opens a new opportunity for a systematic study of kernel oopses. We intend to fill this gap with this thesis.

Software Fault Localization

This thesis addresses the issue of debugging Linux kernel oopses, which ultimately means helping kernel developers quickly locate the faults in the Linux kernel that cause the oopses. Therefore, in this section, we cover the general background of software fault localization and present some software fault localization techniques. First, we define the basic and essential terminology [14] of this domain as follows:
Definition 2.1 (Failure). A failure is an event that occurs when the delivered service deviates from the correct service.
A failure is the phenomenon observed by the users. A fault is the defect residing in the source code that eventually leads to the failure of the system. During the manifestation of a failure, the system encounters some errors.
We illustrate the relationship between these three concepts in the following figure: Fault ⇢ Error ⇢ Failure. The dashed arrow (⇢) indicates a causal relationship, i.e. the concept on the left-hand side triggers the occurrence of the concept on the right-hand side. The relationship is not strict, however, which is why we draw it dashed: the presence of the left one does not always imply the occurrence of the right one. For instance, the existence of faults in a software system does not always cause the system to fail. Most failures only manifest in certain corner cases.

Spectrum-Based Fault Localization

We first explain the basic idea behind spectrum-based fault localization techniques. As examples, we then review two state-of-the-art techniques (Tarantula [64], Ochiai [5]) in detail.
Definition 2.4 (Program Element). The source code of a program P can be represented as a set of basic elements, where each element ei can be a statement, a block from the control flow graph, a function, a component, etc., i.e. P = {e1, e2, …, en}.
Definition 2.5 (Faulty Element). A faulty element f is an element of a faulty program P, i.e. f ∈ P, where the fault resides.
Definition 2.6 (Program Spectrum). A program spectrum [109] S is a mapping between a set of program elements, denoted E = {e1, e2, …, en}, and the result of a specific execution of the program (i.e. a passed or failed test case), denoted R, i.e. S = ⟨E, R⟩.
A program spectrum is usually generated and collected after the execution of a test case. It can then be viewed as an execution trace of the program, which represents the behaviour of the program under a certain circumstance. A collection of program spectra gives a global view of the behaviour of a program.
We give an example of a kind of program spectrum called a program hit spectrum in Figure 2.1. On the left side, there is a matrix where each row records the involvement of each program element in one execution. The values in column i represent the involvement of element ei across the executions. For instance, the value at the starting position (1, 1) of the matrix is 1, which says that element e1 is involved in the execution that produces E1, while the value to its right, at (1, 2), indicates that element e2 is NOT involved in the execution that produces E1. On the right side of Figure 2.1, there is a vector that represents the result (passed or failed) of each execution. For example, the value R1 is 0, which indicates that the execution that produces E1 passes without any error, while the value R4 is 1, which indicates that the execution that produces E4 fails. There are other forms of program spectra, which one can find in the empirical study conducted by Harrold et al. [52].
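Given such a hit-spectrum matrix and result vector, Tarantula and Ochiai each assign a suspiciousness score to every program element. The following Python sketch computes both scores from a 0/1 involvement matrix; the function and variable names are our own, and the formulas follow the common formulations of these metrics in the literature rather than any notation specific to this thesis.

```python
# Sketch: Tarantula and Ochiai suspiciousness from a program hit spectrum.
# spectrum: rows = executions, columns = program elements (1 = involved).
# results: results[k] = 1 if execution k failed, 0 if it passed.
from math import sqrt

def suspiciousness(spectrum, results):
    """Return (tarantula, ochiai) score lists, one score per element."""
    total_failed = sum(results)
    total_passed = len(results) - total_failed
    tarantula, ochiai = [], []
    for col in range(len(spectrum[0])):
        # Count failing/passing executions that involve element `col`.
        failed = sum(1 for row, r in zip(spectrum, results) if row[col] and r)
        passed = sum(1 for row, r in zip(spectrum, results) if row[col] and not r)
        fail_ratio = failed / total_failed if total_failed else 0.0
        pass_ratio = passed / total_passed if total_passed else 0.0
        # Tarantula: share of failing involvement relative to all involvement.
        t = fail_ratio / (fail_ratio + pass_ratio) if fail_ratio + pass_ratio else 0.0
        # Ochiai: geometric-mean style ratio, higher when an element appears
        # mostly in failing runs.
        denom = sqrt(total_failed * (failed + passed))
        o = failed / denom if denom else 0.0
        tarantula.append(t)
        ochiai.append(o)
    return tarantula, ochiai
```

For example, an element executed only by failing runs receives the maximal score 1.0 under both metrics, which is what makes these formulas useful for ranking candidate faulty elements.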


Global Sequence Alignment Algorithms

Given the above definitions, we first give the algorithm to calculate the optimal global sequence alignment. Since the algorithm to calculate the optimal local sequence alignment is quite similar to the global one, we explain it afterwards.
In order to explain the global sequence alignment algorithm, we introduce the function Optimal Prefix Alignment Value V(i, j) below.
Definition 2.20 (Optimal Prefix Alignment Value). Given two sequences S1 and S2, and a specific sequence similarity function φ, we define the optimal prefix alignment value V(i, j) ∈ ℝ (0 ≤ i ≤ |S1| ∧ 0 ≤ j ≤ |S2|) as the value of the optimal global sequence alignment between the prefixes Ŝ1 = S1[1…i] and Ŝ2 = S2[1…j], i.e. V(i, j) = φ(OG(S1[1…i], S2[1…j])).
The problem of finding the optimal global sequence alignment between two sequences S1 and S2 thus reduces to calculating V(|S1|, |S2|), the optimal prefix alignment value between S1 and S2 (a sequence is a prefix of itself). The value of V can be calculated recursively, following the formulas below, whose base case is:

(2.1) V(0, 0) = 0
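The recursive cases of this dynamic program follow the standard Needleman–Wunsch recurrence: each V(i, j) extends a shorter optimal prefix alignment by either a match/mismatch or a gap, i.e. V(i, j) = max{V(i−1, j−1) + σ(S1[i], S2[j]), V(i−1, j) + σ(S1[i], −), V(i, j−1) + σ(−, S2[j])}. The Python sketch below implements this recurrence with an illustrative scoring function (match +1, mismatch −1, gap −1); these score values are assumptions for the example, not the scoring function designed later in this thesis.

```python
# Sketch: optimal global alignment value V(|S1|, |S2|) via the standard
# Needleman-Wunsch dynamic program. Scoring values are illustrative.
def global_alignment_value(s1, s2, match=1, mismatch=-1, gap=-1):
    n, m = len(s1), len(s2)
    # V[i][j] holds the optimal alignment value of S1[1..i] and S2[1..j].
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: aligning a prefix against the empty sequence costs gaps.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    # Recursive cases: extend by match/mismatch or by a gap on either side.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,   # match or mismatch
                          V[i - 1][j] + gap,       # gap in S2
                          V[i][j - 1] + gap)       # gap in S1
    return V[n][m]
```

The optimal alignment itself, rather than just its value, can be recovered by tracing back through the matrix from V(|S1|, |S2|), recording which of the three cases produced each cell.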

Applications of Sequence Alignments

We now discuss the origin and the applications of the concept of sequence alignment. Global sequence alignment was first proposed in 1970 by Needleman & Wunsch [100] in biology, in order to search for similarities in the amino acid sequences of two proteins. From the similarities of the amino acid sequences, it is then possible to determine whether significant homology of function exists between the proteins. Such homology suggests a possible common evolution of the proteins. Later, in 1981, Smith & Waterman [39] introduced the problem of local sequence alignment in biology.
Both alignments are implemented based on the systematic study of dynamic programming by Bellman [25] in 1957. Finally, a number of authors have studied the question of how to construct a good scoring function for sequence comparison, to meet the matching criteria of different scenarios [11, 68].
Approximate string matching is another form of sequence alignment; it is the technique of finding strings that match a pattern approximately (rather than exactly). Formally, approximate string matching can be defined as: given a pattern string P = p1p2…pm and a text string T = t1t2…tn, find a substring Ti,j = ti ti+1…tj of T which, of all substrings of T, has the smallest edit distance to the pattern P. One common instance of edit distance, called the Levenshtein distance [82], is the minimal number of single-character edits (i.e. insertions, deletions or substitutions) required to change one string into the other. Approximate string matching with the Levenshtein distance as the measure reduces to local sequence alignment, since the deletion and insertion operations of string matching are equivalent to gap matches in sequence alignment, and substitution is equivalent to a mismatch.
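The Levenshtein distance is itself computed with a dynamic program of the same shape as the alignment recurrences above, minimizing edit cost instead of maximizing alignment score. The sketch below is a compact two-row formulation in Python; the function name is ours, and it computes the distance between two whole strings (the substring-search variant differs only in its base case).

```python
# Sketch: Levenshtein distance via the classic dynamic program,
# keeping only one row of the matrix plus the previous diagonal cell.
def levenshtein(p, t):
    """Minimal number of single-character insertions, deletions and
    substitutions required to change string p into string t."""
    m, n = len(p), len(t)
    d = list(range(n + 1))  # distances from the empty prefix of p
    for i in range(1, m + 1):
        prev, d[0] = d[0], i  # prev holds the diagonal cell d[i-1][j-1]
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1,      # deletion
                                   d[j - 1] + 1,  # insertion
                                   prev + cost)   # substitution or match
    return d[n]
```

For instance, "kitten" and "sitting" are at distance 3 (two substitutions and one insertion), the textbook example for this metric.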

Table of contents:

List of Tables
List of Figures
1 Introduction 
1.1 Motivation
1.2 Research Problem
1.2.1 Kernel Oops Comprehension
1.2.2 Kernel Oops Debugging
1.3 Contribution
1.4 Organization
2 Background
2.1 Software Debugging
2.1.1 Bug Reproduction
2.1.2 Error Report Debugging
2.2 Software Fault Localization
2.2.1 Spectrum-Based Fault Localization
2.2.2 Program Slicing
2.3 Sequence Alignment
2.3.1 Definitions
2.3.2 Global Sequence Alignment Algorithms
2.3.3 Example
2.3.4 Local Sequence Alignment Algorithm
2.3.5 Applications of Sequence Alignments
2.4 Summary
3 All About Kernel Oops 
3.1 Background
3.1.1 Key Features of a Kernel Oops
3.1.2 Types of Kernel Oopses
3.1.3 Workflow around Kernel Oopses
3.2 Properties of the Raw Data
3.3 Correlation with External Information
3.4 Features Related to Kernel Reliability
3.5 Threats to Validity
3.6 Conclusion
4 Kernel Oops Debugging 
4.1 State-of-the-Art
4.1.1 Revisit Kernel Oops
4.1.2 Pinpointing the Offending Line
4.1.3 Properties of Oopsing Functions
4.2 Automating Assembly Code Matching
4.2.1 Definitions
4.2.2 Anchored Sequence Matching
4.2.3 Anchored Alignment
4.2.4 Optimization
4.3 Design Decisions
4.3.1 Anchor Point Selection
4.3.2 Scoring Function Design
4.3.3 Breaking Ties
4.3.4 Input Normalization
4.4 Evaluation
4.4.1 Experimental Data
4.4.2 Experimental Settings
4.4.3 Performance Benchmarking
4.4.4 Failure Case Analysis
4.5 Threats to Validity
4.6 Conclusion
5 Related Work 
5.1 Clone Detection
5.1.1 Definitions
5.1.2 Source Code Clone Detection
5.1.3 Binary Code Clone Detection
5.2 Summary
6 Conclusion 
6.1 Future Work
A Appendix A 
A.1 Introduction
A.1.1 Motivation
A.1.2 Research Problem
A.1.3 Contribution
A.1.4 Organization
A.2 Kernel Oops Debugging
A.2.1 Definitions
A.2.2 Anchored Sequence Matching
A.2.3 Anchored Alignment
A.3 Conclusion
A.3.1 Future Work

