Automating Assembly Code Matching

Get Complete Project Material File(s) Now! »

Error Report Debugging

An error report is a text message generated by software at the time of failure. The message contains some predefined strings, and more importantly, the values of some parameters concerning the running status of software. An error report serves two purposes: First, it notices users about the status of a running software; Second, it provides an important piece of information for developers to conduct software debugging. Out of the first purpose, the capability of generating error report is necessary for all software. And it is also practical to have the capability for both users and developers. Not surprisingly, the practise of generating, collecting and debugging error reports is universally adopted in all operating systems, e.g. Windows, Macintosh and Linux.
Windows Error Reporting (WER) [44, 97] is a post-mortem error debugging system that collects and processes error reports generated by the Windows operating system as well as over 7000 third-party applications from a billion machines. WER consists of both a client and a server service. The client service runs on users’ machines, monitoring the status of Windows operating system as well as applications on top of the system. Once an error occurs, the client service intercepts and reports the error message to the server service. Before sending, the error message is hashed and compressed, in order to avoid duplicated reports and to reduce the traffic load.
Due to the wide adoption of Windows system, WER receives millions of reports daily, which requires enormous engineering force to be able to fully take care of. Thus, one of the primary goals of WER is to help programmers prioritize and debug the most urgent issues reported from users. For this, WER groups the error reports into buckets based on a series of criteria, such as the crashing program name, the exception code, and the trapping instruction offset. Ideally, the error reports within a bucket correspond to a specific bug, or at least correlated bugs. The strategy of bucketing is essential to WER, as it helps developers concentrate their debugging effort. Later, a tool named ReBucket [35] proposed another method of bucketing for WER, which is to cluster crash reports based on call trace.
Similarly, on the Linux platforms, there is a kind of error report, called kernel oops, which is generated during the Linux kernel crash. Unlike WER, there is no systematic way on the Linux platforms to gather and analyze kernel oopses. Kernel oopses are kept on log files in the client machine. Each Linux distribution (e.g. Debian, Fedora) has its own tool set to extract and submit kernel oopses. In addition, each distribution has its own repository to collect kernel oopses.

Spectrum-Based Fault Localization

We first explain the basic idea behind spectrum-based localisation techniques. As examples, we then review two state-of-the-art techniques (Tarantula [64], Ochiai [5]) in detail.
Definition 2.4 (Program Element). The source code of a program P can be represented as a set of basic elements, where each element ei can be a statement, a block from the control flow graph, a function, a component, etc., i.e. P = {e1, e2,…en}.
Definition 2.5 (Faulty Element). A faulty element f is an element from a faulty program P, i.e. f 2 P, where the fault resides.
Definition 2.6 (Program Spectrum). A program spectrum [109] S is a mapping between a set of program elements denoted as E = {e1, e2,…en} and the result of a specific execution of the program (i.e. a passed or failed test case) denoted as R, i.e. S =hE,Ri.
A program spectrum is usually generated and collected after the execution of a test case. It can then be viewed as an execution trace of program, which represents the behaviours of the program under a certain circumstance. A collection of program spectra can give a global view of the behaviour of a program.
We give an example of a kind program spectrum called a program hit spectrum in Figure 2.1. On the left side, there is a matrix where each row corresponds to the involvement of each program element in an execution. The values in each column i represent the involvement of the element ei in the executions. For instance, the value at the starting position (1, 1) of the matrix is 1, which says that element e1 is involved in the execution that produces E1, while the value to its right at (1, 2) indicates that element e2 is NOT involved in the execution that produces E1. On the right side of Figure 2.1, there is a vector that represents the results (passed or failed) of each execution. For example, the value R1 is 0, which indicates that the execution that produces E1 passes without any error. While the value R4 is 1, which indicates the the execution that produces E4 fails. There are other forms of program spectra, which one can find in the empirical study conducted by Harrold et al. [52].

Global Sequence Alignment Algorithms

Given the above definitions, we first give the algorithms to calculate the optimal global sequence alignment. Since the algorithm to calculate the optimal local sequence alignment is quite similar with the global one, we explain it afterwards.
In order to explain the global sequence alignment algorithm, we introduce the function Optimal Prefix Alignment Value V(i, j) below.
Definition 2.20 (Optimal Prefix Alignment Value). Given two sequences S1 and S2, and a specific sequence similarity function ≠, we define the optimal prefix alignment value V(i, j) 2 R (1 Σ iΣ|S1|)^(1 Σ jΣ|S2|) , as the value of the optimal global sequence alignment between the prefixes cS1 = S1[1…i] and cS2 = S2[1… j], i.e. V(i, j) =≠(OG(S1[1…i],S2[1… j])).
The problem of finding the optimal global sequence alignment between two sequences S1 and S2, often reduces to calculating V(|S1|,|S2|), the optimal prefix alignment value between S1 and S2 (a sequence is a prefix of itself). The value of V can be calculated in a recursive manner, following the formulas below: (2.1) V(0,0) = 0.

READ  An anthropological exploration of soundscape in zoos: exoticism as a mediator of everyday experiences of nature

Applications of Sequence Alignments

We now discuss the origin and the applications about the concept of sequence alignment. Global sequence alignment was first proposed in 1970 by Needleman & Wunsch [100] in biology, in order to search for similarities in the amino acid sequence of two proteins. From the similarities of amino acid sequences, it is then possible to determine whether significant homology of function exists between the proteins. The homology implies the possible evolution of proteins. Later in 1981, Smith & Waterman [39] introduced the problem of local sequence alignment in biology. Both alignments are implemented based on the systematic study of dynamic programming by Belleman [25] in 1957. Finally, a number of authors have studied the question of how to construct a good scoring function for sequence comparison, to meet the matching criteria of different scenarios [11, 68].
Approximate string matching is another form of sequence alignment, which is the technique of finding strings that matches a pattern approximately (rather than exactly). Formally, approximate string matching can be defined as: Given a pattern string P = p1p2…pm and a text string T = t1t2…tn, find a substring Ti, j = ti ti+1…t j in T, which, of all substrings of T, has the smallest edit distance to the pattern P. One common instance of edit distance, called Levenshtein distance [82], is the minimal number of single-character edits (i.e. insertion, deletion or substitution) required to change one string into the other. Approximate string matching with Levenshtein distance as measure can reduce into local sequence alignment, since the deletion and insertion operations of string matching are equivalent to gap matches in sequence alignment and the substitution is equivalent to a mismatch.

Key Features of a Kernel Oops

A kernel oops [123] consists of the information logged by the Linux kernel when a crash or warning occurs. It is composed mostly of attribute: value pairs, in which each value records a specific piece of information about certain attribute of the system. The format and content of a kernel oops depend somewhat on the cause of the oops and the architecture of the victim machine. We focus on the x86 architecture (both 32-bit and 64-bit), which currently represents almost all of the oopses found in the kernel oops repository.
Figure. 3.1 shows a (slightly simplified) sample kernel oops1 from the repository. This example illustrates a typical kernel oops generated by x86 code in the case of a runtime exception. Some values are omitted (marked with …) for brevity. We highlight the key features, and explain them as follows: Oops description A kernel oops begins with a description of the cause of the oops. Our oops was caused by NULL-pointer dereference (line 1).
Error site The IP (Instruction Pointer) field indicates the name of the function being executed at the time of the oops, the binary code offset at which the oops occurred in that function, and the binary code size of that function. This information is also indicated by the value of the registers EIP for the 32-bit x86 architecture and RIP for the 64-bit x86 architecture. Lines 2, 9 and 26 each show that in our example oops, the error occurred within the function anon_vma_link.

Table of contents :

List of Tables
List of Figures
1 Introduction 
1.1 Motivation
1.2 Research Problem
1.2.1 Kernel Oops Comprehension
1.2.2 Kernel Oops Debugging
1.3 Contribution
1.4 Organization
2 Background 
2.1 Software Debugging
2.1.1 Bug Reproduction
2.1.2 Error Report Debugging
2.2 Software Fault Localization
2.2.1 Spectrum-Based Fault Localization
2.2.2 Program Slicing
2.3 Sequence Alignment
2.3.1 Definitions
2.3.2 Global Sequence Alignment Algorithms
2.3.3 Example
2.3.4 Local Sequence Alignment Algorithm
2.3.5 Applications of Sequence Alignments
2.4 Summary
3 All About Kernel Oops 
3.1 Background
3.1.1 Key Features of a Kernel Oops
3.1.2 Types of Kernel Oopses
3.1.3 Workflow around Kernel Oopses
3.2 Properties of the Raw Data
3.3 Correlation with External Information
3.4 Features Related to Kernel Reliability
3.5 Threats to Validity
3.6 Conclusion
4 Kernel Oops Debugging 
4.1 State-of-the-Art
4.1.1 Revisit Kernel Oops
4.1.2 Pinpointing the Offending Line
4.1.3 Properties of Oopsing Functions
4.2 Automating Assembly Code Matching
4.2.1 Definitions
4.2.2 Anchored Sequence Matching
4.2.3 Anchored Alignment
4.2.4 Optimization
4.3 Design Decisions
4.3.1 Anchor Point Selection
4.3.2 Scoring Function Design
4.3.3 Breaking Ties
4.3.4 Input Normalization
4.4 Evaluation
4.4.1 Experimental Data
4.4.2 Experimental Settings
4.4.3 Performance Benchmarking
4.4.4 Failure Case Analysis
4.5 Threats to Validity
4.6 Conclusion
5 Related Work 
5.1 Clone Detection
5.1.1 Definitions
5.1.2 Source Code Clone Detection
5.1.3 Binary Code Clone Detection
5.2 Summary
6 Conclusion 
6.1 Future Work
A Appendix A 
A.1 Introduction
A.1.1 Motivation
A.1.2 Problème de Recherche
A.1.3 Contribution
A.1.4 Organisation
A.2 Débogage de Kernel Oops
A.2.1 Définitions
A.2.2 L’alignment Ancré des Séquences
A.2.3 L’alignment Ancré
A.3 Conclusion
A.3.1 Perspectives


Related Posts