The challenges of analyzing dynamic programming languages

Get Complete Project Material File(s) Now! »

Combining string length and summarization

Let us assume we run a modified version of the string search program (Listing 2.4), where we search for a character c > 100 in the string s, consisting in a random string containing only the characters « a » to « d ». We would like the analysis to infer that the search will be unsuccessful and that i = len(s). This is possible by combining the string length and string summarization domains in a reduced product on top of a relational abstract domain. We will need to introduce reduction rules for the evaluation of s[i]. Intuitively, the length domain will be able to prove that the accesses are valid, and the summarization one will give a more precise result based on the contents of the string.
There is, however, an issue related to the concretizations. Let us assume that our program works over a string s. If both abstract domains are used and share the underlying numerical abstract domain, this latter domain will contain variables len(s) and ord(s). The concretization of the string length domain will provide a concrete state over s and ord(s) (since it preserves variables it does not handle). The concretization of the string summarization domain will provide a concrete state over s and len(s). In their current form, the concretizations cannot be combined. It would be tempting the drop the auxiliary variables and intersect the resulting concrete state, but this approach introduces imprecisions, as shown in the example below.

Abstract syntax tree (AST)

Similarly to Astrée, Mopsa performs an analysis by induction on the syntax, rather than by iterating over a control-flow graph (the approaches were compared in Remark 2.68). Mopsa uses a single AST, which is extended when needed to support new languages. This way, it keeps the high-level structure of parsed programs. A downside of this approach is the high number of AST nodes an analysis needs to support. This is mitigated by dynamic translations from language-specific AST nodes to language-agnostic ones (e.g., Python and C loops shown in Section 3.2.4), which will be analyzed by so-called “universal” abstract domains. For now, Mopsa supports the following languages :
• a large subset of Python 3, using a dedicated parser originally developed by Fromherz et al. [54].
• most of C, using Clang’s parser.
• a contract annotation language close to ACSL [7] used to model library functions for C programs [121].
• universal, a simple language close to the one described in Chapter 2.
We illustrate the inner workings of our approach on the case of loop statements. We start by defining elementary expressions and statements used in our examples (including support for basic while loops of Imp). We explain how a domain would compute an approximate fixpoint for these loops. We extend our approach to the cases of Python and C loops.

Elementary expressions and statements

The syntactic elements of Mopsa include variables, constants, types, binary operators, expressions and statements. All these types are defined as records consisting of multiple fields. The main field of interest is the kind, which is an extensible variant type, meaning that any OCaml module can further extend it. Additional fields include the program ranges for expressions and statements, and the type for variables and expressions.

A domain handling while loops

We show excerpts of the domain computing loop fixpoints in Listing 3.5. The next section (3.3) will explain in detail the signature of abstract domains. exec defines the local transfer function over statements for this domain. It takes three arguments: the statement to analyze, a manager referencing the global transfer functions, and the analysis’ state.
The transfer function of S_while calls the lfp function. Then, it calls recursively the whole analysis (using man.exec, described later in Section 3.3.2.2) to filter the reached fixpoint by the negation of the loop’s condition lfp performs an accelerated fixpoint computation. It starts by computing the effects of the loop’s condition and body on the input state, by calling recursively the whole analysis. mk_block and mk_assume are helper functions creating statements whose stmt_kind is respectively S_block and S_assume. It the result is stable, we return the previously computed state5. Otherwise, lfp calls itself recursively, on a widened state.
We can intuitively notice that the analysis progresses: recursive calls are performed on strict sub-nodes of the S_while statement. This argument would also be used to prove termination of the analysis.

Table of contents :

Abstract
Résumé
Acknowledgments & Remerciements
I Background
1 Introduction
1.1 Software bugs and what can be done about it
1.1.1 A first approach: testing
1.1.2 An impossibility theorem
1.1.3 Deductive program verification
1.1.4 Symbolic execution
1.1.5 Model checking
1.1.6 Static analysis by abstract interpretation
1.2 The challenges of analyzing dynamic programming languages
1.3 Contributions & outline
2 Static Analysis by Abstract Interpretation
2.1 A Toy Imperative Language, Imp
2.2 Semantics of Imp
2.2.1 Semantics of expressions
2.2.2 Semantics of statements
2.2.3 Comparing states
2.3 Inferring ranges of Imp variables
2.3.1 The interval domain
2.3.2 Concretization
2.3.3 Interval transfer functions
2.3.4 Abstract semantics of expressions
2.3.5 Abstract semantics of statements
2.3.6 Basic statements
2.3.7 Conditionals
2.3.8 Terminating loop analyses
2.3.9 Improving the analysis with congruences
2.3.9.1 Deriving the semantics
2.3.9.2 Cooperation between congruences and intervals
2.4 Extending Imp and its analyses
2.4.1 Extending Imp with strings
2.4.2 Using ghost variables to track string length
2.4.3 Breaking out of a loop
2.4.4 Relational invariants
2.4.5 Summarization of string content
2.4.6 Combining string length and summarization
2.5 Defining modular concretization functions
2.5.1 Generic approach
2.5.2 String summary domain
2.5.3 String length domain
2.5.4 Combining both concretizations
2.6 Conclusion
II Base Abstractions
3 Mopsa
3.1 Related work
3.1.1 Infer
3.1.2 TAJS
3.1.3 Frama-C
3.1.4 Astrée
3.1.5 Framework of Keidel et al.
3.2 Abstract syntax tree (AST)
3.2.1 Elementary expressions and statements
3.2.2 A domain handling while loops
3.2.3 Extending the AST with Python and C loops
3.2.4 Dynamically rewriting Python and C loops
3.3 Domains
3.3.1 Defining analyses by combining domains
3.3.2 Domain signature
3.3.2.1 Domain type and lattice operations (lines 24-33)
3.3.2.2 The need for a manager (lines 2-15)
3.3.2.3 Flow, wrapper of the global abstract state
3.3.2.4 Cases, postconditions and evaluations (lines 18-19)
3.3.2.5 Transfer functions on expressions and statements (lines 36-38) 60
3.3.2.6 Utilities (lines 41-44)
3.3.3 The simplified case of non-relational domains
3.3.4 Reduced products and their pitfalls
3.3.5 Communication between domains
3.4 Hooks
3.5 Formalization
3.6 Conclusion
4 Abstracting Dynamic Memory Allocation
4.1 The recency abstraction
4.1.1 Motivation
4.1.2 Concrete semantics
4.1.3 The recency abstraction
4.1.4 Abstract semantics
4.1.5 Concretizations
4.1.5.1 Recency abstraction
4.1.5.2 Heap abstraction
4.2 Variable policies for the recency abstraction
4.3 Abstract garbage collection (AGC)
4.4 Related work
4.5 Conclusion
5 Abstracting Containers
5.1 Dynamic arrays
5.1.1 Array operations
5.1.2 Length abstraction
5.1.3 Summarization abstraction
5.1.4 Reduced product
5.1.5 Variation: abstracting sets
5.2 Dictionaries
5.2.1 Dictionary operations
5.2.2 Whole smashing
5.3 Related work
5.4 Conclusion
III Pure Python Programs
6 Concrete Semantics of Python
6.1 Concrete state
6.2 Core language
6.2.1 Literals
6.2.2 Variables
6.2.3 Nominal types
6.2.4 Structural types (attributes)
6.2.5 Subscript
6.2.6 Conditionals
6.2.7 Loops
6.2.8 Exceptions
6.2.9 With context manager
6.2.10 Function declaration
6.2.11 Class declaration
6.2.12 Decorators
6.2.13 Calls
6.2.14 Unary operators
6.2.15 Binary operators
6.2.15.1 Arithmetic operators
6.2.15.2 Comparison operators
6.2.16 Other binary operators
6.3 Builtin objects
6.3.1 Object
6.3.2 Functions and methods
6.3.3 Type
6.3.4 Booleans
6.3.5 Integers
6.3.6 Range objects
6.3.7 Containers
6.3.8 Iterators
6.3.9 super
6.3.10 Generators
6.4 Correctness
6.4.1 Tests from previous works
6.4.2 CPython’s tests
6.4.3 Summary of the conformance tests
6.5 Comparison with JavaScript
6.6 Related work
6.7 Conclusion
7 Type Analysis
7.1 Differences with a type system
7.2 Non-relational type analysis
7.2.1 Abstract addresses
7.2.2 Environment abstraction
7.2.3 Heap Abstraction
7.2.4 Additional abstractions
7.2.4.1 Flow tokens
7.2.4.2 Containers
7.2.4.3 Stateless abstractions close to the concrete semantics
7.2.5 Functions
7.2.6 Full abstraction
7.3 A relational reduced product bringing polymorphism
7.4 Interaction with Python’s type annotations
7.5 Implementation
7.5.1 Configuration
7.5.2 Optimizations & extensions
7.5.2.1 Exception abstraction
7.5.2.2 Towards a partially modular function analysis.
7.6 Experimental evaluation
7.6.1 Benchmarks
7.6.2 Comparison with other tools
7.6.2.1 Competing tools
7.6.2.2 Performance and precision
7.6.2.3 Soundness evaluation
7.6.2.4 Summary of the comparison
7.6.3 Impact of the allocation policy and of the abstract garbage collector
7.7 Related work
7.8 Conclusion
8 Value Analysis
8.1 Value analysis as a refinement of the type analysis
8.2 Experimental evaluation
8.2.1 Value-sensitivity
8.2.2 Allocation-site policy choice
8.2.3 Abstract garbage collector
8.2.4 Selectivity of the analysis
8.3 Scaling relational analyses using packing
8.4 Conclusion
IV Mixing Python and C
9 Interoperability Mechanisms between Python and C
9.1 A toy example using Python’s API
9.1.1 Counter module, viewed from Python
9.1.2 Counter, viewed from C
9.1.3 Module import
9.1.4 Class initialization
9.1.5 Counter creation
9.1.6 Counter increment
9.1.7 Counter access
9.1.8 Building the module
9.1.9 What can go wrong?
9.1.10 Common bugs at the boundary
9.2 Other Python/C interoperability mechanisms
9.2.1 Ctypes
9.2.2 Cffi
9.2.3 Swig
9.2.4 Cython
9.3 Conclusion
10 Concrete Multilanguage Semantics
10.1 Multilanguage state
10.1.1 Python state
10.1.2 C state
10.1.3 Combined state
10.1.4 Handling Python exceptions in C
10.2 Boundary functions
10.2.1 Python to C boundary
10.2.2 C to Python boundary
10.3 C call from Python
10.4 Python call from C
10.5 Builtins of the API
10.5.1 Integer conversions
10.5.2 Operations on containers
10.5.3 Generic converters: PyArg_ParseTuple, Py_BuildValue
10.5.4 Class initialization: PyType_Ready
10.6 Threats to validity
10.7 Related work
10.8 Conclusion
11 Multilanguage Value Analysis
11.1 Abstract domain
11.2 Transfer functions
11.2.1 Boundaries
11.2.2 C call from Python
11.2.3 Conversion from a C long to a Python integer
11.2.4 Tuple conversions
11.3 Examples
11.4 Concretization & soundness
11.5 Implementation
11.5.1 Configuration
11.5.2 Build setup
11.6 Experimental evaluation
11.6.1 Corpus selection
11.6.2 Analysis results
11.7 Related work
11.7.1 Native code analysis
11.7.2 Multilanguage analyses
11.7.3 Library analyses
11.8 Conclusion
V Conclusion & Future Work
12 Conclusion & Future Work
Bibliography