Why is Verification of Parallel Programs Hard?
First we consider the reasons why many researchers think that verification of parallel programs is hard:
• Added complexity; there is no magic method for designing a task decomposition and data distribution; if the wrong decomposition or distribution is chosen, poor performance results (due to too much communication between machines) and unexpected results can appear; for example, think of distributed graph algorithms  that mainly make the hypothesis of a sorted distribution of edges. Compared to the verification of sequential programs, there are three main inherent diﬃculties in verifying parallel programs:
1. There is not just one flow of execution but many (as there are multiple units of computation).
2. The interleaving of instructions can introduce data-races4 and unexpected results by the presence of non-determinism. A choice should be made between forbidding these cases (by raising a warning or an exception) or computing all possible traces of executions and results.
3. In distributed computations the use of a network for reading/writing values in diﬀerent mem-ories and primitives of communications are introduced. This makes the reading and the se-mantics of programs harder: synchronous/buﬀered sending can generate a deadlock5 and an unexpected result.
Tools for the verification of parallel codes exist but are not much used, mainly because they are usually inferior to their sequential counterparts. Most researchers on parallel computing are not accustomed to formal methods .
• Parallel programming is error-prone even for “Gurus” of parallel computing, some subtle errors can be introduced and only appear after a large number of program executions; for example, a lack of verification of the call of mpi’s collective operators mixed with asynchronous sending and multi-threading can crash the whole machine quickly . Parallel libraries are generally well documented but even experienced programmers can misunderstand these apis, especially when they are informally described.
• Too few powerful abstractions are available in the languages used for parallel program-ming . Parallel programming languages are often very low level , as each communication or synchronisation operation has to be managed in detail by the programmer. Verification be-comes much easier when the programmer can rely on powerful libraries to encapsulate complex behaviours — these generate less possible cases/traces of execution to prove. Work on algorithmic skeletons and high-level patterns is established but there is little knowledge of innovative parallel programming languages and tools for the verification of codes. Furthermore, it is diﬃcult to know when you can compose parallel codes and if the results are as intended. For example, how to call a mpi code within an openmp [[1 ]] one?
• Parallel code testing is hard. Many libraries (e.g. mpi, openmp, etc.) allow you to test your parallel programs on your personal computer by emulating parallelism using OS processes; but not all interleavings can be tested in this way and some combinations that generate a deadlock may be missed; only formal verification gives this assurance. Testing is a first step and without test try, it can be hard to know “where to go” in a proof of correctness.
We conclude that High Performance Computing (hpc) programming is hard because every detail of parallelism is left to the programmer. Large fractions of code are technical and program complexity becomes excessive, resulting in a truly incessant debugging and proof. Because hpc is widely used in research in science and engineering (these themes are present in hpc conferences), for the long term viability of this area it is absolutely essential that we have verification tools to help application developers gain confidence in their software.
Machine-checked Proof of Programs
Developing tools and methods to find logical bugs or to help programmers have “from the start” (or at a given time of the development) correct programs is an old but still pertinent area of research. For sequential (imperative, functional, object or real-time) algorithms and programs, many methods (and their associated tools) exist: programming through theorem proving, b method, model-checking, test, deductive verification, abstract interpretation, algebraic rules, etc.
A common and well accepted approach is conducting machine-checked proofs. These are formal deriva-tions in some given formal system, assuming that the underlying logic is coherent. Humans, even great mathematicians, are fallible. The use of a formal system for doing proofs on a machine forbids false proofs. For example, special cases that seem a priori trivially true (or very similar to other cases and therefore unattractive) are often forgotten about but later revealed to be false. The use of a formal sys-tem gives thus greater confidence in the results. Parallel programming tends to be much more complex than sequential computing. This means that rigorous program development techniques are all the more necessary.
Common Methods for Having Machine-checked Correct Programs
There are diﬀerent methods and tools for proving the correctness of programs and systems or for gen-erating machine-checked programs. Without being exhaustive, we present here some of the best known methods.
Testing, Debugging and Model-checking
This first class of methods we consider is generally not used for proving the correctness of programs but rather for finding bugs. Software testing  involves running a program for diﬀerent scenarios (inputs of the program) and comparing the traces of execution against a formal specification of expected results. Scenarios can be automatically extracted by a testing tool from the program (for example, when there is a division to test the program against a division by zero) or given by hand by the programmer. Some languages such as Python have a specific syntax for describing the expected results of a function (which is also used to document the programs). Testing can not prove the correctness of a given program but it is very useful for finding most of the small careless mistakes of programmers.
Verification through model checking  consists in: (1) defining a formal model of the system to be analysed; (2) expressing expected properties formally (generally in a temporal logic); (3) using automated tools to check whether the model fulfils the properties. This is done by calculating the state-space of the model representing all the diﬀerent configurations of the execution of the program. The state-space construction problem is that of computing the explicit representation of a given model from the implicit one. In most cases, this space is constructed by exploring all the states reachable through a successor function from an initial one. Building a state-space is the first step in verification by model-checking of logical properties. Temporal logics are mainly used to compare all traces of execution against a given formal specification. Systems that are to be analysed are mostly concurrent communicated “agents”, i.e. small programs. Model-checking can only find unexpected results for a given finite number of agents but not for any number of agents. However, model-checkers use diﬀerent techniques to reduce the state-space (or the time to compute it) and to check the logical formula. That allows the checking of more agents or more complicated systems (since less memory and time are needed in the verification).
Static Analysis, Abstract Interpretation and Symbolic Execution
This class of automatic methods is based on the analysis of a (full) source code, sometimes on the executable code (after compilation) and therefore performed without truly executing the code. These methods do not prove the “full” correctness of programs but rather prove that some properties hold during the entire execution of the programs. Those properties can be safe execution, absence of deadlock or (integer/buﬀer/…) overflow, termination of the computation, security issues, etc.
Static program analysers are mainly based on some kind of type systems. If the source code is valid for the type system then some simple properties hold during the execution of the program. Type sys-tems give some (polymorphic) types to elementary objects and sub-expressions of the code. Then, some specific rules are used to compose these types. ml, haskell, java, ada (and many others) are common examples of typed languages. There are other kinds of specific analysers which extend the traditional type systems of programming languages.
Abstract interpretation is a generalisation of type systems, which uses Galois connections and mono-tonic functions over ordered sets . It can be viewed as a partial execution of a computer program which gains information about its semantics by running an approximation of the execution. Approxi-mation is used to allow for vague answers to questions; this simplifies problems making them amenable to automatic solutions. One crucial requirement is to add suﬃcient vagueness so as to make problems manageable while still retaining enough precision for answering the important questions (such as “will the program crash?”). Abstract interpretation thus works on abstract domains of the data of the analysed program. When one chooses an abstract domain, one typically has to strike a balance between keeping fine-grained relationships, and high computational costs. Diﬀerent kind of abstractions exist such as convex polyhedra, congruence relations, etc.
Symbolic execution is a special case of abstract interpretation and works by tracking symbolic rather than actual values of the code and is used to reason about all the inputs that take the same path through a program. Mainly, true values are substituted by symbolic variables and a constraint solver is used to test assertions in the program or to test if a given property holds during a symbolic execution. For branching constructs, all feasible paths run symbolically. But the number of feasible paths in a program grows exponentially with an increase in program size. Thus the method is inappropriate for large programs.
Programming Through Theorem Proving and B Method
Machine-checked proofs (such as mechanized semantics of programming languages) are realised using theorem provers. A theorem is given to the system together with some tactics for solving the theorem using the logic rules of the system. Some theorem provers are fully automatic while others have special tactics to check automatically some simple goals. Several theorem provers (such as coq) are based on the Curry-Howard isomorphism which claims that “a proof is a program, the formula it proves is a type for the program”. In this way, proving a property is as giving a (functional) program that computes the property. For example, the property “for any list of objects, there exist a list with the same objects but arranged in an ordered fashion” corresponds to giving a function for sorting lists. Programming through theorem proving  is thus proving that the formal specification (a property) is valid and then extracting the program from the proof. The resulting program would be automatically correct (assuming a correct extraction) since it is taken from the mechanized proof. In contrast, top-down design of a program starts from an abstract formal specification. The specification is then refined into a more concrete form that incorporates design decisions reflecting a commitment to algorithmic and data representation. An executable code can be then extracted. The b method yields such a refinement technique of abstract machines. Correctness of abstract machines, mainly with logical invariants of the computations, are performed using a kind of Hoare logic (described below).
Advantages of Bridging Models for Correctness of Parallel Programs
One inherent diﬃculty in the development of scientific software is the reconciliation of the requirements that code be both correct and eﬃcient. Today, the number of parallel computation models and languages may exceed the number of diﬀerent architectures available. Most are inadequate because they make it hard to achieve portability, correctness and performance. Such systems are prone to deadlocks. Furthermore, the performance of such programs is typically diﬃcult to predict because of the interaction of large number of individual data transfers. Placing too much emphasis on correctness may result in an abstract, but ineﬃcient, programming model. Alternatively, striving for optimal eﬃciency can run the risk of comprising software correctness and can result in the employment of architecture-specific programming models. A common (but still insuﬃcient) solution is the use of a bridging model.
In computer science, a bridging model is an abstract model of a computer which provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine; in other words, it is intended to provide a common level of understanding between hardware and software engineers. A bridging model provides software developers with an attractive escape route from the world of architecture-dependent parallel software. A successful bridging model is one which can be eﬃciently implemented in reality and eﬃciently targeted by programmers; in particular, it should be possible for a compiler to produce good code from a typical high-level language. The term was introduced in Leslie Valiant’s 1990 paper “A Bridging Model for Parallel Computation” , which argued that the strength of the von Neumann model was largely responsible for the success of computing. The paper goes on to develop the bsp model as an analogous model for parallel computing.
There exist other tentatives of bridging models such as Logp [24, 73] (and many variants such as cgm , loggp, logpc ), clumps, e-bsp, d-bsp, sqm, etc. None are as eﬀective as bsp. For hybrid and hierarchical architectures, new bridging models such as the Hierarchical Hyper Clusters of Heterogeneous Processors (hihcohp) model  or the Multi-bsp model  have been designed. But these models are too complex in comparison with bsp: compare algorithms is thus much harder (even if in a low-level parallel model, some trick optimisations are easier to find) and especially, for our purpose, proof of correctness would be harder because parallel routines could be much harder (more side-eﬀects, unspecified behaviour etc.) to formalize.
The only sensible way to evaluate an architecture-independent model of parallel computation such as bsp is to consider it in terms of all of its properties, that is (a) its usefulness as a basis for the design and analysis of algorithms; (b) its applicability across the whole range of general-purpose architectures and its ability to provide eﬃcient, scalable performance on them; (c) its support for the design of fully-portable programs; and (d) software engineering tools such as those for correctness or debug can be easily adapt to programs of this bridging model.
Take for example, a proof of correctness of a gpu-like program. Although interesting in itself it cannot be used directly for clusters of pcs. A bridging model has the advantage that if a program is correct, then this is the case for “all” physical architectures. Note that it is also the case for portable libraries such as mpi but algorithm design would be clearly architecture independent, which will be not the case using a bridging model. Moreover, it is known and accepted that correctness of programs is more costly in terms of work than just programming and designing algorithms. Hence the choice in this thesis of the bsp bridging model to provide both portability for proofs of correctness and a model for algorithmic design and eﬃcient programs.
High-performance Routines and Oblivious Synchronisation
In some bsp libraries, message passing and drma routines have their high-performance counterpart. Those routines are less safe but can be more eﬃcient. They are unbuﬀered and nonblocking that is when a processor executes hp-send(v,i) to send a value v to processor i, then it is unspecified when the value v will be sent. In this way, if the program changes the value of v, it is unknown which value will be received by processor i. In case of a send(v,i), it is a copy of v that is sent and thus the program can then modify v for its own purpose without modifying the meaning of the sending. Note that unlike some “mpi send routines”, all “bsp’s send routines” are nonblocking in the sense that the sender does not wait until the receiver truly receives the value; the good reception is guaranteed by the bsp barrier, i.e. the end of the super-step. Note also that some “mpi send routines” are also asynchronous as well as reception: the famous isend and irecv where mainly a “wait” for a request is performed to know when the message is truly received. It is thus possible to program bsp algorithms using mpi but it is more diﬃcult and error prone since one has to manage all the waiting of data.
The oblivious synchronization is the high-performance pending of the traditional bsp synchronisation. It should be used if the programmer knows the number of messages each processor will receive in a super-step. Thus, in the oblivious synchronization each processor waits until “nmsgs” are received. This type of synchronization is cheap and much faster than the other one because no additional communication is needed to determine that processors can move on to the next super-step. To our knowledge, The pub library has been the first to propose this kind of synchronisation.
Table of contents :
1.1.2 Preuve vérifiée par ordinateur de programmes
1.1.4 Vérification de programmes parallèles
1.2 Programmation BSP impérative
1.2.1 Différents types d’opérations BSP
1.2.2 Choix pour BSP-Why
1.2.3 Erreurs fréquentes dans les programmes BSP
1.3 L’outil BSP-Why
1.3.1 Le langage BSP-Why-ML
1.3.2 Transformation des programmes BSP-Why-ML
1.4 Étude de cas
1.4.1 Algorithmes BSP basiques
1.4.2 Construction parallèle d’espaces d’états
1.5 Sémantiques formelles
1.5.1 Sémantiques à grand-pas, et à petit-pas
1.5.2 Sémantiques en Coq
1.5.3 Preuve de la transformation
1.6.1 Résumé des contributions
2.1 Context of the Work: Generalities and Background
2.1.1 Why Verify Parallel Programs?
2.1.2 Why is Verification of Parallel Programs Hard?
2.2 Machine-checked Proof of Programs
2.2.1 Common Methods for Having Machine-checked Correct Programs
2.2.2 Hoare Logic and Deductive Verification of Programs
2.3 Parallel Computing
2.3.1 Different Forms of Parallelism
2.3.2 The Bulk-Synchronous Parallelism
2.4 Verification of Parallel Programs
2.4.2 Advantages of Bridging Models for Correctness of Parallel Programs
2.4.3 Our Approach: Deductive Verification of BSP Programs
3 Imperative BSP Programming
3.1 Different Kinds of BSP Operations
3.1.2 Description of these Routines
3.2 Some Details About Some BSP Libraries
3.2.1 The Message Passing Interface (MPI)
3.2.2 BSPLIB Like Libraries for C
3.2.3 BSP Programming over Multi-cores and GPUs
3.2.4 BSP Programming in Java
3.2.5 Other Libraries
3.2.6 Which Routines for BSP-Why
3.3 Common Errors Introduced in BSP Programming
3.3.1 Common Errors and Bugs
3.3.2 These Errors in BSP Programs
3.4 Related Works
3.4.1 Parallel Libraries and Imperative Programming
3.4.2 Other Parallel Paradigms
4 The BSP-Why Tool
4.1 The BSP-Why-ML Intermediate Language
4.1.1 Syntax of BSP-Why
4.1.2 The Parallel Memory Model in Why
4.2 Transformation of BSP-Why-ML Programs: the Inner Working
4.2.2 Identification of Sequential Blocks
4.2.3 Block Tree Transformation
4.2.4 Translation of Sequential Blocks
4.2.5 Translation of Logic Assertions
4.2.6 Dealing with Exceptions
4.3 Dealing with Subgroup Synchronisation
4.3.1 Subgroup Definition
4.3.2 Transformation of Programs with the Subgroup Synchronisation
4.4 Related Work
4.4.1 Other Verification Condition Generators
4.4.2 Concurrent (Shared Memory) Programs
4.4.3 Distributed and MPI Programs
4.4.4 Proof of BSP Programs
5 Case Studies
5.1 Simple BSP Algorithms
5.1.1 Parallel Prefix Reductions
5.1.2 Parallel Sorting Algorithm
5.1.3 Mechanical Proof
5.2 Parallel State-space Construction
5.2.1 Motivations and Background
5.2.2 Definitions and Verification of a Sequential State-space Algorithm
5.2.3 Verification of a Generic Distributed State-space Algorithm
5.2.4 Dedicated Algorithms for Security protocols
5.3 Related Work
5.3.1 Other Methods for Proving the Correctness of Model-checkers
5.3.2 Verification of Security Protocols
5.3.3 Distributed State-space Construction
6 Formal Semantics
6.1 Big-steps Semantics
6.1.1 Semantics Rules
6.1.2 Co-inductive Semantics
6.1.3 Adding the Subgroup Synchronisation
6.2 Small-steps Semantics
6.2.1 Semantics Rules
6.2.2 Co-inductive Semantics
6.2.3 Equivalence Between the Semantics
6.2.4 Adding the Subgroup Synchronisation
6.3 Semantics in Coq
6.3.1 Mechanized Semantics in Coq
6.3.2 Memory Model
6.3.3 Environment of Execution
6.3.4 Semantics Rules
6.3.5 Adding the Subgroup Synchronisation
6.4 Proof of the Translation
6.4.1 Program Correctness
6.4.2 Equivalence Between Elements of the Semantics
6.4.3 Elements of Proof in Coq
6.5 Related Work
7.1 Summary of the Contributions
7.2 Future Work and Perspectives
7.2.1 Close Future Work
7.2.2 Long Term Perspectives