Importance of the Shared State
The shared state of a system is a central notion for both state space exploration and parallelization. A model checker systematically generates runs (→) of the distributed program to check all possible interleavings of the processes in the system. For this, it requires control over the state of the network, the only shared state among processes. The order in which processes perform communications, affecting the shared network state, induces a corresponding run, and therefore a partial order → of events.
To illustrate this, consider a program with processes p1, p2, and p3 where processes p1 and p2 each send a message to p3 and then perform some local computation, while process p3, on the other hand, expects to receive two messages from any process. The execution of this program can lead to the two different runs shown in Figure 2.1. In the first, p1 issues its send (send1) before p2 (send2) and thus it matches the first receive of p3 (recv), forcing send2 to match the second receive, as shown in Figure 2.1a. The other alternative is shown in Figure 2.1b. Here, the order is inverted and p2 issues send2 before p1, matching the first receive and forcing p1 to match the second, resulting in a completely different run →.
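The matching behavior described above can be sketched as follows. This is an illustrative model, written in Python for brevity; the names (`Mailbox`, `send`, `recv`, `run`) are hypothetical and not SimGrid's actual API. The point is that the resulting run is fully determined by the order in which the sends are issued.

```python
from collections import deque

class Mailbox:
    """Matches each posted receive with the oldest pending send."""
    def __init__(self):
        self.pending_sends = deque()

    def send(self, sender, payload):
        # Sends are queued in issue order.
        self.pending_sends.append((sender, payload))

    def recv(self):
        # The first send announced matches the first receive posted.
        return self.pending_sends.popleft()

def run(issue_order):
    """Replay one interleaving; issue_order lists which process sends first."""
    box = Mailbox()
    for proc in issue_order:
        box.send(proc, "msg from " + proc)
    # p3 posts two receives; their sources depend on the interleaving.
    return [box.recv()[0] for _ in range(2)]

# Run of Figure 2.1a: p1 issues send1 before p2 issues send2.
assert run(["p1", "p2"]) == ["p1", "p2"]
# Run of Figure 2.1b: the inverted issue order yields a different run.
assert run(["p2", "p1"]) == ["p2", "p1"]
```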
We can infer from this that a requirement for the model checker is to have a mechanism to intercept the networking operations before they modify the state of the network. To steer the execution of the program and cover every possible interleaving, the modifications of the network state should be postponed until all the processes have announced their intended communication operations, so that all the matching possibilities can be considered.
The shared state is equally important for the parallel execution of the user processes during simulation. These can interact with the simulated platform at any point of the execution, and in many cases these actions involve modifying shared portions of the simulation state, as previously shown with the example of Figure 2.1. Therefore, to maintain the coherence of the parallel simulations, all potential operations on the shared state should be atomic. Moreover, to guarantee the reproducibility of the simulations, these operations must always happen in the same order; otherwise, different interleavings might change the matching of sources and destinations of messages and thus result in different runs. Again, the requirements concerning the shared state become a critical point in achieving the goals proposed in this thesis.
Shared State in SimGrid
We now analyze the current architecture of SimGrid with respect to the requirements presented in the previous section. We show that the main shortcoming of its design is the lack of encapsulation of the simulation's shared state, which is scattered across the whole software stack and thus difficult to control and manipulate. It is important to recall that according to SimGrid's original requirements this is not a problem at all. The limitations become apparent only after considering the possibility of executing the user processes in parallel, or when trying to execute all possible runs of a program as required for the model checker. In the following subsections we discuss these problems in detail.
Dispersed Shared State
The data structures that compose the shared state of the simulation are scattered across the whole software stack. In particular, the network state is deeply disaggregated among the simulation core, the virtualization layer, and the communication API, as shown in Figure 1.5 on page 66. The state of the network's applicative layer is contained in the communication APIs. The state of the execution contexts, the synchronization primitives (i.e., the condition variables), and the association of the actions in the simulation core with the condition variables blocking the processes waiting for them are located in the SIMIX module. Finally, the state of the resources, such as the CPUs, and the state of the network's transport layer are handled by SURF.
This is also reflected in the way the communications are handled. To simulate the exchange of a message between two given processes, the networking API first updates the internal state of the network, then it creates the condition variables to block the processes participating in the communication (in SIMIX), and next it creates the actions in the simulation core (SURF), associating them with the condition variables to wait until the actions complete. Afterwards, once the simulation core determines that the actions have finished, the networking code cleans up all the associated data structures. Each of these operations corresponds to a different level of abstraction: the network, the virtualization environment, and the simulation core; however, they are all performed by the API code.
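The cross-layer flow just described can be sketched as follows. This is a deliberately simplified model in Python (SimGrid itself is written in C), and all names (`Simix`, `Surf`, `api_send`) are hypothetical; the point is that a single API function reaches into all three layers from within the user's execution context.

```python
class Simix:
    """Virtualization layer: condition variables that block/wake processes."""
    def __init__(self):
        self.condvars = {}

    def new_condvar(self, comm_id):
        self.condvars[comm_id] = []   # list of processes blocked on this comm

class Surf:
    """Simulation core: actions modelling resource usage."""
    def __init__(self):
        self.actions = {}

    def new_action(self, comm_id, size):
        self.actions[comm_id] = size  # transfer of `size` bytes to simulate

def api_send(comm_id, size, simix, surf, network_state):
    """One API call touching three abstraction levels, as in the text."""
    # 1. Network layer: update the applicative network state.
    network_state[comm_id] = "started"
    # 2. Virtualization layer: condvar to block the participants.
    simix.new_condvar(comm_id)
    # 3. Simulation core: action representing the transfer.
    surf.new_action(comm_id, size)

simix, surf, net = Simix(), Surf(), {}
api_send("c1", 1024, simix, surf, net)
assert net["c1"] == "started" and "c1" in simix.condvars
```

Every call crosses abstraction boundaries, which is precisely the coupling criticized in this section.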
Almost all the data structures that the API manipulates are shared across the entire simulation. Under sequential simulation this is not an issue, as at any time only one user process is in execution, and thus there is no risk of race conditions. However, this design poses serious difficulties in fulfilling the atomicity requirement of the parallel execution. The problem can be clearly visualized in Figure 1.4 on page 65. The execution contexts that run the code of the user also run the functions of the communication APIs (USER+API+SIMIX arrows) that affect the dispersed shared data structures. Allowing the concurrent execution of these contexts would require fine-grained locking across the entire software stack to avoid race conditions and maintain consistency. This would be both extremely difficult to get right and prohibitively expensive in terms of performance. Moreover, each communication API implements its own logic to affect the state of the network, so each API would require its own locking mechanism, with little possible factorization of the locking logic between APIs. Finally, even if these difficulties were solved to ensure the internal consistency of the simulator, the parallel execution with this design would hinder the reproducibility of simulations, because the scheduling order would vary between simulations, possibly yielding many different runs →. All these reasons explain why distributing the code that modifies the simulation's shared state across the execution contexts of the user code clearly hinders the possibility of running these contexts in parallel efficiently.
Lack of Control Over the Executed Run
A model checker has to systematically generate every possible run → of the system. As explained in Section 2.1, this requires a transition interception mechanism capable of blocking the processes before they modify the network state. This makes it possible to postpone the decisions about the matching of sources and destinations of the messages until every process has announced its transition, so that all the possibilities can be considered.
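The difference with plain simulation can be made concrete with a small sketch (Python, hypothetical names): instead of committing to the natural issue order, the exploration loop waits for all announced sends and then enumerates every possible matching against the posted receives.

```python
from itertools import permutations

def all_runs(announced_sends, n_receives):
    """Enumerate every matching of announced sends to posted receives.

    Each run is the tuple of sends matched, in receive order. Because the
    decision is postponed until all sends are announced, every ordering
    can be explored, not just the one the processes happened to issue.
    """
    runs = set()
    for order in permutations(announced_sends):
        runs.add(tuple(order[:n_receives]))
    return runs

# With send1 and send2 announced and two receives posted by p3,
# the checker must explore both runs of Figure 2.1.
assert all_runs(["send1", "send2"], 2) == {("send1", "send2"),
                                           ("send2", "send1")}
```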
The main difficulty in implementing such a mechanism in SimGrid is again the lack of proper control over the network state. This might be surprising at first, as SimGrid is a simulator and should already provide a mechanism to intercept the communication actions of the processes to simulate the network transfers. However, the problem arises from the slightly different requirements of this functionality. A simulator's goal is to block the processes involved in a communication until the simulation core determines that the corresponding simulated time has elapsed. Therefore, the interception mechanism only provides a means to delay processes, not to decide the matching of sources and destinations, which is instead determined by the natural issuing order of the network operations. Note that this perfectly fits the requirements of simulation: recall that the goal is to generate one possible run subject to the restrictions imposed by the platform, and the current design complies with it.
SimGrid provides several communication APIs to the user, each tailored for a particular type of application. Despite the apparent external differences, internally the handshake protocol and its interception mechanism vary little from one user API to another. The differences are minor, and in essence they all consist of a procedure to determine the source and destination of the message, a condition variable to block the involved processes, and the creation/destruction of an action in the simulation core. Nevertheless, the communication APIs re-implement the same functionality mixed with code specific to each of them, leading to an unnecessarily complex code base that is hard to modify and maintain. With this scheme, a model checker capable of verifying programs written for one of the APIs of the simulator would require an API-specific transition detection mechanism.
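The common core that the APIs re-implement could, as the text suggests, be factored into a single generic handshake routine. The sketch below (Python, hypothetical names) isolates the only genuinely API-specific part, the matching predicate, as a parameter.

```python
def handshake(match_fn, pending, request):
    """Generic rendezvous shared by all APIs (illustrative sketch).

    match_fn encodes the API-specific rule deciding which pending
    operation pairs with `request`. On a match, the pair is returned
    (and would trigger the creation of an action in the simulation
    core); otherwise the request is queued and its issuer blocked.
    """
    for i, other in enumerate(pending):
        if match_fn(request, other):
            del pending[i]
            return (other, request)   # matched pair: create action, block both
    pending.append(request)
    return None                       # no partner yet: block the issuer only

# A trivial API whose rule matches any send with any receive:
pending = []
assert handshake(lambda a, b: True, pending, "send1") is None
assert handshake(lambda a, b: True, pending, "recv") == ("send1", "recv")
```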
Assumptions on the Scheduling Order
SimGrid is designed as a scientific measurement tool, and strives to provide deterministic and reproducible simulations. When non-deterministic executions are possible, SimGrid always makes the same choices for the sake of reproducibility. However, the lack of variation in the scheduling order has led to an implementation that is sensitive to it. The code assumes in many places, especially in the communication APIs, that certain events always execute in the same order. This is clearly not true when executing the user code in parallel or when performing a state space exploration where all possible schedules are considered.
SIMIX v2.0
In this section we introduce SIMIX v2.0, a new virtualization layer that replaces the SIMIX module to address the requirements of dynamic verification and parallelization. SIMIX v2.0 aims to encapsulate the simulator's shared state, and provides a mechanism to guarantee that all potential modification operations on this shared state are race free. These two characteristics greatly simplify the implementation of both the parallel execution of user code under simulation and the exploration algorithm of the model checker. The design of SIMIX v2.0 starts from the observation that the services the user processes require from the simulation stack are similar to those provided by any classical operating system (OS): processes, inter-process communication, and synchronization primitives. Therefore it is reasonable to design a new virtualization module that incorporates and encapsulates these three services to solve the aforementioned shortcomings of the shared state.
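The encapsulation idea can be sketched as follows: user processes never touch the shared state directly, but enqueue requests that a central scheduler executes sequentially, making each operation atomic and handled in a fixed, reproducible order. This is an illustrative Python model under assumed names (`Maestro`, `issue`, `schedule`), not the actual SIMIX v2.0 implementation.

```python
class Maestro:
    """Central scheduler: the only code allowed to touch the shared state."""
    def __init__(self):
        self.shared_state = {}
        self.requests = []

    def issue(self, pid, op, *args):
        # Called from a user context: record the intent, do not act yet.
        self.requests.append((pid, op, args))

    def schedule(self):
        # Single thread of control: handle requests in a fixed order
        # (here, by process id), so the outcome does not depend on
        # which process happened to issue its request first.
        for pid, op, args in sorted(self.requests):
            if op == "send":
                self.shared_state.setdefault("mailbox", []).append((pid,) + args)
        self.requests.clear()

m = Maestro()
m.issue(2, "send", "hello")   # p2 issues first...
m.issue(1, "send", "world")   # ...but p1's request is handled first.
m.schedule()
assert m.shared_state["mailbox"] == [(1, "world"), (2, "hello")]
```

Because all state mutations funnel through one sequential point, this same mechanism gives the model checker a single place to intercept and reorder transitions.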
Mastering the Operating System
System programming design practices are not only useful as a guideline for the architecture of a simulator; they are also important to optimize its performance. The interaction of the simulator with the operating system of the host involves issuing expensive system calls that might go unnoticed at first, but that produce a performance hit when scaling up to larger simulation instances.
An interesting case is the virtualization of the user processes, which is in the critical path of the simulator. As mentioned before, the user code runs in execution contexts that emulate a cooperative multitasking environment entirely in user space. Using POSIX ucontexts, the execution is transferred from one ucontext to another using the swapcontext function, which is passed a pointer to the context to restore and a pointer to a storage area where the current context should be saved. At first glance, this function runs entirely in user space without requiring the intervention of the OS kernel, but it turns out that this is not true. Indeed, POSIX allows specifying a different signal mask for each ucontext, which induces an operation involving a system call during the swap. Since we do not need this feature, SimGrid now offers an alternative ucontext implementation that is free of system calls. Because the swap routine modifies specific registers, it is architecture dependent and has to be programmed in assembly language. For the moment this option is only available for x86 and x86_64 hardware; other architectures fall back to the standard ucontexts.
Performance Comparison With Previous Version
It is reasonable to expect that the modifications introduced in this chapter affect the simulator's performance. After all, state encapsulation usually increases the overhead of accessing it, and in the case of SimGrid, we introduced a new layer, named requests, to mediate any interaction with the simulated platform. In this section we compare the performance of a previous version of SimGrid based on the old design with one that includes the modifications introduced in this chapter. Moreover, we measure the speedup obtained by the custom implementation of execution contexts presented in Section 2.5.
Table of contents:
1 State of the Art
1.1 Distributed Systems
1.1.1 Distributed Systems Taxonomy
1.1.2 Distributed Algorithms
1.1.3 Model of a Distributed System
1.2.1 Simulation in Science and Engineering
1.2.2 Parallel Discrete Event Simulation
1.2.3 Simulation of Distributed Systems
1.2.4 State of the Art of Distributed System Simulators
1.2.5 Elements of Simulation
1.3 Model Checking
1.3.2 The State Explosion Problem
1.3.3 Software Model Checking
1.3.4 State of the Art of Software Model Checkers
1.3.5 Model Checking Distributed Programs
1.3.6 Model Checking and Simulation
1.4 The SimGrid Simulation Framework
1.4.1 Workflow Overview
1.4.2 SimGrid Architecture
1.4.3 Simulation Algorithm
1.5 Detailed Objectives of This Thesis
2 Bridging Efficient Simulation & Verification Techniques
2.1 Importance of the Shared State
2.2 The Shared State in SimGrid
2.2.1 Dispersed Shared State
2.2.2 Lack of Control Over the Executed Run
2.3 Other Considerations of SimGrid’s Architecture
2.3.1 API-specific Code
2.3.2 Assumptions on the Scheduling Order
2.4 SIMIX v2.0
2.4.1 Strictly Layered Design
2.4.2 Kernel-mode Emulation
2.5 Mastering the Operating System
2.6 Performance Comparison With Previous Version
3 SimGridMC: A Dynamic Verification Tool for Distributed Systems
3.2 The Model
3.3 The Properties
3.5 The Exploration
3.6 Stateless Model Checking and Backtracking
3.7 Coping With the State Explosion Problem
3.8 Dynamic Partial Order Reduction
3.9 Formalization of the Network Model in SimGrid
3.10 Overall Network Model
3.11 Formal Semantics of Communication Primitives
3.12 Independence Theorems
3.13 Implementing Higher-Level Communications APIs
3.14 Comparison to ISP and DAMPI
3.15 Experimental Results
3.15.1 SMPI Experiments
3.15.2 MSG Experiment: CHORD
4 Parallelizing the Simulation Loop
4.1 Parallel Execution of User Code
4.2 Architecture of the Parallel Execution
4.3 Analysis of the Cost of Simulations
4.4 An Efficient Pool of Worker Threads
4.4.1 The Control of the Worker Threads
4.4.2 Task Assignment
4.5 Experimental Settings
4.6 Cost of Thread Synchronization
4.7 Evaluation of the Parallelization Speed-Up
4.8 Comparison to OverSim
5 Conclusions and Future Work
5.2 Future Work