Get Complete Project Material File(s) Now! »

## Tolerating Transient Fault Patterns

As mentioned previously, self-stabilization is convenient to tolerate any transient fault pattern. This section is devoted to the description of this fault tolerance scheme and its main variants (at the notable exception of those tolerating composite fault patterns that are described in Section 4.2).

Self-stabilization was defined by Dijkstra in [Dij74]. Intuitively, to be selfstabilizing, a distributed protocol must satisfy the two following properties:

Closure: there exists some configurations from which any execution of the distributed protocol satisfies the specification; and Convergence: starting from any arbitrary configuration, any execution of the distributed protocol reaches in a finite time a configuration that satisfies the closure property.

We can use a self-stabilizing distributed protocol to perform fault-tolerance due to the following observations. At the end of a burst of transient faults, the configuration of the system may be arbitrary (recall that we do not add assumptions on the nature or the span of the transient faults, hence any vertex may be Byzantine during the fault and arbitrarily modifies its state). Then, a self-stabilizing distributed protocol ensures that after a finite time (called the convergence or stabilization time), the distributed protocol recovers from itself a correct behavior (by convergence property) and keeps this correct behavior until there is no faults (by closure property).

In the same situation, a classical (i.e. not fault-tolerant) distributed protocol may never recover as well as a robust distributed protocol (since the fault may not be included in its tolerated range). In conclusion, a self-stabilizing distributed protocol is well-suited for tolerating any transient fault pattern. We state below the formal definition of self-stabilization that we use in the sequel of this thesis.

### Weakening Self-Stabilization

This section presents a (non-exhaustive) survey on variants that weaken selfstabilization using different approaches. The main goal of these variants is obviously to allow to solve a larger class of problem than self-stabilization.

Pseudo-stabilization First, we present pseudo-stabilization introduced by Burns, Gouda, and Miller in [BGM93]. They remove the guarantee of reaching a configuration from which the execution satisfies the specification. Instead, a pseudostabilizing distributed protocol guarantees that every execution has a suffix that satisfies the specification. Formal statement of this concept follows.

A distributed protocol is pseudo-stabilizing for specification spec if and only if starting from any arbitrary configuration every execution of has a suffix satisfying spec.

The main weakening with respect to self-stabilization is the following: when a portion of execution satisfies the specification, we have no guarantee that any subsequent execution satisfies the specification (due to the implicit weakening of the closure property). However, for an external observer, the execution is eventually correct that motivates pseudo-stabilization from a practical point of view when selfstabilization is not possible. For instance, [BGM93] proves that the alternating bit protocol (see Section 6.1.1) is pseudo-stabilizing.

Probabilistic stabilization Although this thesis focuses only on deterministic distributed protocols, we quickly discuss probabilistic stabilization. Probabilistic stabilizing distributed protocols are introduced in [Her90] as distributed protocols that ensure the convergence property with probability 1. Later, [DPBT04] distinguished a strong and a weak variant of probabilistic stabilization. The original definition of [Her90] corresponds to strong probabilistic stabilization while weak probabilistic stabilization weakens also the closure property since this latter is only probabilistic. For instance, the mutual exclusion distributed protocol provided by [DHT04] falls in weak probabilistic stabilizing distributed protocols category.

Weak-stabilization Gouda introduced in [Gou01] another weakening of self-stabilization called weak-stabilization. Starting from any arbitrary configuration, a weakly stabilizing distributed protocol only ensures that there exists at least one execution that reaches in a finite time a configuration from which any execution satisfies the specification. In other words, weak-stabilization weakens the convergence property by requiring only the existence of executions that satisfies the convergence property (and not that each execution satisfies it).

#### Enhancing Self-Stabilization

In this section, we present some of the variants of self-stabilization that exhibit stronger fault tolerance than self-stabilization (by requiring supplementary properties). Note that we do not consider here permanent or intermittent fault tolerance issues since they are described in details in the following section of this chapter. Snap-stabilization Bui et al. defined snap-stabilization in [BDPV07]. A snapstabilizing distributed protocol ensures that, starting from any arbitrary state, any execution of the distributed protocol satisfies the specification. In other words, a snap-stabilizing distributed protocol is a self-stabilizing distributed protocol with a stabilization time of 0. The main interest of snap-stabilization is the stronger safety properties provided to the user of the distributed system. Snap-stabilization was essentially studied through the propagation of information with feedback (PIF) problem [BDPV07], the token circulation problem [PV07], or the message forwarding problem [CDV09].

Super-stabilization Super-stabilizing distributed protocols have been introduced by Dolev and Herman in [DH97] as self-stabilizing distributed protocols that moreover satisfies a safety predicate when a topological change occurs during a legitimate execution. Note that constraints on topological changes are strong since they cannot occur during the stabilization phase (in this case, the distributed protocol may never stabilize). We can generalize this approach by defining classes of self-stabilizing distributed protocols that moreover satisfies a given predicate (either during or after the stabilization phase) in spite of the occurrence of new faults (of defined nature, duration, and/or span). For instance, [JT03] proposes a shortest path spanning tree construction that ensures moreover a property of route-preservation (intuitively, once a path is constructed towards a vertex, no message for this vertex can be lost).

**Tolerating Composite Fault Patterns**

This section aims to present existing fault tolerance schemes that are able to deal with composite fault patterns (see Section 2.4.2). Recall that a composite fault pattern gathers several classical fault patterns. In the following, we focus on tolerance to transient and permanent crash fault patterns (see Section 4.2.1) and to transient and intermittent Byzantine fault patterns (see Sections 4.2.2 and 4.2.3).

**Fault-Tolerant Self-Stabilization**

Fault-tolerant self-stabilization is the most natural way to gather robust and self-stabilizing distributed protocols properties. Indeed, this fault tolerance scheme defined simultaneously by [AH93] and by [GP93] ensures that distributed protocols satisfies closure and convergence properties of Section 4.1 in presence of any arbitrary transient and permanent crash fault pattern. In other words, a fault-tolerant selfstabilizing (FTSS for short) distributed protocol recovers a correct behavior in a finite time after transient faults in spite of crashes of a given proportion of vertices. We can state the formal definition of fault-tolerant self-stabilization .

**Problem and Related Works**

As self-stabilization is usually considered a hard property to satisfy, most related works used the state model (see Section 2.3.2) in which any vertex can determine the current state of every neighbors (and update its own state accordingly) in an atomic manner. Asynchronous message passing is a more realistic way, compared to the state model, for the communication of vertices in distributed systems. In such settings vertices communicate by exchanging messages, where sending and receiving message are two separate atomic actions (see Section 2.3.3). Transformers for distributed protocols from state model to message passing model, assuming the existence of FIFO communication channels, have been suggested, see e.g. [DIM93, Dol00]. At the core of those transformers are the data-link protocols, that permit to reliably exchange information between neighboring vertices in the message passing model. In addition, several self-stabilizing distributed protocols (e.g. [DT06, AAD+10]) that are directly written in the message-passing model use an underlying data-link protocol as a building block.

**Bounded Labelling Systems**

This section aims to present a second necessary tool for our atomic register simulation provided in Chapter 7. This tool is a bounded labeling system, that is a set of labels (a.k.a. time-stamps) enhanced with a total antisymmetric binary relation (in order to compare labels with each other) that always allows the computation of a new label greater than (according to the binary relation) any bounded-sized set of labels.

Section 6.2.1 presents the problem and motivates the need of a new bounded labeling system to tolerate constraints of self-stabilizing settings and Section 6.2.2 presents our solution to this problem.

**Problem and Related Works**

In any system, it is often useful to enhance data or events with some information to compare them. For example, date and time of creation and of last modification are always associate to files in order to compare versions. In a centralized system, the value of the clock is a good information to keep trace of time. Due to asynchronism (each vertex has its own clock and its own speed), we cannot use clock values to date data or events in a distributed system since we are not able to compare clock values from two different vertices. The key idea is to store an information that allows to deduce what event precedes the other. Israeli and Li defined in [IL93] time-stamps as “numerical labels which enable a system to keep track of temporal precedence relation among its data elements”.

**Table of contents :**

**1 Introduction **

1.1 Context of the Thesis

1.2 Thesis Contributions

**I Context **

**2 Model **

2.1 Distributed System

2.1.1 Characteristics

2.1.2 Advantages

2.2 Communication Graph

2.3 Models Used in this Thesis

2.3.1 Generic Model

2.3.2 State Model

2.3.3 Message Passing Model

2.4 Fault Taxonomy

2.4.1 Faults

2.4.2 Fault Patterns

**3 Taxonomy of Daemons **

3.1 Characterization of Daemons

3.1.1 Distribution

3.1.2 Fairness

3.1.3 Boundedness

3.1.4 Enabledness

3.2 Comparing Daemons

3.2.1 Comparing daemon classes

3.2.2 Preserving execution properties

3.2.3 The Case of the Synchronous Daemon

3.2.4 A map of classical daemons

3.3 Daemon Transformers

**4 Fault Tolerance **

4.1 Tolerating Transient Fault Patterns

4.1.1 Weakening Self-Stabilization

4.1.2 Enhancing Self-Stabilization

4.2 Tolerating Composite Fault Patterns

4.2.1 Fault-Tolerant Self-Stabilization

4.2.2 Byzantine Tolerant Self-Stabilization

4.2.3 Strict Stabilization

4.3 Summary

**II Atomic Register **

**5 Introduction of Part II **

5.1 Problem and Related Works

5.1.1 Problem

5.1.2 Related Works

5.1.3 Specification

5.2 Contributions of Part II

**6 Preliminaries **

6.1 Data-Link Protocol

6.1.1 Problem and Related Works

6.1.2 Specification

6.1.3 Lower Bounds

6.1.4 Optimal Solution

6.1.5 Correctness Proof

6.2 Bounded Labelling Systems

6.2.1 Problem and Related Works

6.2.2 Solution

**7 Atomic Register Simulation **

7.1 The ABD Simulation

7.2 The FTPS Simulation

7.2.1 Distributed Protocol

7.2.2 Proof of Correctness

7.2.3 Conclusion

**8 Conclusion of Part II **

8.1 Summary of Contributions

8.2 Concluding Remarks

**III Unison **

**9 Introduction of Part III **

9.1 Problem and Related Works

9.1.1 Problem

9.1.2 Related Works

9.1.3 Specification and Definitions

9.2 Contributions of Part III

9.3 Fault-Tolerant Self-Stabilization

**10 Impossibility Results **

10.1 General Results

10.1.1 Two and more Byzantine Faults

10.1.2 Unfair Daemon

10.2 Minimal Unison Related Results

10.2.1 Weakly Fair Daemon

10.2.2 Strongly Fair Daemon and Maximal Degree greater than 3

10.3 Priority Unison Related Results

10.3.1 Weakly Fair Daemon

10.3.2 Strongly Fair Daemon and Maximal Degree greater than 3

10.4 Summary of Impossibility Results

**11 Strictly Stabilizing Solution **

11.1 Strictly Stabilizing Solution

11.1.1 Distributed Protocol Description

11.1.2 Correctness Proof

11.2 Optimality of Convergence Time

11.2.1 Upper bound

11.2.2 Lower Bound

11.2.3 Conclusion

**12 Conclusion of Part III **

12.1 Summary of Contributions

12.2 Concluding Remarks

**IV Spanning Tree **

**13 Introduction of Part IV **

13.1 Problem and Related Works

13.1.1 Related Works

13.1.2 Specification

13.2 Contributions of Part IV

13.3 Containing Byzantine Faults in Self-Stabilization

13.3.1 Strict Stabilization

13.3.2 Strong Stabilization

13.3.3 Topology-Aware Stabilization

13.3.4 Discussion

**14 Two Case Studies **

14.1 Spanning Tree without Constraints

14.1.1 Strongly Stabilizing Distributed Protocol

14.1.2 Proof of Strong Stabilization

14.2 BFS Spanning Tree

14.2.1 Impossibility of Strong Stabilization

14.2.2 Topology-Aware Stabilizing Solution

14.2.3 Proof of Topology-Aware Strict Stabilization

14.2.4 Proof of Topology-Aware Strong Stabilization

**14.3 Summary **

**15 General Case **

15.1 Topology-Aware Stabilizing Solution

15.1.1 Distributed Protocol

15.1.2 Proof of Topology-Aware Strict Stabilization

15.1.3 Proof of Topology-Aware Strong Stabilization

15.2 Optimality of Containment Areas

15.2.1 Topology-Aware Strict Stabilization

15.2.2 Topology-Aware Strong Stabilization

15.3 Strong Stabilization

15.4 Summary

**16 Conclusion of Part IV **

16.1 Summary of Contributions

16.2 Concluding Remarks

**17 Conclusion **

17.1 Overview of Thesis Contributions

17.1.1 Part One: Context

17.1.2 Part Two: Atomic Register

17.1.3 Part Three: Unison

17.1.4 Part Four: Spanning Tree

17.1.5 Summary

17.2 Perspectives

**A Version française **

A.1 Contexte de la thèse

A.1.1 Généralités

A.1.2 Modèles et tolérance aux fautes

A.2 Registre atomique

A.2.1 Contexte

A.2.2 Contributions

A.2.3 Perspectives

A.3 Unisson

A.3.1 Contexte

A.3.2 Contributions

A.3.3 Perspectives

A.4 Arbre couvrant

A.4.1 Contexte

A.4.2 Contributions

A.4.3 Perspectives

A.5 Conclusion

List of Notations

**Bibliography **