Communication Standards for Agent Fault Tolerance


TibFit and Chameleon

TibFit [67] is a protocol for tolerating arbitrary faults in wireless sensor networks. It uses a trust index to quantify the reliability of each sensor in the network. The index is a real number between zero and one, calculated through a learning process. The process starts with the maximal confidence level (one) and then raises or lowers the value, always keeping it between zero and one, depending on the accuracy of the sensor’s observed behaviour over time. The index weights each sensor’s contribution in a vote designed to help the system reach a consensus by giving more credit to sensors that have proven reliable in the past. To enable the use of a voting protocol, each event is assumed to be located within the range of a well-defined group of sensors. After each assessment, the indices of the relevant sensors are updated. This approach allows the protocol to cover various kinds of error, such as temporary or permanent sensor failures, as well as malicious sensors. TibFit is an interesting example for us because it allows components of a distributed system, and therefore potentially agents of a multi-agent system, to tolerate arbitrary faults collectively through a vote.
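To make the trust-weighted vote concrete, the following is a minimal Python sketch of the idea described above. It is not the TibFit algorithm itself: the additive update step, the tie-breaking rule and all names (Sensor, weighted_vote, update_trust, run_rounds) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    name: str
    trust: float = 1.0  # every sensor starts at the maximal confidence level


def weighted_vote(reports, sensors):
    """Decide whether an event occurred, weighting each report by sensor trust.

    A tie is resolved as "no event" (an arbitrary choice for this sketch).
    """
    yes = sum(sensors[n].trust for n, seen in reports.items() if seen)
    no = sum(sensors[n].trust for n, seen in reports.items() if not seen)
    return yes > no


def update_trust(reports, sensors, decision, step=0.1):
    """Reward sensors that agreed with the group decision, penalise the others.

    The fixed additive step is an assumption, not TibFit's actual update rule.
    The index is clamped so that it always stays between zero and one.
    """
    for name, seen in reports.items():
        delta = step if seen == decision else -step
        sensors[name].trust = min(1.0, max(0.0, sensors[name].trust + delta))


def run_rounds(rounds):
    """Process a sequence of events; each event is a dict sensor name -> report."""
    sensors, decisions = {}, []
    for reports in rounds:
        for name in reports:
            sensors.setdefault(name, Sensor(name))
        decision = weighted_vote(reports, sensors)
        update_trust(reports, sensors, decision)
        decisions.append(decision)
    return decisions, sensors
```

With two sensors that report correctly and one that repeatedly disagrees, the index of the dissenting sensor shrinks round after round, so its reports weigh less and less in later votes.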
Chameleon [56] is a collection of tools for fault tolerance in a networked environment (several applications working together) and provides three types of entity. Daemons are attached to each node of the system for communication and for local support of the other Chameleon entities. Entities called ARMORs implement specific fault tolerance techniques (a voter ARMOR, for example). The third type of entity consists of managers, which supervise the Chameleon system. Although not a multi-agent architecture itself, this distributed network for fault tolerance is interesting for our study because:
• its ARMORs are tools that are available for use in different applications, resulting in different levels of reliability. This flexibility allows, on the one hand, choosing the tools best suited to the current application and, on the other hand, adjusting the balance between computation speed and level of fault tolerance, knowing that adding tools normally increases the computation cost.
• one objective of this infrastructure is to provide fault tolerance for “off the shelf” applications, which means that the system is transparent for application developers, a feature we would also like to offer through our safety net approach. In other words, we aim to minimise the level of intrusion of our fault tolerance mechanisms so that the programmer can focus on the desired system behaviours. This transparency for applications means that the Chameleon tools are responsible for tolerating the faults that are unexpected for the developers of the final system.
This last statement brings us to an important observation: whether a fault counts as “unforeseen” depends on the point of view. Taking the example of Chameleon, the faults that are “unforeseen” for application developers are not necessarily unexpected for those who implemented the underlying platform and tools. Similarly, we will see that our safety net approach provides mechanisms for faults that are “foreseen” from the platform’s point of view, without this violating the concept of “unforeseen fault” for the final developer.

Mission Data System

The complexity of the system and the constraints imposed by the potentially very long response times of space missions led NASA to use a goal-based control system instead of the usual commands [91]. The essential difference the authors identify between a command and a goal is that a command is tied to a moment in time and does not easily allow its lasting effects to be verified. This also makes it difficult to check for conflicts between different commands. In the proposed system, called Mission Data System (MDS), goals are represented as constraints on state variables over time intervals. Checking for conflicts and inconsistencies then reduces to comparing the constraints on shared variables and their time intervals. Taking the example of a drone, if a command to “avoid hazardous area” is issued, a conflict with “follow target X” when the target enters the danger zone is easier to find if the two are represented as constraints (on the position of the drone and its next movement) than if they are represented as individual commands. At the same time, if the area is classified as “dangerous” between the instants t1 and t2, the verification can also conclude that there is no conflict if, for example, the target enters the area after time t2. The actions to perform are deduced from the differences between the current state of the variables and the desired state.
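The conflict check described above can be sketched in a few lines of Python. This is only an illustration of goals as constraints over time intervals, not the MDS implementation: the Goal class, the discrete set of allowed values and the variable name drone_zone are assumptions, with t1 = 1 and t2 = 2 chosen arbitrarily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    name: str
    variable: str        # state variable the goal constrains
    allowed: frozenset   # values the variable may take while the goal is active
    start: float         # beginning of the time interval
    end: float           # end of the time interval


def overlap(a, b):
    """True when the two time intervals share more than a single instant."""
    return a.start < b.end and b.start < a.end


def in_conflict(a, b):
    """Two goals conflict if they constrain the same variable during
    overlapping intervals and no value satisfies both constraints."""
    return (a.variable == b.variable
            and overlap(a, b)
            and not (a.allowed & b.allowed))


# The area is "dangerous" between t1 = 1 and t2 = 2.
avoid = Goal("avoid hazardous area", "drone_zone", frozenset({"safe"}), 1, 2)
# Following the target requires entering the area while it is still dangerous.
follow_inside = Goal("follow target X", "drone_zone", frozenset({"hazard"}), 1.5, 3)
# Following the target requires entering the area only after t2.
follow_later = Goal("follow target X", "drone_zone", frozenset({"hazard"}), 2, 4)

conflict_found = in_conflict(avoid, follow_inside)   # True
no_conflict = in_conflict(avoid, follow_later)       # False
```

Because both goals are plain data, the comparison needs no knowledge of how either goal will eventually be achieved.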
Fault tolerance is included in the system naturally, as error conditions are treated in the same way as normal states. Moreover, the states do not need to be explicitly and accurately described, as it suffices to define them only with respect to the observable main states. One possibility that is closer to our concept of “unforeseen fault” is to describe the “normal” or “acceptable” behaviour of the system. An error is detected when the observations on the system do not match the expected behaviours. We consider this natural inclusion of fault tolerance in the design specifications useful for reliable systems, and it can be used to handle cases of “unforeseen faults”. Compared to our approach, the MDS does not discuss the distribution into several entities (the multi-agent design in our case), which we consider very important for fault tolerance.

Recovery Blocks

In the early days of software fault tolerance, an enriched program [54] and system [88] structure was proposed to allow error detection and recovery. The idea is to include regular tests on the outcome of program execution, together with alternative solutions for the situations in which the original code did not produce the expected results. For this, programs are segmented into recovery blocks. The normal code becomes a succession of primary blocks, each of which is checked by an acceptance test. For each primary block, one or more alternate blocks are provided for the situations in which the acceptance test fails. Should the test of a primary block fail, the alternate blocks are executed one by one until the test succeeds. If none of the alternate blocks for a primary block produces the required results, control is passed to a higher level, where similar measures may apply. Each alternate block is applied as if the previous blocks of the same recovery block had never been applied. To ensure this property, all non-local variables are tagged with a boolean flag when modified, while their original values are stored in a stack. When a primary or alternate block fails, any modifications it made to these variables are undone. Should more specific recovery measures be needed, dedicated procedures can be defined and triggered by the same mechanisms as the automatic variable recovery.
The types of errors that are covered by this technique are generic and of interest for our approach:
• errors in the block that are detected by the acceptance tests.
• failure to terminate, caught by a timeout.
• detection inside the block by an implicit error detection mechanism (e.g. division by zero).
• the failure of an inner recovery block.
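The control flow of a recovery block can be sketched in Python as follows. This is a simplified illustration rather than the original mechanism: state restoration is done by giving each block a deep copy of the state instead of tagging modified variables and stacking their original values, timeouts are not handled, and the names recovery_block and RecoveryBlockFailed are hypothetical.

```python
import copy


class RecoveryBlockFailed(Exception):
    """Raised when the primary and all alternates fail; handled one level up."""


def recovery_block(state, acceptance_test, primary, *alternates):
    """Run the blocks in order until one passes the acceptance test.

    Each block works on its own copy of the state, so a failed block
    leaves no trace for the next one.
    """
    for block in (primary, *alternates):
        candidate = copy.deepcopy(state)   # checkpoint: work on a copy
        try:
            block(candidate)
        except Exception:
            continue                       # implicit detection, e.g. ZeroDivisionError
        if acceptance_test(candidate):
            return candidate               # accept this block's result
    raise RecoveryBlockFailed("no block passed the acceptance test")


def primary(s):
    # Faulty primary: empties the data before using it, then divides by zero.
    s["values"].clear()
    s["mean"] = sum(s["values"]) / len(s["values"])


def alternate(s):
    s["mean"] = sum(s["values"]) / len(s["values"])


def accept(s):
    # Acceptance test: the mean must lie between the smallest and largest value.
    return s["mean"] is not None and min(s["values"]) <= s["mean"] <= max(s["values"])


state = {"values": [3, 1, 2], "mean": None}
result = recovery_block(state, accept, primary, alternate)
```

Since RecoveryBlockFailed is an ordinary exception, a recovery block nested inside another block reports its failure to the enclosing one, which matches the last item of the list above.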

A Case for Automatic Exception Handling

Cabral and Marques [16] offer an insight into the way exceptions are used in Java and .NET and conclude that programmers treat exceptions lightly:
• generic exceptions are thrown, which are difficult to handle and recover from properly.
• generic catching mechanisms are provided, resulting in poor recovery (causing the program to continue in a corrupt state). There are even cases in which errors are not caught at all, allowing the program to crash even on minor errors.
• providing “proper” exception handling decreases productivity and can have negative effects on the overall software development project.
• providing “proper” exception handling can be challenging and can even contribute to the introduction of new errors.
They go on to make “A Case for Automatic Exception Handling” [17], drawing a parallel with the introduction of garbage collection for memory management. The idea is to improve software quality and robustness by covering exception cases better, and also to ease the programmer’s task by minimising the error-handling input required.
Their solution combines exceptions with an execution model similar to the recovery blocks approach discussed here in Sec. 2.1.5. Programmers can either let the platform handle exceptions or provide specific handlers. The platform handling, however, relies on exception-specific actions, which can include throwing a new exception to be handled by the higher level, i.e. the caller of the caller, and which the programmer provides in a separate configuration file. This lightens the programmer’s task when writing the bulk of the application but still requires his or her involvement and concern for specific, foreseen cases. At runtime, an execution section producing an exception can be run multiple times, each time applying a different handler, until recovery is successful or the last handler, “Log&Abort”, is reached. A transactional model ensures that after each exception handler is executed the application state is restored to the initial condition so that the code can be run again.
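A rough Python sketch of this retry loop is given below. It is not the authors’ platform: the function and class names are hypothetical, the configuration file is replaced by a dictionary mapping exception types to ordered handler lists, handlers are looked up by exact exception type, and the rollback is performed before each handler runs so that the handler’s repair is visible to the retried section (the ordering in the original system may differ).

```python
import copy
import logging


class Abort(Exception):
    """Signals that every configured handler was tried without success."""


def run_with_auto_handling(section, state, handlers):
    """Run section(state); on an exception, try the handlers configured for
    its type one at a time, restoring the initial state before each retry."""
    snapshot = copy.deepcopy(state)
    try:
        section(state)
        return state
    except Exception as exc:
        error = exc
    for handler in handlers.get(type(error), []):
        state = copy.deepcopy(snapshot)    # roll back to the initial condition
        try:
            handler(state)                 # exception-specific recovery action
            section(state)                 # run the same section again
            return state
        except Exception as exc:
            error = exc
    # Last resort, playing the role of the "Log&Abort" handler.
    logging.error("unrecovered %s: %s", type(error).__name__, error)
    raise Abort("all handlers exhausted") from error


def load(s):
    s["result"] = s["table"][s["key"]]


config = {
    KeyError: [
        lambda s: s.update(key="missing-too"),   # first attempt, still fails
        lambda s: s.update(key="a"),             # second attempt, succeeds
    ],
}
recovered = run_with_auto_handling(
    load, {"table": {"a": 1}, "key": "b", "result": None}, config)
```

The section itself (load) contains no error-handling code at all, which is the point of the approach: recovery knowledge lives in the configuration.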
Their study on exceptions shows that there are cases in which fault handling is poorly done and can result in a system crash or even in the system continuing in an inconsistent state. This means that even errors that were foreseen, for example because the language would normally force the programmers to provide a specific handler, become unforeseen because they are not treated, or not treated correctly, in the finished application. Furthermore, there are also situations in which programmers could use the aid of the platform for handling certain types of error. Our goal is to provide a development framework (platform, language and design requirements) that allows the programmer to rely on the platform for the automatic handling of at least some of the runtime exceptions. The safety net in this case is used in a conscious manner by the programmer, who either throws exceptions knowing that they are handled by our mechanisms, or simply does not provide generic, empty or possibly wrong handlers, knowing that the platform will take care of the exceptions concerned. Note, however, that like the authors of the cited studies, we too acknowledge the limits of a completely generic mechanism for handling exceptions; we therefore need to integrate into the language the features that facilitate recovery, e.g. goals with satisfaction tests.


Defensive Programming

The software engineering technique called defensive programming requires programmers to systematically cover all possible cases, even if this may seem redundant. While this technique does bring robustness benefits, it does so by relying heavily on the judgement of the programmer, who is forced to add numerous tests to ensure the correct values for all variables. More tests mean more code, and this comes with an increased risk of errors. This technique is thus outside the scope of our work but constitutes an interesting example of an expensive fault tolerance technique that still offers no guarantees.

Design by Contract and Executable Specifications

Design by contract. The contract programming paradigm was introduced by Meyer with the Eiffel programming language [74, 75]. The idea is to require the programmer to systematically specify the conditions to check, but without the complexity of the defensive programming approach. These conditions (annotations) are assertions, with which the programmer associates a truth value and which have their own semantics (not necessarily the same as that of the language). In general, this semantics corresponds to boolean expressions with first-order logic quantifiers. This programming paradigm is used not only to test systematically during execution (and thus, in a way, to provide an elegant means of performing defensive programming), but also to analyse the code. In certain cases, one can indeed link the contracts to an automatic prover or a static analysis tool. There are three types of assertion:
• Precondition: verified before an operation, for example a function call, which will not be performed if the assertion is not valid.
• Postcondition: verified after an operation.
• Invariant: an assertion that must hold permanently, either during the entire program execution or more locally (e.g. in a loop).
Contract programming is a popular paradigm as it increases the robustness of software and also reduces the debugging time. Various programming languages contain an annotation facility in order to comply with the paradigm, for example SPARK [7] for annotating Ada code.
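As an illustration of preconditions and postconditions checked at runtime, the following minimal Python sketch attaches both to a function through a decorator. Python has no built-in contract facility comparable to Eiffel’s, so the contract decorator and the ContractViolation exception are hypothetical helpers written for this example; invariants are left out for brevity.

```python
import functools


class ContractViolation(AssertionError):
    """Raised when a precondition or postcondition does not hold."""


def contract(pre=None, post=None):
    """Attach a precondition on the arguments and a postcondition on the result."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise ContractViolation(f"precondition of {func.__name__} failed")
            result = func(*args, **kwargs)  # only reached if the precondition holds
            if post is not None and not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition of {func.__name__} failed")
            return result
        return wrapper
    return decorate


@contract(pre=lambda x: x >= 0,
          post=lambda r, x: abs(r * r - x) <= 1e-9 * max(1.0, x))
def square_root(x):
    return x ** 0.5
```

Calling square_root with a negative argument raises ContractViolation before the body runs, which corresponds to the rule that an operation is not performed when its precondition is not valid.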
Executable specifications. Another approach is the use of executable specifications [41] for increasing software reliability. The goal in this case is to identify errors and deviations from the user intent during the development process, in order to correct them early in the application life-cycle. More recently, Samimi et al. [98] extended the application of executable specifications to runtime, thus obtaining a use similar to that of contracts, in an approach called Plan B. The specifications are used to check the postconditions after execution. If the execution fails (through a RuntimeException in Java, e.g. an ArrayIndexOutOfBoundsException or a NullPointerException) or if the postconditions are not as required, then instead of halting, the execution falls back on the specifications, which are used to try to provide an alternative solution. The authors aim to:
• increase software reliability by introducing redundancy through the specifications and by catching the error states in order to handle them using the specifications. The advantage is twofold: the imperative and more efficient implementation (in Java in their case) is used for the actual execution, followed by a verification and possibly an attempt at recovery through the more computationally expensive specifications.
• improve the developer’s experience by not requiring him or her to program the specific cases. If they occur, an exception is thrown, which causes the execution to fall back on the specifications.
These approaches rely on the programmer for the verifications, but both are more refined than defensive programming and can cover unforeseen faults. Furthermore, the Plan B approach, like the recovery blocks above, also provides mechanisms for attempting to recover in case of error.
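The fallback idea can be illustrated with a small Python sketch. It does not reproduce the Plan B system: sorting is an arbitrary example, the brute-force search over permutations merely stands in for whatever mechanism derives a result from the specification, and all names (spec_sorted, solve_from_spec, buggy_sort, with_plan_b) are hypothetical.

```python
from collections import Counter
from itertools import permutations


def spec_sorted(inp, out):
    """Executable specification: out is an ordered permutation of inp."""
    return (Counter(inp) == Counter(out)
            and all(a <= b for a, b in zip(out, out[1:])))


def solve_from_spec(inp):
    """Slow fallback: search for any output that satisfies the specification."""
    for candidate in permutations(inp):
        if spec_sorted(inp, list(candidate)):
            return list(candidate)
    raise ValueError("specification is unsatisfiable")


def buggy_sort(inp):
    """Fast implementation with a deliberate bug: it drops duplicates."""
    return sorted(set(inp))


def with_plan_b(implementation, spec, fallback):
    """Wrap an implementation so that a runtime error or a violated
    postcondition triggers the specification-based fallback."""
    def run(inp):
        try:
            out = implementation(inp)
            if spec(inp, out):
                return out          # fast path: the result meets the specification
        except Exception:
            pass                    # runtime failure: fall through to the fallback
        return fallback(inp)
    return run


safe_sort = with_plan_b(buggy_sort, spec_sorted, solve_from_spec)
```

For inputs without duplicates the fast implementation is used as is; for an input such as [2, 1, 2] its result violates the specification and the slower search supplies the answer instead.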

The Mercury Programming Language

Mercury [51] arose from the observation that even though Prolog was more expressive than the imperative programming languages of the 1990s, it was not much used by companies. The two main arguments given by the creators of Mercury are:
• the Prolog compilers do not detect enough errors at compile time.
• programs written in Prolog are noticeably slower than those written in imperative languages.
Mercury is a strongly typed language, offering a more advanced type system than Prolog. It also has a means of analysing the input/output modes of predicates (i.e. the state of instantiation of the variables of a predicate) and a determinism analyser (to identify the number of potential outputs of a predicate). These verifications increase both the reliability of software, by helping to avoid certain runtime errors, and the execution speed (e.g. no backtracking is performed on a deterministic predicate). However, this is achieved through language restrictions, in particular on constructs that are outside the scope of first-order logic, e.g. the “cut”.
A compromise is thus required between the restrictions imposed on the programmer and the ease of programming in a language. For the tolerance of unforeseen faults, we need to keep the chosen language usable and expressive, and at the same time include restrictions that guide the programmer towards more reliable code.

Table of contents:

I introduction and state of the art
1 introduction
1.1 Raison d’Être
1.2 Weaving a Net
1.3 Separating Reasoning from Acting
1.4 Definitions and Working Hypotheses
1.5 Thesis Structure
2 state of the art
2.1 The Tolerance of Unforeseen Faults
2.1.1 The Observer
2.1.2 Anomaly detection
2.1.3 TibFit and Chameleon
2.1.4 Mission Data System
2.1.5 Recovery Blocks
2.1.6 A Case for Automatic Exception Handling
2.1.7 Defensive Programming
2.1.8 Design by Contract and Executable Specifications
2.1.9 Let It Crash
2.1.10 The Mercury Programming Language
2.2 Fault Tolerance with and for Agents
2.2.1 A Perspective on Exceptions in Multi-Agent Systems
2.2.2 Communication Standards for Agent Fault Tolerance
2.2.3 Replication
2.2.4 Detecting Errors Through Agent Disagreement
2.2.5 The Sentinels
2.2.6 Norms, Trust and Reputation
2.2.7 Agent Autonomy for Robust Agents
2.3 Goal-Driven Agents
2.3.1 Describing Goals
2.3.2 The Goal Life-Cycle
2.3.3 Reasoning on Agent Goals
2.3.4 The Goal-Plan Tree
2.4 ALMA: An Agent Language for Dependable Agents
2.4.1 ALMA Motivations
2.4.2 Problem Solvers and Truth Maintenance Systems
2.4.3 Parenthesis on Model Based Diagnosis
2.4.4 The Programming Language
2.5 Conclusion
II contribution to the fault tolerance
3 a safety net approach to fault tolerance
3.1 Expecting the Unexpected: Error Detection
3.1.1 Exception-Based Detection
3.1.2 Objective-Based Detection
3.2 Avoiding Further Error Propagation: Confinement
3.3 System Recovery
3.3.1 Dependency Handling
3.3.2 Reparation
3.3.3 Reconfiguration
3.4 The Programmer’s Guide for a Safety Net
3.4.1 Language Requirements
3.4.2 Platform Requirements
3.4.3 Design Requirements
3.5 Discussion
4 an instantiation of the safety net
4.1 The Base Language
4.2 Extending ALMA for The Safety Net Approach
4.2.1 The unexpected Keyword
4.2.2 Goals
4.2.3 Plans
4.2.4 The ALMA+ Model and Language
4.3 The Three Fault Tolerance Phases in ALMA+
4.3.1 Detection
4.3.2 Confinement
4.3.3 Recovery
4.4 Extending the Platform
4.4.1 Language Extension Support
4.4.2 Safety Net Support
4.4.3 Agent Architecture
4.5 Discussion
5 experimenting
5.1 The CNP+ Scenario
5.2 Modelling the Agents
5.2.1 The Initiator Agent
5.2.2 The Main Contractor Agent
5.2.3 The Worker Agent
5.2.4 Giving Unanticipated Errors a Thought
5.3 Adding The Safety Net Mechanisms
5.4 The Safety Net at Work
5.4.1 Study by Type of Confinement
5.4.2 Study by Location of Error Occurrence in the Agent Code
5.4.3 Other Error Situations
5.5 Discussion
III contribution to goal programming
6 the goal-plan separation
6.1 Goal-Plan Trees to Goal-Plan Separation
6.2 The Goal Reasoning Level
6.3 Mars Rover Scenario
7 gps method implementation
7.1 Examples of Possible Models for the Goal Reasoning Level
7.1.1 Reasoning through Rules
7.1.2 Reasoning Using a Planner
7.2 Reasoning through a Goal Plan
7.3 Reasoning through Multiple Goal Plans
7.4 Execution
7.5 Key Literature Aspects
8 experimenting with gps
8.1 An Application for Maritime Surveillance
8.1.1 In the Lead Role: The Aircraft Agent
8.1.2 GPS for Modelling the Aircraft Agent
8.1.3 Discussion
8.2 The Deployment of Ambient Intelligence Applications
8.2.1 Scenario
8.2.2 Multi-agent Modelling
8.2.3 Design and Implementation
8.2.4 Discussion
8.3 Overview
IV conclusions
9 conclusions
9.1 The Safety Net Approach
9.2 The Goal-Plan Separation Approach
9.3 Putting It All Back Together
V appendix
a controlling goal execution
b models of the cnp+ agents
b.1 The Initiator Agent
b.1.1 Agent Goals
b.1.2 Agent Plans
b.2 The Main Contractor Agent
b.2.1 Agent Goals
b.2.2 Agent Plans
b.3 The Worker Agent
b.3.1 Agent Goals
b.3.2 Agent Plans
c error response by location of occurrence in cnp+
bibliography