Fukushima Daiichi Nuclear Accident
On March 11, 2011, the Great East Japan Earthquake occurred. The main shock had a magnitude of 9.0 and lasted two minutes. At the Fukushima Daiichi Nuclear Power Plant (NPP), three boiling water reactors (units 1, 2 and 3) were operating and the three others (units 4, 5 and 6) were shut down for refuelling and maintenance when the earthquake occurred.
The NPP detected the earthquake immediately and protection mechanisms automatically shut down the three operating reactors. Even after being shut down, reactor cores continue to generate heat, so the cooling systems have to be kept running to avoid overheating. This phenomenon is called decay heat and is predictable. However, the power lines that supplied AC power to the NPP were damaged by the earthquake. In this case, as described in the procedures, emergency diesel generators started automatically to supply AC power to the cooling systems of the six units. Thus, at this time, the NPP reacted as designed, and heat and pressure were under control for all six units of the plant.
Unfortunately, the earthquake had also generated a tsunami (with “tsu” meaning “harbour/ford” and “nami” meaning “wave” in Japanese) that overwhelmed the seawalls and flooded and damaged many buildings and much equipment. The seawalls had been designed according to historical data from the past fifty years and were expected to protect against a maximum tsunami height of 5.5 m, but some waves of this tsunami reached about 15 m. Among the damaged equipment were the emergency diesel generators and some of the DC batteries that were supposed to be used as substitutes if the AC generators were shut down or unavailable. This resulted in a blackout for units 1, 2 and 4. As a consequence, operators were no longer able to monitor or control critical plant parameters such as pressure, water temperature and water level. As no operating procedure addressed the loss of both DC and AC power, it was decided that the priority of the emergency response was to restore power.
The most critical situation concerned unit 1, where fresh water was injected into the reactor core once the internal pressure allowed it. When the fresh water tanks were depleted, it was decided to inject seawater from the tsunami flood. About one day after the tsunami, as the situation improved, a series of explosions caused by hydrogen leaks damaged the seawater injection lines and the temporary power lines that had been installed. On top of that, the cooling system of unit 3 stopped, the reactor began to overheat, and explosions occurred in unit 3 two days later. The next day, a new series of explosions damaged units 2 and 4 and shut down their cooling systems. It took nine days for the first reactor (unit 5) to reach cold shutdown and more than three months to achieve a significant suppression of radioactive releases. To sum up, the reactor cores in units 1, 2 and 3 overheated, the nuclear fuel melted and their containment vessels were breached. Hydrogen leaked from some of the breaches and caused explosions that injured workers, damaged structures and equipment, and slowed down repairs. Radioactive particles were released into the atmosphere for months and contaminated the land and ocean around the plant. People within a 20 km radius of the plant were evacuated.
A Need for Resilience
The first three examples illustrate that cyber-security is needed to prevent cyber-physical attacks. However, it is not sufficient, as highly motivated and resourceful opponents would still be able to break in and perpetrate attacks. On top of that, safety and dependability models have shown some limits, and major crises can occur even in the most critical infrastructures such as the Fukushima Daiichi NPP or the Natanz uranium enrichment plant.
To address these weaknesses, research is being conducted to improve system resilience. However, before proposing ways to improve the resilience of a system, it is necessary to be able to assess, or even quantify, its current resilience or the gain in resilience that would result from system modifications. The work presented in this thesis falls within this perspective.
Several models and metrics have been developed during the last two decades in order to quantify system resilience. However, most of them share a common characteristic: they require extensive knowledge of undesired events and of their impact on the system to be protected. This implies either that an exhaustive list of undesired events must be established or that resilience can be assessed only once the undesired event has occurred on the system. Moreover, resilience evaluations of this kind are event-specific, and the global resilience of the system is basically the sum of the resilience evaluations for all possible events. Therefore, resilience metrics have the same limitations as safety and dependability models, as they need undesired events to be specified.
In order to remedy this problem, a new model of resilience evaluation is proposed. This model allows the resilience of a system to be evaluated by detailing its components and their relationships. The approach is thus focused on extensive knowledge of the system itself rather than on extensive knowledge of possible attacks and failures, their impact on the system and the system behaviour under challenge.
Several definitions and metrics of resilience have been developed since Holling’s article in 1973. Many of them are described and compared with each other in Chapter 2. Regarding definitions, many attributes have been used to describe resilience, and definitions can be classified according to the attributes they use. These attributes also allow the classification of metrics. Indeed, many metrics are built on separate evaluations of different resilience attributes. Other criteria are also considered, such as the need to specify undesired events and the use of probabilities or fuzzy logic.
Once these metrics have been described and explained, our evaluation model is presented in Chapter 3. In this model, a system is represented as a directed graph whose vertices are components and whose edges are service provider/consumer relationships. The description of the model consists of two parts: first, the description of system services as a lattice of different dimensions and values that correspond to the likelihood that services are delivered; second, the description of components, which explains how components manipulate input values in order to compute the output values corresponding to the services they produce. The model is, however, biased for some system configurations. This bias, called the “double counting problem”, is explained and solved in this chapter.
Chapter 4 is composed of two parts. The model has been implemented in F#, and the main implementation choices are presented and explained in the first part of this chapter. In the second part, a use-case representative of a typical industrial system is described. Several configurations of this use-case are then represented using our resilience evaluation model. Resilience is evaluated for all these configurations and the results are compared with each other to determine which configuration is the most resilient.
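As a rough illustration of the graph representation described above, the following Python sketch propagates a single likelihood value in [0, 1] from providers to consumers. It is a deliberately simplified stand-in and not the model of Chapter 3: the thesis uses a lattice of data dimensions and component-specific resilience functions, whereas the component names, the two combination rules (`all_required`, `any_sufficient`) and the numeric values below are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical combination rules; the thesis defines its own resilience functions.
def all_required(inputs):
    """Every provider is needed: the weakest input bounds the output."""
    return min(inputs)

def any_sufficient(inputs):
    """Providers are redundant: the best input determines the output."""
    return max(inputs)

def evaluate(providers, rules, sources):
    """Propagate service likelihoods through a directed acyclic graph.

    providers: component -> list of components whose services it consumes
    rules:     component -> function combining a list of input likelihoods
    sources:   component -> fixed likelihood for components without providers
    """
    values = dict(sources)
    # static_order() yields each component after all of its providers.
    for node in TopologicalSorter(providers).static_order():
        if node in values:
            continue
        values[node] = rules[node]([values[p] for p in providers[node]])
    return values

# Illustrative system: two redundant supplies feed a power service, which a
# controller and an actuator both consume; the controller also needs a sensor.
providers = {
    "supply_a": [], "supply_b": [], "sensor": [],
    "power": ["supply_a", "supply_b"],
    "controller": ["power", "sensor"],
    "actuator": ["power", "controller"],
}
rules = {
    "power": any_sufficient,
    "controller": all_required,
    "actuator": all_required,
}
sources = {"supply_a": 0.6, "supply_b": 0.9, "sensor": 0.7}

result = evaluate(providers, rules, sources)
print(result["power"], result["controller"], result["actuator"])
```

Comparing the value obtained at the consumer end of the graph for several alternative `providers` dictionaries is one simple way to mimic the configuration comparison announced for Chapter 4.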
Two extensions are proposed in Chapter 5 in order to improve the usability of the model. The first consists in adding symbolic calculation to the existing implementation so that deeper analysis of resilience can be performed. Details of this implementation are given and explained. The second aims at providing an easier and more usable way to describe the components of a system. This is done by representing the behaviour of components with matrices and working with a simpler version of the resilience model, as detailed in this chapter.
A System Property
In , resilience is described as a system property to endure undesired events in order to ensure “the continuity of normal system function”. This ability corresponds to three system capacities: absorptive, adaptive and restorative. It could be argued that this definition goes against the original concept of resilience given by Holling in , as the continuity of normal function can be read as a synonym of system stability. However, the authors also specify that resilience postulates flexibility in terms of performance, structure and function, as long as these changes are neither irreversible nor unacceptable.
Resilience is defined in  as the maintenance of “state awareness and an accepted level of operational normalcy in response to disturbances”. Operational normalcy corresponds to the maintenance of “stability and integrity of core processes” according to , and resilience is described by Wreathall  as the ability to “keep, or recover quickly to, a stable state”. These definitions confirm the previous description, as resilience focuses on some operational stability even if systems are supposed to “tolerate fluctuations via their structure, design parameters, control structure and control parameters”. A new point highlighted by this definition is the need to collect and fuse data concerning the current state of the system. This awareness of the current state of the system and its environment is a basis for decisions. Processes to collect, merge and prioritise information should be considered when designing resilient systems. Indeed, a resilient system should not be considered as a single technology but as a complex integrated system of systems that ensures coordination among subsystems through communication and sharing of information.
Resilience Is Related to Service Delivery
In , the systems considered are networks and their resilience is defined as the ability “to provide and maintain an acceptable level of service in face of various faults and challenges to normal operation”. This definition is close to another one given by Laprie in , where resilience is “the persistence of service delivery that can justifiably be trusted, when facing changes”. In both definitions, resilience focuses on service delivery and particularly on the avoidance of service failure. System services are the system behaviour as it is perceived by its users. They are different from system functions, which correspond to the expected result of the system behaviour, in other words what the system is intended to do. Delving into the more specific domain of cyber-physical systems, Clark and Zonouz  define resilience as the “maintenance of the core [...] set of crucial sub-functionalities despite adversarial misbehaviors” and a guarantee of “recovery of the normal operation of the affected sub-functionalities within a predefined cost-limit”. Again, this definition reinforces the need to maintain service delivery above a fixed threshold. If a perturbation drives the system below this threshold, then the system is in an unacceptable state and has failed to be resilient.
Power systems are considered in  and resilience is defined as the “ability to maintain continuous electricity flow to customers given a certain load prioritisation scheme”. According to the authors, traditional risk assessment is not the best approach to achieve resilience, as resilience concerns “unexpected rare extreme failures” whose likelihood cannot be easily estimated. Thus, this definition completes the previous ones: it focuses on service delivery and underlines that some services are more critical than others and should not be interrupted.
A commonly accepted definition of resilience is given in . In this article, Vugrin et al. describe resilience as the ability of a system, for a given disruptive event, to “reduce ‘efficiently’ both the magnitude and the duration of the deviation from targeted ‘system performance’ levels”. This definition has frequently been used to propose resilience metrics based on system performance such as some metrics detailed in section 2.4.1 and section 2.4.2. This definition and its derived metrics also imply that a system has different levels of resilience to different disruptions and an evaluation of resilience is needed for every specific disruption.
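To make the two quantities named in this definition concrete, the following Python sketch summarises the magnitude and the duration of a deviation from a targeted performance level for a sampled performance curve. It is a generic illustration under stated assumptions, not the metric of Vugrin et al. nor any metric from section 2.4: the function name `deviation_summary`, the trapezoidal approximation and the sample values are all hypothetical.

```python
def deviation_summary(times, performance, target):
    """Summarise how far and for how long performance stays below target.

    times:       increasing sample instants
    performance: measured system performance at each instant
    target:      targeted performance level (assumed constant here)
    """
    # Only shortfalls count as deviation; exceeding the target is ignored.
    shortfall = [max(target - p, 0.0) for p in performance]
    area = 0.0      # cumulative deviation (trapezoidal rule)
    duration = 0.0  # total length of sampling intervals that touch a shortfall
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        area += 0.5 * (shortfall[i] + shortfall[i - 1]) * dt
        if shortfall[i] > 0 or shortfall[i - 1] > 0:
            duration += dt
    return {"magnitude": max(shortfall), "duration": duration, "area": area}

# Two hypothetical responses to the same disruption: same initial drop,
# but the second system recovers faster.
times = [0, 1, 2, 3, 4, 5]
slow = deviation_summary(times, [100, 100, 40, 60, 80, 100], target=100)
fast = deviation_summary(times, [100, 100, 40, 100, 100, 100], target=100)
print(slow)
print(fast)
```

Both responses have the same magnitude, yet the second one accumulates a smaller deviation over a shorter time. The summary is also tied to one disruption scenario, which reflects the point that a separate evaluation is needed for every specific disruption.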
Ayyub proposes a close definition in , where resilience is “the ability to prepare for and adapt to changing conditions and withstand and recover rapidly from disruptions”. In contrast to the previous definition, resilience is not only concerned with the occurrence of disruptions but is also considered in a pre-disruption phase, as this definition points out a need for preparation and evolution. Another similar definition is given by Haimes in , where resilience is “the ability of a system to withstand a major disruption within acceptable degradation parameters and to recover within an acceptable time and composite costs and risks”. Compared to the previously described definitions, Haimes points to the need to estimate the cost of the recovery process. However, some definitions do not consider the amplitude of disruptions. In , resilience is “the ability to recover as soon as possible after an unexpected situation”. The authors nevertheless point out the need to minimise the consequences of disruptions, but only with a view to faster recovery.
A recent work suggests looking at resilience from a different perspective. In , a system is considered as a set of resources for which particular states are expected, such as ensuring personal safety or preserving the confidentiality of a database. Security is the capacity of the system to maintain the expected states of its resources. However, security breaches can occur, and resilience is defined as “the maintenance of a nominated state of security”. This resilience is achieved by detecting, containing and resolving a security breach. While many approaches only consider resilience to accidental faults, this one seems to focus only on attacks.
Description of Resilient Systems
It is commonly accepted that the resilience of a system is supported by three system capacities. These capacities are first described in . In this article, Holling compares the resilience of a population with a game “in which the only payoff is to stay in the game”. Thus a resilient population has “a high capability of absorbing periodic extremes of fluctuation”, maintains “flexibility above all else” and can “restore its ability to respond to subsequent unpredictable environmental changes”. These capacities are known as absorbability, adaptability and restorability, and are considered so central to the notion of resilience that they are frequently used to define it [38, 95].
Table of contents:
1.1 Maroochy Water Waste
1.4 Fukushima Daiichi Nuclear Accident
1.5 A Need for Resilience
2.1 Preliminaries – Definition of Critical Infrastructures
2.2 Resilience Definitions
2.2.1 A System Property
2.2.2 Resilience Is Related to Service Delivery
2.2.3 Events Handling
2.2.4 Other Definitions
2.3 Description of Resilient Systems
2.3.4 Other Capacities and Descriptions
2.4 How to Measure Resilience
2.4.1 Quantitative Deterministic
2.4.2 Quantitative Probabilistic
2.4.3 Fuzzy Models
2.4.5 Adversarial Events
2.5 Resilience Compared With Other Notions
2.5.1 Risk Assessment
2.5.3 Other Notions
2.6.1 Gaps and Limitations
2.6.2 Concluding Remarks
3 Resilience Model: an End-to-End Approach
3.1 Preliminaries – Fuzzy Logic
3.1.1 Fuzzy Sets
3.1.2 Operations on fuzzy sets
3.1.3 Triangular norm and co-norm
3.2 System Model
3.2.1 System Components and Sources
3.2.2 Lattice of Data Dimensions
3.3 Measure of Resilience
3.3.1 External Consistency
3.3.2 Representation of Data
3.3.3 Resilience of Sources and Components
3.3.4 Computing Resilience across the System
3.4 Overestimation of resilience
3.4.1 Double counting problem
3.4.2 Principle of global loss of external consistency
3.5 The HTTPS Browser Use-Case Extended
4 Use Case and Implementation
4.1 Algorithm and complexity
4.1.1 Lattice Implementation
4.1.2 Graph Implementation
4.1.3 Data and Resilience Functions Implementations
4.1.4 Resilience Evaluation Algorithm
4.2 Use-case: FischerTechnik platform
4.2.1 Brief description of the platform
4.2.2 System description
4.2.3 Model application to diverse configurations
5 Usability Improvements
5.1 Use of symbolic computation
5.1.1 Implementation details
5.1.2 Application to the FischerTechnik use-case
5.2 Easing the definition of resilience functions
5.2.1 Matrix operators using t-conorms
5.2.2 Representation of a partial ordered set of data dimensions with matrices
5.2.3 Resilience functions using matrices
5.2.4 Computing the external consistency of data
5.2.5 Concluding remarks
6 Conclusion and perspectives
A List of components of the FischerTechnik platform
B Resilience functions for the FischerTechnik use-case
C Symbolic calculation simplification rules