Network Protocol Support for Real-Time Systems
Network protocol design for real-time systems should guarantee timely transmission of messages, especially those meant for critical tasks. Kopetz pointed out that even the OSI standard for network protocols does not treat the communication delay between two ends as an issue of interest. Message scheduling is therefore another important design issue for real-time systems. For timely execution of critical tasks, we need guaranteed scheduling that transfers messages deterministically within their timing constraints. A probabilistic scheduling strategy, based on probabilistic or statistical information derived from the network, can be used where the system tolerates occasional delays in the completion of hard tasks. Network protocols for real-time systems can be classified as event-triggered and time-triggered. Event-triggered protocols are more useful in systems where the communication between subsystems is not predictable. Time-triggered protocols are effective in predictable systems, where the schedule of messages and their acknowledgments on the network is known beforehand.
In the controller area network (CAN), nodes reach the bus via CSMA/CA access. CAN is an event-triggered protocol. Tindall introduced a real-time architecture based on the CAN protocol. In CAN, a message's priority is deduced from its network identifier. For real-time scheduling based on the RM algorithm, one can assign network identifiers as a function of the period with which a node uses the network. Thus the CAN protocol can be mapped to a predictable network protocol that schedules messages with deterministic guarantees.
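The mapping just described can be sketched in a few lines. This is an illustrative fragment, not a CAN driver: it only shows how a hypothetical message set would get identifiers so that identifier order matches rate-monotonic priority order, since in CAN arbitration the numerically lower identifier wins the bus.

```python
def rm_can_identifiers(periods_ms):
    """Assign CAN identifiers so that identifier order matches
    rate-monotonic priority: shorter period -> lower (stronger) ID."""
    # Rank message indices by period, shortest first.
    ranked = sorted(range(len(periods_ms)), key=lambda i: periods_ms[i])
    ids = [0] * len(periods_ms)
    for can_id, msg in enumerate(ranked):
        ids[msg] = can_id
    return ids

# Three hypothetical messages with periods 50, 10, and 20 ms:
print(rm_can_identifiers([50, 10, 20]))  # -> [2, 0, 1]
```

The 10 ms message receives identifier 0 and therefore always wins arbitration, which is exactly the RM ordering.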
FIP (Factory Instrumentation Protocol), or WorldFIP, is used for interconnecting sensors, actuators, and processing units. FIP includes a bus arbitrator node dedicated to controlling and scheduling the communication among the other nodes in the network. FIP is a time-triggered protocol: messages are transmitted according to a schedule stored as a bus arbitrator table. This table contains the instantaneous information about which nodes in the network should exchange data at any point in time, and the information in it dictates the performance of the protocol. Tovar showed how the RM and EDF algorithms can be used to ensure that the message exchanges between nodes complete within their deadlines. One important aspect of this protocol is that communication is not initiated by the sender; a node transfers data only when asked by the arbitrator. This is not the case in CAN, where messages are always initiated by the sender.
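The arbitrator-driven exchange can be illustrated with a minimal sketch. The table layout and names below are assumptions for illustration, not the WorldFIP wire format: each entry of the arbitrator table is a variable identifier, the producer of that variable answers the poll, and every consumer reads the value off the bus.

```python
# One minor cycle of a hypothetical bus arbitrator table.
arbitrator_table = [
    "sensor_temp",
    "sensor_pressure",
    "actuator_setpoint",
]

def run_minor_cycle(producers, consumers):
    """Arbitrator-driven exchange: producers only talk when polled.

    producers: dict variable -> zero-arg callable returning its value.
    consumers: dict variable -> list of callables receiving the value.
    """
    for var in arbitrator_table:
        value = producers[var]()             # producer answers the poll
        for deliver in consumers.get(var, []):
            deliver(value)                   # consumers snoop the bus
```

Because the arbitrator, not the sender, decides when each variable is broadcast, the transmission order is fixed by the table and is independent of when producers become ready.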
FDDI (Fiber Distributed Data Interface) has a ring topology. In this protocol, only one node holds the token at a time for communication with the other nodes. The token circulates among the nodes in their physical order, and each node has a fixed amount of time to transmit its messages. The Target Token Rotation Time parameter is used for decisions on the instantaneous allocation of the medium to a node. Apart from the real-time constraints on the FDDI protocol, we need to consider its fault-tolerance aspects, because the protocol we are working on is also a ring protocol. Sankar showed how reconfiguration can be achieved in the event of a failure in the FDDI protocol using two rings. Chen et al. developed an FDDI-based fault-tolerant communication protocol that supports timely delivery of hard real-time messages.
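A simplified sketch of the timed-token rule clarifies how the Target Token Rotation Time (TTRT) allocates the medium. This is an assumed simplification of the FDDI timed-token protocol: on each token visit a node always gets its synchronous allocation, and it may send asynchronous traffic only for the time by which the token arrived earlier than the TTRT.

```python
def transmit_budget(ttrt, last_rotation, sync_alloc):
    """Return (synchronous, asynchronous) transmit time for one token visit.

    ttrt:          target token rotation time (ms)
    last_rotation: measured time of the last token rotation (ms)
    sync_alloc:    this node's fixed synchronous allocation (ms)
    """
    # Asynchronous traffic is allowed only if the token came back early.
    async_budget = max(0.0, ttrt - last_rotation)
    return sync_alloc, async_budget

# TTRT of 8 ms, token rotated in 5 ms, 1 ms synchronous allocation:
print(transmit_budget(8.0, 5.0, 1.0))  # -> (1.0, 3.0)
```

When the ring is loaded and the token rotates in exactly the TTRT, the asynchronous budget collapses to zero, which is what bounds the worst-case token rotation and makes hard deadlines analyzable.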
A Few Existing Real-Time Distributed Systems
In this section, we look at a few established fault-tolerant real-time systems, the Spring and MARS systems, and highlight some of their notable features.
Spring is a distributed real-time system. It consists of a redundant arrangement of multiprocessor nodes. The application processors execute the critical tasks of the system. The system processor makes the scheduling decisions and makes temporal adjustments to keep unpredictable delays and system overheads off the application processors. The I/O subsystem interfaces with peripherals that have unpredictable response times; the system processor takes care of time-critical I/O operations. The tasks in the Spring system are classified as critical, essential, and unessential. Critical tasks are guaranteed a response at compile time, essential tasks are scheduled online, and unessential tasks are scheduled during processor idle time. In this system, a task is scheduled only after all of its predecessor tasks have executed. Scheduling is implemented in the Spring RTOS at four levels:
– Every application processor runs a simple scheduler that picks a task from the ready queue and marks it for execution.
– A local scheduler, run by the system processor, dynamically generates a feasible schedule for the tasks on the application processors.
– The distributed scheduler tries to accommodate a locally non-guaranteed task on another processor where it is feasibly schedulable.
– The top level of the scheduler adapts the scheduling algorithm to changes in the instantaneous values of the external parameters that affect the task schedule.
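The guarantee notion behind these levels can be sketched as an admission test. The model below is an assumed simplification of Spring's guarantee algorithm: a newcomer is accepted only if an EDF ordering of the already-guaranteed tasks plus the newcomer meets every deadline; if the test fails locally, the distributed level would try another processor.

```python
def guaranteed(tasks, newcomer):
    """Admission test sketch. tasks: list of (exec_time, deadline)
    already guaranteed on this processor. Returns True if the set
    including newcomer is EDF-schedulable from time zero."""
    candidate = sorted(tasks + [newcomer], key=lambda t: t[1])
    t = 0.0
    for exec_time, deadline in candidate:
        t += exec_time          # tasks run back-to-back in EDF order
        if t > deadline:
            return False        # this task would miss its deadline
    return True

print(guaranteed([(2, 4), (1, 6)], (2, 9)))   # -> True, accept locally
print(guaranteed([(2, 4), (1, 6)], (7, 9)))   # -> False, try elsewhere
```

The key point is that a "guarantee" here is a yes/no answer computed before the task runs, which is what lets Spring promise predictability for the accepted set.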
The Spring system has dedicated hardware for scheduling. It categorizes tasks and implements a different scheduling policy for each kind: for example, all critical tasks are guaranteed a feasible schedule beforehand, while essential tasks are scheduled online. The designers of the Spring real-time system present a paradigm that integrates “flexibility with predictability”: the scheduling architecture maps the different types of tasks, such as critical tasks, non-essential tasks, and aperiodic tasks, onto a predictable series of tasks for which the online scheduler finds an optimal schedule.
MARS stands for Maintainable Real-Time System. It is a fault-tolerant distributed real-time system, designed primarily for tasks with hard deadlines and for predictable behavior even at high loads. The architecture comprises clusters containing the computational systems along with the data-acquisition nodes, connected to each other by a high-speed synchronous real-time bus. Fault tolerance is achieved by active replication at the cluster level. MARS uses a TDMA protocol on an Ethernet link for timely communication, and messages are replicated on these links for fault tolerance. Static priority scheduling ensures predictable behavior under overload conditions. The applications running on the MARS system have a dataflow architecture. The schedule for hard real-time tasks is computed offline from the precomputed execution times of the tasks, the message scheduling delays, and the communication with other tasks; the hard real-time tasks are executed according to this schedule. Soft real-time tasks are scheduled during idle time at lower loads. With this design, MARS behaves deterministically during overload conditions.
Power Electronics Building Block Design
The CPES lab at Virginia Tech developed a novel technique for building control systems in power electronics [34, 35, 38]. This approach eliminates the dependency on analog circuits for control operations: a real-time computing system executes control applications that implement control algorithms derived from the functioning of the hardware circuits normally used in power-electronics control systems.
Chapter 2 presented the most important parameters and their influence on the performance of a distributed real-time system, and provided a brief look at some of the most important approaches to designing such a system. We also saw how a few of the successful systems have adhered to certain common principles and policy decisions. Before getting into the details of the solution described in this thesis, we consider the different aspects of the design, and then restate the problem in terms of a pure scheduling problem.
This section presents the key aspects of the system that affected the most important design choices.
The whole control function is fragmented into software components called ECO’s (Elementary Control Objects). The ECO’s are the basic execution units of the system and communicate with each other through data channels: a parent ECO writes its outputs into its output data channels, and the child ECO reads its input parameters from the data channels it shares with the parent. The data channels thus form the basic memory unit for storing intermediate values during execution.

Most ECO’s are “stateless” entities: an ECO reads values from its input data channels, performs some computation, and writes the results into its output data channels. It is rare for an ECO to compute based on its previous outputs. However, ECO’s that simulate circuits such as regulators do need the output values computed in the previous switching cycle, and such ECO’s must be updated every switching cycle. One way of doing this is to use approximators that compute an approximate value faster with simpler computations.

If task allocation results in more slack time on a processor, that slack can be used for fault-tolerance measures in case of failure, or a proactive non-critical process for updating the state of the system can be scheduled during it. We can choose an acceptable switching period that is greater than the cumulative execution time of all the critical tasks and use this slack time for recovery measures in the event of failures.

To implement a redundant computational system, we have to replicate the different execution paths of the dataflow graph. This can be done by replicating a set of ECO’s actively or passively. Passive replication means that the replica set of ECO’s is triggered for execution upon failure notification of the master copy of the same set of ECO’s.
Active replication executes the same set of ECO’s in parallel on separate processors. We should also note that updating the output data channel of a passive ECO with the current values eliminates the need for active replication of those ECO’s on that processor. For time redundancy, adding slack time between the executions of tasks is acceptable, provided the resulting switching period of the schedule is acceptable.
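The ECO execution model above can be sketched as follows. The channel names and the control law are hypothetical placeholders, not the actual kernel API: the point is only that a stateless ECO reads its input channels, computes, and writes every output channel it owns, so listing a replica's channel among the outputs keeps a passive backup refreshed as a side effect of normal execution.

```python
# Shared data channels acting as the intermediate-value memory.
channels = {"v_in": 3.0, "duty": None, "duty_backup": None}

def eco_duty_cycle(inputs, outputs):
    """Stateless ECO sketch: derive a duty ratio from the sensed
    input voltage and publish it on all output channels."""
    v = channels[inputs[0]]
    result = min(1.0, 12.0 / v / 10.0)   # placeholder control law
    for ch in outputs:
        channels[ch] = result            # also refreshes the backup channel

# Writing to "duty_backup" keeps the passive replica's channel current.
eco_duty_cycle(["v_in"], ["duty", "duty_backup"])
print(channels["duty"], channels["duty_backup"])  # -> 0.4 0.4
```

Because the ECO holds no internal state, a replica started from the refreshed channel values resumes with no history to reconstruct, which is what makes passive replication cheap for stateless ECO's.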
The communication delay between the processors is a serious issue in this system. Communication between two ECO’s executing on different processors requires the following steps:
– The parent ECO writes the output of its computation into its local output data channel.
– The value written in the output data channel must be broken into data packets and written into FPGA memory buffers. This step is costly because the processing time required for this operation is comparable to the execution cost of an ECO. In later explanations, we refer to such a task as a send task.
– Once the value is written into the FPGA buffer, the packets have to wait until the node receives a null packet. The arrival of null packets depends entirely on the network traffic. Parol has explained a method for estimating this delay.
– Once the data packet enters the network, there is a deterministic guarantee that it will arrive at the destination within (N-1) times a single nodal delay. This holds when there is no failure in the primary ring. However, to be safe, we take the delay incurred in the event of a possible nodal or link failure, and hence assume the value to be 2(N-1) times a single nodal delay.
– Once the data packets are written into the FPGA data buffer from the network, they have to wait until the DSP executes a task that reads from this buffer. The wait depends on how early such a task is scheduled. There is a high possibility that new packets are written into the buffer before the old packets are read from it, resulting in a buffer overflow: the node would no longer receive any new packets from the network. This situation is equivalent to the failure of that node, and to make things worse, the other nodes in the network are never notified of its occurrence.
– When the DSP schedules such a task, it reads the packets from the FPGA memory and writes them into the data channel. The cost of this task is the same as that of the send task. We refer to such a task as a receive task.
The execution time of a send task or receive task is comparable to that of the other ECO’s. From the scheduling point of view, these tasks also add precedence constraints that must be satisfied while scheduling. Once a packet is in the network, the communication delay depends on the number of nodes in the network. The typical nodal delay is 1280 ns; hence, for a system with 20 nodes, the worst-case communication delay under failure sums to 2 × 19 × 1280 ns = 48640 ns. This value is high compared to the execution cost of an ECO, so the communication cost can reduce the improvement gained when we schedule tasks in parallel based on the structure of the DAG.
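The delay bound stated above is a one-line computation; the sketch below just makes the arithmetic explicit, using the 2(N-1) failure-case bound and the 1280 ns nodal delay from the text.

```python
NODAL_DELAY_NS = 1280   # typical single nodal delay from the text

def worst_case_delay_ns(n_nodes):
    """Failure-case bound: the packet may traverse up to 2(N-1)
    nodal delays when the ring reconfigures around a fault."""
    return 2 * (n_nodes - 1) * NODAL_DELAY_NS

print(worst_case_delay_ns(20))  # -> 48640
```

For comparison, the fault-free bound is only (N-1) nodal delays, half this value, which is why the failure case dominates the schedulability analysis.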
Critical Nature of Tasks
Every task in this system is pre-run in the emulator, and the execution time of each task is set to its worst-case execution time. All tasks complete their execution within this worst-case time; thus all tasks have predictable behavior. All tasks in the dataflow graph are critical in this system: once the input driver ECO takes the sensor values from the external world, the output driver ECO must write the computed control value onto the network by the end of the switching cycle, and every ECO contributes to the computations that generate the output values. Hence, we cannot consider any of the tasks in the dataflow graph to be non-critical. However, certain tasks do not belong to the dataflow graphs; tasks such as those that configure a newly added node in the network are non-critical and have no timing constraints. In summary, the ECO’s that are part of the dataflow graph are critical tasks, while those responsible for the configuration of other nodes in the network are non-critical.
1.1 Control Systems in Power Electronics
1.2 Problem Statement
1.3 Fault Tolerant Embedded System for Power Converter Systems
1.4 Contributions of Work
1.5 Organization of Thesis
2.1 Real-Time Computing Systems
2.2 Scheduling Algorithms in Real-Time Systems
2.3 Dataflow Architecture
2.4 Fault Recovery in Real-Time Systems
2.5 Network Protocol Support for Real-Time Systems
2.6 A Few Existing Real-Time Distributed Systems
2.7 Power Electronics Building Block Design
3.1 Design Preamble
3.2 Assumptions Made for the Design
3.3 Fault Tolerance Strategy
4.1 Fault-Tolerant Task Allocation
4.2 Local Scheduler
4.3 Communication Task
5. Underlying Network Protocol
5.1 Startup and Configuration of the Network
5.2 Failure Management
5.3 Plug and Play Support
6.1 Distributed DARK Simulator
6.2 Evaluation Strategy
6.3 DRPESNET Block Simulation
7. Conclusion And Future Work
7.2 Future Work
SUPPORTING DISTRIBUTED FAULT TOLERANCE IN A REAL-TIME MICRO-KERNEL