RAMPAGER: a strategy relying on structure-guided LSTM initialisation


Society of agents

In early works, the society of agents was only characterised by the number of agents belonging to it [28][29]. Later, the society was defined as a set of agents that may also leave it [42]. In this dissertation only closed societies of agents, noted A, are considered, that is societies with a constant number of agents, $N_a = \mathrm{card}(A)$, throughout the mission.
Agents are homogeneous, namely all agents are identical: they stand for the same type of unmanned aerial drone and move at the same speed. Although the characteristics of the mobile robots involved in a mission may vary from one robot to another, the homogeneity hypothesis is usual in the MAP literature. Agents are anonymous, i.e. to every agent all the other agents are identical and it cannot distinguish between them; for the designer, agents are identified by their ID, although their strategy cannot be an explicit function of this ID. This hypothesis is implicitly assumed in some works [53], but rejected in others, where the nodes of the graph are dispatched among the agents [7].

Performance measures

An evaluation criterion, sometimes called a quality criterion [24], a metric, or a performance measure, is a measurement which characterises a particular aspect of a MAP execution for a graph G and a simulation duration $T \in \mathbb{N}^*$. More generally, an evaluation criterion is used to assist in making an objective and fair procurement source decision [33]. In what follows, we describe the most relevant ones for MAP.

Idleness

In MAP, since all places must be visited as often as possible, the time elapsed since the last visit is a natural candidate as the prime and basic criterion. Accordingly, Machado et al. introduced the idleness of a node at time $t \in \mathbb{R}^+$, $t \le T$, where $T \in \mathbb{R}^+$ is the duration of the simulation, as the time elapsed since its last visit [29]. This has led to the following criteria:
• the instantaneous node idleness at time $t \in \mathbb{R}^+$, noted $i_v(t)$, $\forall v \in V$, corresponds to the time elapsed since the last visit of $v$ by any agent at time $t$; when there is no risk of confusion, it is simply termed idleness.
• the instantaneous graph idleness at time $t \in \mathbb{R}^+$, noted $i_G(t)$, corresponds to the average instantaneous idleness of all nodes at time $t$.
• the graph idleness or average idleness, noted $i_G$, $I^{av}_G$, or simply $I^{av}$, is the average instantaneous graph idleness over the whole duration of the execution.
• the worst idleness, noted $i_{\max,G}$, $i_{\max}$ or $WI$, is the largest value of instantaneous node idleness observed during the entire simulation [29]:
$$i_{\max,G} = \max_{t \in \mathbb{R}^+ :\, t \le T,\; v \in V} i_v(t) \qquad (2.1.1)$$
As noted by Acevedo et al., using the WI ensures that the probability that an event is not detected by an agent is bounded above by a known value [1].
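To make these criteria concrete, the following minimal Python sketch computes them from a log of visit times. The data layout (a dictionary mapping each node to the list of times at which it was visited) and the use of discrete time steps are assumptions made for illustration, not part of the original model.

```python
# Minimal sketch (assumed data layout): visits[v] is the list of times at which
# node v was visited by any agent; T is the simulation duration (integer steps).

def node_idleness(visits, v, t):
    """Instantaneous node idleness i_v(t): time elapsed since the last visit of v."""
    past = [s for s in visits[v] if s <= t]
    return t - max(past) if past else t  # never visited: idle since t = 0

def graph_idleness(visits, nodes, t):
    """Instantaneous graph idleness i_G(t): average over all nodes at time t."""
    return sum(node_idleness(visits, v, t) for v in nodes) / len(nodes)

def average_idleness(visits, nodes, T):
    """Average idleness I_av: mean of i_G(t) over the whole run (discrete steps)."""
    return sum(graph_idleness(visits, nodes, t) for t in range(T + 1)) / (T + 1)

def worst_idleness(visits, nodes, T):
    """Worst idleness WI (Eq. 2.1.1): max of i_v(t) over all t <= T and v in V."""
    return max(node_idleness(visits, v, t) for t in range(T + 1) for v in nodes)
```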

Exploration time and decision cost

Machado et al. also introduced two other idleness-independent criteria:
• the exploration time, which is the time the agents need to visit every node of the graph at least once; this criterion can intuitively be thought of as exploring an area in order to map it [29].
• the decision cost, which is the sum of all the durations required by every agent to choose the next node to visit; assuming that agents stop at nodes to decide on the next node, this criterion shows that the longer the decision takes, the longer nodes remain unvisited [28].
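Under the same assumed visit-log layout as above, these two criteria could be computed as in the short sketch below; the per-agent lists of decision durations are a hypothetical input, introduced only for illustration.

```python
def exploration_time(visits, nodes):
    """First instant at which every node has been visited at least once
    (infinity if some node was never visited)."""
    if any(not visits[v] for v in nodes):
        return float("inf")
    return max(min(visits[v]) for v in nodes)

def decision_cost(decision_durations):
    """Sum of the durations every agent spent choosing its next node.
    decision_durations: dict mapping each agent to a list of durations."""
    return sum(sum(ds) for ds in decision_durations.values())
```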

Qualitative measures

In addition to the MAP criteria defined above, Menezes et al. mentioned possible qualitative measures:
• scalability, that evaluates whether the strategy is capable of patrolling over topologies of any dimension.
• stability, that measures the variation on the idleness of each node, indicating whether nodes are uniformly visited.
• adaptability, that evaluates the ability of the strategy to deal with topological changes during the simulation.
• reactiveness, that evaluates the ability of the strategy to patrol over various topologies without learning time, parameter recalibration, or the definition of topology-specific strategies [32].

Cognitive strategies

In the context of MAP, cognitive agents can perceive the whole graph [29]. Cognitive agents are thus able to pursue a goal, namely to plan to visit a distant node. As previously indicated, for the application motivating this work, the focus is on the capacity to plan a visit to a remote node, rather than on perception skills.
Machado et al. proposed the Idleness Coordinator (IC) strategy [29], also called Cognitive Coordinated (CC) by Almeida et al. [2] and in later works. In this strategy, agents can perceive the whole graph and select as the next remote node to visit the one with the highest true idleness. The agent is instructed to visit this node by a centralising coordinator. Everything happens as if there were perfect communication between agents. In practice, agents communicate unrestrictedly with the coordinator, and idlenesses are estimated by the latter on the basis of all the information it has centralised. The coordinator is responsible for preventing more than one agent from choosing the same next remote node, and it responds instantaneously to provide it. When the coordinator is fully informed, i.e. it has global knowledge of the executed and planned paths of all agents, it can be thought of as an omniscient coordinator. Finally, agents use the Floyd-Warshall algorithm to compute the shortest path between their current node $v_0$ and the assigned goal $v_1$. This algorithm minimises the time-to-go from $v_0$ to $v_1$, i.e. $d(v_0, v_1)$.
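The sketch below illustrates the spirit of the CC strategy: a coordinator assigns to each requesting agent an unassigned node with the highest true idleness, and the agent follows a shortest path computed with Floyd-Warshall. The adjacency-matrix input, class names and tie handling are illustrative choices made here, not the authors' implementation.

```python
import math

def floyd_warshall(w):
    """All-pairs shortest-path lengths from a weight matrix w
    (w[u][v] = edge travel time, math.inf if no edge, 0 on the diagonal)."""
    n = len(w)
    d = [row[:] for row in w]
    nxt = [[v if w[u][v] < math.inf else None for v in range(n)] for u in range(n)]
    for k in range(n):
        for u in range(n):
            for v in range(n):
                if d[u][k] + d[k][v] < d[u][v]:
                    d[u][v] = d[u][k] + d[k][v]
                    nxt[u][v] = nxt[u][k]
    return d, nxt

def shortest_path(nxt, v0, v1):
    """Reconstruct the node sequence of a shortest path from v0 to v1."""
    path = [v0]
    while v0 != v1:
        v0 = nxt[v0][v1]
        path.append(v0)
    return path

class Coordinator:
    """Centralised coordinator: keeps the true idlenesses and avoids
    assigning the same goal node to two agents at the same time."""
    def __init__(self, idleness):
        self.idleness = idleness   # idleness[v] = i_v(t), maintained elsewhere
        self.assigned = set()      # goals currently assigned to some agent

    def next_goal(self, current_node):
        # pick the unassigned node with the highest true idleness
        candidates = [v for v in range(len(self.idleness))
                      if v not in self.assigned and v != current_node]
        goal = max(candidates, key=lambda v: self.idleness[v])
        self.assigned.add(goal)    # released (not shown) when the agent arrives
        return goal
```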
Almeida et al. introduced two methods to improve the CC strategy: the first, termed Heuristic, selects the next remote node to visit by exploiting more information than just true idleness, whereas the second, termed Pathfinder, computes the path to reach it by exploiting more information than just distance [2].


Heuristic method

The Heuristic method selects the next goal node by taking into account not only the normalised idleness of a candidate goal node, but also its normalised time-to-go from the agent's current position. The time-to-go between two nodes of V is the travel time of the shortest path between these two nodes. Idleness and time-to-go are normalised by scaling them between 0 and 1. Because nodes with high idleness values ought to be visited first, the value 0 is assigned to the maximum idleness whereas the value 1 is assigned to the minimum idleness. Intermediate values are computed proportionally, as shown in Eq. 2.3.1: for all $v_0 \in V$, if $\min_{v \in V} i_v(t) \neq \max_{v \in V} i_v(t)$,
$$\bar{i}_{v_0}(t) = \frac{\max_{v \in V} i_v(t) - i_{v_0}(t)}{\max_{v \in V} i_v(t) - \min_{v \in V} i_v(t)} \qquad (2.3.1)$$
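A minimal Python sketch of the normalisation of Eq. 2.3.1 follows. The handling of the degenerate case (all idlenesses equal), the analogous normalisation of the time-to-go and the weighted combination of the two scores in heuristic_goal are assumptions made here for illustration, since the excerpt does not specify them.

```python
def normalised_idleness(idleness, v0):
    """Eq. 2.3.1: 0 for the most idle node, 1 for the least idle one.
    idleness: dict mapping each node to its true idleness i_v(t)."""
    i_min, i_max = min(idleness.values()), max(idleness.values())
    if i_min == i_max:
        return 0.0                      # degenerate case (assumed value)
    return (i_max - idleness[v0]) / (i_max - i_min)

def heuristic_goal(idleness, time_to_go, current, weight=0.5):
    """Hypothetical combination: pick the node minimising a weighted sum of
    the normalised idleness and a similarly normalised time-to-go from `current`."""
    d = {v: time_to_go[current][v] for v in idleness if v != current}
    d_min, d_max = min(d.values()), max(d.values())
    def norm_d(v):
        return 0.0 if d_min == d_max else (d[v] - d_min) / (d_max - d_min)
    return min(d, key=lambda v: weight * normalised_idleness(idleness, v)
                                + (1 - weight) * norm_d(v))
```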

Heuristic and Pathfinder methods

Applying these two methods with CC gives rise to the Heuristic Pathfinder Cognitive Coordinated (HPCC) strategy. The decision process of HPCC then comprises two steps:
• the Heuristic method, i.e. the selection of a target node that is not necessarily in the neighbourhood.
• the Pathfinder method, i.e. the computation of a path between the current position of the agent and the previously selected target node.
In the HPCC strategy, the coordinator makes a decision by applying these two methods, which take as inputs the vector of true idlenesses and the vector of agent positions, as shown in Figure 2.2.

Markov-decision-process-based strategies

A Markov decision process (MDP) is a decision-making model for discrete-time stochastic control processes. An MDP is represented by a tuple $\langle S, Ac, P, R \rangle$, where:
• S is the set of states, namely for a single agent, the nodes of the graph G.
• Ac is the set of actions, consisting in moving among nodes.
• $P : S \times Ac \rightarrow K(S)$, where $K(S)$ is the credal set of $S$, i.e. the set of probability distributions over $S$, is the state transition function: for each state $s \in S$ and action $a \in Ac$, it returns a probability distribution $P(s, a)$, where $\forall s' \in S$, $P(s, a)(s') = P(s, a, s')$ is the probability of reaching state $s'$ from $s$ by applying action $a$.
• $R : S \times Ac \rightarrow \mathbb{R}$ is the immediate reward function, also known as the expected immediate reward.
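As an illustration, the tuple for a single-agent patrolling MDP on the graph G could be laid out as below. The deterministic transition (moving along a chosen edge always succeeds) and the idleness-based reward are simplifying assumptions made for this sketch, not part of the formal definition above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int    # a node of G (single-agent case)
Action = int   # the neighbouring node the agent moves to

@dataclass
class PatrolMDP:
    """Sketch of the tuple <S, Ac, P, R> for a single agent on graph G."""
    states: List[State]                  # S: the nodes of G
    actions: Dict[State, List[Action]]   # Ac(s): the neighbours of node s

    def transition(self, s: State, a: Action) -> Dict[State, float]:
        # P(s, a): distribution over next states; deterministic here (assumption)
        return {a: 1.0} if a in self.actions[s] else {s: 1.0}

    def reward(self, s: State, a: Action,
               idleness: Callable[[State], float]) -> float:
        # R(s, a): immediate reward; here the idleness of the reached node,
        # supplied by an external lookup (assumption)
        return idleness(a)
```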

Multiagent patrolling and reinforcement learning

Santana et al. were the first to address the MAP problem as a reinforcement learning (RL) problem [53]. They demonstrated that an efficient cooperative behaviour can be achieved by means of Q-learning, modelling MAP as a semi-Markov decision process (SMDP), which is a generalisation of MDPs in which actions may take a variable amount of time. In this model the area to patrol is still represented as a graph and, the model being an MDP, it is defined by the tuple $\langle S, Ac, P, R \rangle$ outlined above, where $\forall t \in \mathbb{N}^*$, the state $s_t$ corresponds to both the positions of all the agents and the idlenesses of all the nodes of G. Then, let:
• $\forall a \in A$, $p_a(t) \in V \cup E$ be the position of agent $a$ at time $t$.
• $\forall a \in A$, $\forall t \in \mathbb{N}^*$, $\forall p \in V \cup E$, $i_p(t)$ be the true idleness of $p$ at time $t$; if $p$ is an edge, then $i_p(t) = 0$, otherwise $i_p(t) = i_v(t)$ such that $p = v \in V$.
The idleness of $p_a(t)$ at time $t$, $i_{p_a(t)}(t)$, is then used as the reward function, and the multiagent policy learned by the Q-learning algorithm is the one which maximises the sum of the discounted rewards over a $T$-period simulation run, $\forall T \in \mathbb{N}$:
$$\sum_{t=0}^{T} \gamma^t \, i_{p_a(t)}(t) \qquad (2.3.9)$$
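A tabular Q-learning sketch for a single agent, with the idleness of the reached node as reward, might look as follows. The learning rate, the epsilon-greedy exploration and the environment interface (start, neighbours, step, idleness) are hypothetical choices made for this sketch, not the exact setting of Santana et al.

```python
import random
from collections import defaultdict

def q_learning_episode(env, q, gamma=0.9, alpha=0.1, eps=0.1, T=1000):
    """One T-step run of tabular Q-learning; env is a hypothetical interface
    exposing start(), neighbours(s), step(s, a) -> next state, and idleness(s)."""
    s = env.start()
    for t in range(T):
        actions = env.neighbours(s)
        # epsilon-greedy selection among the neighbouring nodes
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: q[(s, x)])
        s_next = env.step(s, a)
        r = env.idleness(s_next)   # reward: idleness of the node just reached
        best_next = max(q[(s_next, x)] for x in env.neighbours(s_next))
        # standard Q-learning update towards r + gamma * max_a' Q(s', a')
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = s_next
    return q

# Usage sketch: q = defaultdict(float); q_learning_episode(SomePatrolEnv(), q)
```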

Table of contents:

List of figures
List of tables
Nomenclature
1 Introduction 
1.1 Context and motivations
1.2 Overview
2 State of the art 
2.1 Multiagent patrolling
2.1.1 Environment
2.1.2 Society of agents
2.1.3 Performance measures
2.1.4 Discrete time model
2.2 Methodology
2.3 Classical strategies
2.3.1 Reactive strategies
2.3.2 Cognitive strategies
2.3.3 Graph-theory-based strategies
2.3.4 Markov-decision-process-based strategies
2.4 Machine-learning-based strategies
2.4.1 Bayesian learning
2.4.2 Neural-learning-based patrolling
2.5 Conclusion
3 Model, methodology and implementation 
3.1 Model and definitions
3.1.1 Model of the MAP problem
3.1.2 Definitions
3.2 Types of stationary strategies and structure of resultant data
3.3 Methodology
3.4 Implementation
3.4.1 PyTrol
3.4.2 MAPTrainer
3.4.3 MAPTor
3.5 Model strategy and databases
3.5.1 First database: HCC 0.2
3.5.2 Second database: HPCC 0.5
3.6 Conclusion
4 Path-Maker: a decentralised strategy based on node prediction 
4.1 Path-Maker
4.1.1 Path-Maker
4.1.2 RNN-Path-Maker: an implementation of Path-Maker
4.1.3 Deterministic-Path-Maker
4.1.4 Random-Path-Maker
4.2 Training procedure
4.2.1 Pretraining
4.2.2 Main training
4.3 Experiments and results
4.3.1 Conduct of experiments
4.3.2 Training settings
4.3.3 Training results
4.3.4 Simulation results
4.4 Conclusion
5 RAMPAGER: a strategy relying on structure-guided LSTM initialisation
5.1 Procedure of training
5.1.1 Structure-guided initialisation: an analytical initialisation
5.2 Experiments and results
5.2.1 Conduct of experiments
5.2.2 Preliminary experiments: selection of the LSTM setting
5.2.3 Training settings
5.2.4 Training results
5.2.5 Simulation results
5.3 Conclusion
6 Idleness estimator: a decentralised strategy based on idleness estimation
6.1 Idleness estimation
6.2 Decision-making based on idleness estimation
6.2.1 Deterministic approach
6.2.2 Drawback of the deterministic approach
6.2.3 Stochastic approach
6.3 Some statistical models for idleness estimation
6.4 Training procedure
6.5 Experiments and results
6.5.1 Training settings
6.5.2 Training results
6.5.3 Simulation results
6.6 Conclusion
7 Interacting Idleness Estimator: a strategy based on interaction 
7.1 Interacting Idleness Estimator
7.1.1 Peer-to-peer interaction
7.1.2 Transitive interaction
7.2 Experiments and results
7.2.1 Training results on HPCC 0.5 data
7.2.2 Simulation results
7.3 Conclusion
8 Conclusion 
References
