Single-Site Resource Utilisation, BoundedWorkers


Generalized Snapshot Isolation (GSI) and Prefix-Consistent Snapshot Isolation (PCSI)

SI was defined in [12] for single databases and requires transactions to read from the latest snapshot. Distributed databases need to synchronise to guarantee this requirement, and this synchronisation step can delay transaction execution. The delay also affects read-only transactions, compromising one of the main benefits of SI: that read-only transactions are never delayed or aborted.
Elnikety et al. [25] have extended the SI definition to overcome this issue. They propose GSI, an isolation level that allows transactions to read an “older” snapshot. In GSI, a read-only transaction is never delayed or aborted and does not cause update transactions to delay or abort. However, update transactions are more likely to abort: the older the snapshot in which they execute, the greater the risk that a concurrent write causes an abort. Another observation is that GSI, as defined, allows snapshots to be arbitrarily old and does not impose monotonicity of snapshots (i.e., that for any two consecutive snapshots provided by the same replica, the second is at least as fresh as the first).
This violates session guarantees. A further refinement is PCSI [25]: a transaction must read a snapshot that contains at least all locally-committed transactions. PCSI maintains the desirable properties of GSI, with the additional guarantee that a transaction sees at least the writes that have committed at the local replica. If a client always contacts the same replica, PCSI maintains session guarantees.
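To make the difference concrete, the following sketch illustrates the PCSI freshness rule under stated assumptions: GSI accepts an arbitrarily old snapshot, whereas PCSI requires the snapshot to contain every transaction already committed at the local replica. The Snapshot representation and the method names are hypothetical, not taken from [25].

    import java.util.List;

    public class PcsiSnapshotCheck {

        /** A snapshot is identified here by the set of transaction ids it contains (assumed model). */
        record Snapshot(List<Long> committedTxIds) {
            boolean contains(long txId) {
                return committedTxIds.contains(txId);
            }
        }

        /**
         * GSI accepts any snapshot, however old (and does not enforce monotonicity).
         * PCSI additionally requires the snapshot to contain every locally committed transaction.
         */
        static boolean acceptableUnderPcsi(Snapshot candidate, List<Long> locallyCommitted) {
            for (long txId : locallyCommitted) {
                if (!candidate.contains(txId)) {
                    return false; // the candidate misses a local commit: too old for PCSI
                }
            }
            return true;
        }

        public static void main(String[] args) {
            List<Long> locallyCommitted = List.of(1L, 2L, 3L);
            Snapshot stale = new Snapshot(List.of(1L, 2L));     // acceptable under GSI only
            Snapshot fresh = new Snapshot(List.of(1L, 2L, 3L)); // acceptable under PCSI as well
            System.out.println(acceptableUnderPcsi(stale, locallyCommitted)); // false
            System.out.println(acceptableUnderPcsi(fresh, locallyCommitted)); // true
        }
    }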

Parallel Snapshot Isolation (PSI)

More recently, Sovran et al. have introduced PSI [60], an isolation level similar to SI that allows the system to replicate transactions asynchronously. It is designed for geo-replication and does not require a global ordering of transactions. The key observation is that SI imposes a global order on all snapshots. In geo-replicated systems this is expensive, because it forces transactions to coordinate at commit time even when they do not conflict. PSI allows different commit orderings at different sites while preserving causal ordering [39]: if a transaction T2 reads from T1, then T1 is ordered before T2 at every site. The idea is to provide SI inside a datacenter and relax SI to PSI across datacenters.
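As an illustration of the PSI ordering rule (with assumed names and data structures, not the formalism of [60]), the sketch below checks that a per-site commit order respects the read-from dependencies: sites may order independent transactions differently, but a transaction must follow everything it read from.

    import java.util.List;
    import java.util.Map;

    public class PsiCausalOrderCheck {

        /** readsFrom maps each transaction to the transactions it read from. */
        static boolean respectsCausality(List<String> siteCommitOrder,
                                         Map<String, List<String>> readsFrom) {
            for (int pos = 0; pos < siteCommitOrder.size(); pos++) {
                String tx = siteCommitOrder.get(pos);
                for (String dependency : readsFrom.getOrDefault(tx, List.of())) {
                    int depPos = siteCommitOrder.indexOf(dependency);
                    // Everything tx read from must already be committed at this site.
                    if (depPos == -1 || depPos > pos) {
                        return false;
                    }
                }
            }
            return true;
        }

        public static void main(String[] args) {
            Map<String, List<String>> readsFrom = Map.of("T2", List.of("T1")); // T2 reads from T1
            // T3 is independent, so the two sites may order it differently under PSI,
            // whereas SI would force a single global order of snapshots.
            System.out.println(respectsCausality(List.of("T1", "T3", "T2"), readsFrom)); // true
            System.out.println(respectsCausality(List.of("T3", "T2", "T1"), readsFrom)); // false
        }
    }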

Active Replication vs Passive Replication

There are two main approaches to update dissemination: active replication, also called state machine replication [39, 47, 54], and passive replication, also called primary-backup replication [14, 19]. In active replication, all replicas execute all operations in the same order. The underlying assumption is that transactions are deterministic. A transaction can be non-deterministic itself (e.g., because some operation depends on the current timestamp) or have a non-deterministic execution (due to the DBMS or the underlying operating system). In the latter case, achieving determinism can be hard, especially in multithreaded or distributed concurrent systems.
In passive replication, an update is performed at a single distinguished site, usually called the “primary” site; the resulting state updates are then sent to the other replicas, which apply them to their own state. This removes the requirement for determinism, because a transaction is executed only once and every replica receives and applies the same state updates. Passive replication approaches can be further divided into single-primary and multi-primary. In single-primary systems, only one replica may process update transactions, whereas in multi-primary systems several (or all) replicas can perform updates. These approaches are discussed in Section 2.2.2.3.
The trade-off between the two approaches is a balance between computational resources and bandwidth usage. Active replication generally uses more computational resources than passive replication, because it executes all operations (i.e., the transaction code) at each replica. On the other hand, if the updated state is large, passive replication consumes more bandwidth, because it broadcasts the state updates of all transactions. Another consideration is the efficiency and scalability of the two approaches. In active replication, increasing the number of replicas does not improve update throughput, because update transactions are executed identically at every site. In passive replication, increasing the number of replicas can improve update throughput (up to a saturation point) if executing a transaction takes longer than applying its write-set, as is usually the case.
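The following sketch contrasts the two styles on a toy key-value replica; the Replica and Tx types are illustrative assumptions, not part of any cited system. Active replication re-executes the transaction code at every replica, while passive replication executes it once at the primary and ships only the write-set.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    public class ReplicationStyles {

        /** A replica's state, reduced to a key-value map for illustration. */
        static class Replica {
            final Map<String, Integer> state = new HashMap<>();
        }

        /** A transaction reads the current state and produces a write-set (key -> new value). */
        interface Tx extends Function<Map<String, Integer>, Map<String, Integer>> {}

        // Active replication: every replica re-executes the (deterministic) transaction code.
        static void activeReplicate(List<Replica> replicas, Tx transaction) {
            for (Replica r : replicas) {
                r.state.putAll(transaction.apply(r.state));
            }
        }

        // Passive replication: the primary executes once; backups only apply the write-set,
        // so non-deterministic execution cannot make replicas diverge.
        static void passiveReplicate(Replica primary, List<Replica> backups, Tx transaction) {
            Map<String, Integer> writeSet = transaction.apply(primary.state);
            primary.state.putAll(writeSet);
            for (Replica b : backups) {
                b.state.putAll(writeSet);
            }
        }

        public static void main(String[] args) {
            Replica primary = new Replica();
            Replica backup = new Replica();
            primary.state.put("x", 1);
            backup.state.put("x", 1);
            Tx increment = s -> Map.of("x", s.getOrDefault("x", 0) + 1); // deterministic example
            passiveReplicate(primary, List.of(backup), increment);
            System.out.println(primary.state + " " + backup.state); // {x=2} {x=2}
        }
    }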

Eager Replication vs Lazy Replication

Another critical factor for execution latency in the update dissemination strategy is whether replicas are updated as part of the transaction (i.e., synchronously) or asynchronously, off the critical path. These strategies are known as eager replication and lazy replication, respectively. Eager replication protocols update all replicas as part of the transaction commit: when a transaction commits, all replicas hold the same values. Eager replication has the advantage that all replicas are up to date at transaction commit; they are strongly consistent and exhibit no anomalies. The drawback is that any multi-primary (also called update-anywhere) eager replication approach requires distributed locking and two-phase commit [37]. Furthermore, under eager replication the probability of deadlock is proportional to the third power of the number of replicas [29].
Lazy replication protocols update replicas asynchronously, after the transaction commits. This improves efficiency and scalability, but gives up consistency. Since updates are propagated after transaction commit, replicas may hold different values at a given time. This inconsistency can confuse users and applications, because they are exposed to anomalies.
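A minimal sketch of the trade-off, under assumed names and a toy key-value state: an eager commit returns only after every replica has applied the write-set, while a lazy commit returns immediately and propagates in the background, so replicas may briefly diverge.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class EagerVsLazy {

        static class Replica {
            final Map<String, Integer> state = new ConcurrentHashMap<>();
            void apply(Map<String, Integer> writeSet) { state.putAll(writeSet); }
        }

        private final List<Replica> replicas;
        private final ExecutorService propagator = Executors.newSingleThreadExecutor();

        EagerVsLazy(List<Replica> replicas) { this.replicas = replicas; }

        /** Eager: every replica holds the new values by the time commit returns. */
        void eagerCommit(Map<String, Integer> writeSet) {
            replicas.forEach(r -> r.apply(writeSet));
        }

        /** Lazy: commit returns immediately; replicas converge later and may briefly diverge. */
        void lazyCommit(Map<String, Integer> writeSet) {
            propagator.submit(() -> replicas.forEach(r -> r.apply(writeSet)));
        }

        public static void main(String[] args) throws InterruptedException {
            Replica r1 = new Replica();
            Replica r2 = new Replica();
            EagerVsLazy db = new EagerVsLazy(List.of(r1, r2));
            db.eagerCommit(Map.of("x", 1)); // both replicas see x=1 before this call returns
            db.lazyCommit(Map.of("x", 2));  // replicas see x=2 only once propagation completes
            db.propagator.shutdown();
            db.propagator.awaitTermination(1, TimeUnit.SECONDS);
            System.out.println(r1.state + " " + r2.state); // {x=2} {x=2} after propagation
        }
    }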

Approaches to Scale-up Replicated Databases

In this section we survey systems that use the replication strategies discussed earlier (e.g., primary-copy, update-anywhere, lazy replication, active and passive replication), and discuss how they relate to our system.
H-Store [63] is a DBMS designed and optimised for OLTP applications. It requires the complete workload to be specified in advance as statically defined stored procedures. This advance knowledge allows H-Store to partition and parallelise the load between single-threaded replicas operating in a shared-nothing environment. Under this assumption, H-Store improves performance by orders of magnitude compared to other commercial databases. Gargamel also parallelises the load between single-threaded replicas, but uses an unmodified DBMS. Furthermore, Gargamel requires only approximate knowledge of the workload, encapsulated in the conflict classifier, and does not require the whole workload to be specified in advance.
The system of Pacitti et al. [46] is a lazy multi-master replicated database system. It enforces a total order of transactions by using reliable multicast and a timestamp-based message ordering mechanism. Their system avoids conflicts at the expense of a forced waiting time for transactions, and relies on a fast cluster network to keep this waiting time small. Gargamel, in contrast, is designed for geo-replication: in order to scale over a Wide Area Network (WAN), it enforces only a partial order of transactions and synchronises optimistically among distant sites.
Ganymed [48] and Multimed [50] centralise updates. In contrast, Gargamel is able to parallelise at least some update transactions; it therefore does not need a master replica, which would constitute a scalability bottleneck.
Sarr et al. [51] introduce a solution for transaction routing in a grid. Their system, like Gargamel, is conflict-aware. However, they check for conflicts only in order to propagate updates among replicas in a consistent way; they do not serialise conflicting transactions, as Gargamel does. A conflict-aware scheduling system is proposed by Amza et al. [4, 5]. Their system ensures 1-Copy Serialisability (1-Copy SER) by executing all update transactions at all replicas in a total order. Gargamel parallelises non-conflicting write transactions and transmits the write-sets off the critical path. Moreover, Gargamel executes a given transaction only once, at a single replica, which ensures that replicas do not diverge in the presence of non-determinism.

Distributed Scheduling and Collision Resolution

Schedulers communicate asynchronously with one another and are loosely synchronised. A scheduler can receive a transaction either from a local client (we call this a home transaction) or from another scheduler (a remote transaction). First consider a home transaction. The scheduler runs the normal scheduling algorithm described in Section 3.3, without a priori synchronisation with the other schedulers.
Once it has scheduled the transaction, the scheduler forwards it to all the other sites (asynchronously and reliably) along with its scheduling position; schedulers use reliable FIFO broadcast to propagate transactions among themselves. Once the transaction commits at the local database, its write-set is propagated to all other nodes (at both local and remote sites). The latter apply the write-set without re-executing the transaction [47]. Thus, although (as we shall see) a given transaction is scheduled at all sites, it nonetheless consumes computation resources at a single site only. This avoids duplicate work and divides the load (Section 4.3.3).
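The sketch below shows one simple way to realise per-sender FIFO delivery with sequence numbers; it is an assumption-laden illustration, not Gargamel's actual transport layer. Messages from the same scheduler are delivered in send order, while messages from different schedulers remain independent.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.function.Consumer;

    public class FifoDelivery {

        private final Map<String, Long> nextExpected = new HashMap<>();
        private final Map<String, TreeMap<Long, String>> pending = new HashMap<>();

        /** Deliver messages from each sender strictly in sequence-number order. */
        void onReceive(String sender, long seq, String payload, Consumer<String> deliver) {
            pending.computeIfAbsent(sender, s -> new TreeMap<>()).put(seq, payload);
            long expected = nextExpected.getOrDefault(sender, 0L);
            TreeMap<Long, String> queue = pending.get(sender);
            // Flush the longest in-order prefix that is now available.
            while (queue.containsKey(expected)) {
                deliver.accept(queue.remove(expected));
                expected++;
            }
            nextExpected.put(sender, expected);
        }

        public static void main(String[] args) {
            FifoDelivery fifo = new FifoDelivery();
            Consumer<String> deliver = m -> System.out.println("deliver " + m);
            fifo.onReceive("schedulerA", 1, "T2", deliver); // buffered, waiting for seq 0
            fifo.onReceive("schedulerA", 0, "T1", deliver); // delivers T1, then the buffered T2
        }
    }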
Algorithm 2 describes the scheduling protocol in pseudocode. Logically, it can be divided into two parts: scheduleLocal(), called when a client sends a SCHEDULE message to its local scheduler, and scheduleRemote(), called when a scheduler sends a SCHEDULE_REMOTE message, together with its local dependencies, to the other schedulers. The calculateBackDependencies() function (which computes whether the incoming transaction conflicts with a transaction in the local graph) and getExecutingWorker() (which selects the worker that will execute the transaction) are the same as in Algorithm 1.
The variables of the scheduling protocol (omitted in the pseudocode for space reasons) are localBackDependencies and remoteBackDependencies: the lists of all the conflicting transactions in the local and remote graphs, respectively. The localBackDependencies list is sent along with the SCHEDULE_REMOTE message. The other variables (transactionQueue, graphHeads, targetWorker and precedingWorker) are as in Algorithm 1. We recall that transactionQueue is a queue used for breadth-first exploration of the graph; graphHeads is a hash set containing all the “heads” of the graph, i.e., transactions without any front dependencies; targetWorker is the ID of the worker chosen to execute the transaction; and precedingWorker is a temporary variable used to iterate over worker candidates for executing the incoming transaction.
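To make the structure of Algorithm 2 concrete, the following condensed sketch (with hypothetical names and a simplistic read/write-set overlap test standing in for the conflict classifier and for getExecutingWorker()) shows the two entry points: scheduleLocal() computes the local back dependencies, which would be broadcast with the SCHEDULE_REMOTE message, and scheduleRemote() merges the received dependencies with any conflicts found in the local graph.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class DistributedScheduler {

        record Transaction(String id, Set<String> readSet, Set<String> writeSet) {}

        private final Map<String, Transaction> scheduled = new HashMap<>();  // transactions in the local graph
        private final Map<String, List<String>> backDeps = new HashMap<>();  // id -> conflicting predecessors
        private final Map<String, Integer> workerOf = new HashMap<>();       // id -> worker chosen to execute it
        private int nextWorker = 0;

        /** SCHEDULE: a home transaction arrives from a local client. */
        List<String> scheduleLocal(Transaction t) {
            List<String> localBackDependencies = calculateBackDependencies(t);
            insert(t, localBackDependencies);
            // In the real protocol, this list is sent to the other schedulers with the
            // SCHEDULE_REMOTE message over reliable FIFO broadcast.
            return localBackDependencies;
        }

        /** SCHEDULE_REMOTE: a transaction already scheduled at another site, with its dependencies. */
        void scheduleRemote(Transaction t, List<String> remoteBackDependencies) {
            List<String> deps = new ArrayList<>(remoteBackDependencies);
            deps.addAll(calculateBackDependencies(t)); // add conflicts found in the local graph
            insert(t, deps);
        }

        private List<String> calculateBackDependencies(Transaction t) {
            List<String> deps = new ArrayList<>();
            for (Transaction other : scheduled.values()) {
                if (conflicts(t, other)) deps.add(other.id());
            }
            return deps;
        }

        private void insert(Transaction t, List<String> deps) {
            scheduled.put(t.id(), t);
            backDeps.put(t.id(), deps);
            // Simplified worker choice: serialise after the last conflicting predecessor's
            // worker if there is one, otherwise start a fresh worker.
            Integer previous = deps.isEmpty() ? null : workerOf.get(deps.get(deps.size() - 1));
            workerOf.put(t.id(), previous != null ? previous : nextWorker++);
        }

        /** Write/write or read/write overlap, standing in for the conflict classifier. */
        private boolean conflicts(Transaction a, Transaction b) {
            return intersects(a.writeSet(), b.writeSet())
                || intersects(a.writeSet(), b.readSet())
                || intersects(a.readSet(), b.writeSet());
        }

        private static boolean intersects(Set<String> x, Set<String> y) {
            return x.stream().anyMatch(y::contains);
        }

        public static void main(String[] args) {
            DistributedScheduler scheduler = new DistributedScheduler();
            Transaction t1 = new Transaction("T1", Set.of(), Set.of("x"));
            Transaction t2 = new Transaction("T2", Set.of("x"), Set.of("y"));
            scheduler.scheduleLocal(t1);             // home transaction, would be broadcast as SCHEDULE_REMOTE
            scheduler.scheduleRemote(t2, List.of()); // remote transaction, found to conflict with T1 locally
            System.out.println(scheduler.backDeps);  // T2 depends on T1, T1 has no dependencies
        }
    }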

Table of contents:

Part I – Thesis 
1 Introduction 
1.1 Problem Statement
1.2 Research Proposal and Contributions
1.3 Outline of the Thesis
2 State of the art 
2.1 Introduction
2.2 Distributed Databases
2.2.1 Consistency Properties
2.2.2 Replication Strategies
2.2.3 Concurrency Control
2.2.4 NoSQL
2.3 Approaches to Scale-up Replicated Databases
2.4 Conclusion
3 Single-Site Gargamel 
3.1 Introduction
3.2 System Architecture
3.2.1 System Model
3.2.2 Scheduler
3.2.3 Classifier
3.2.4 Nodes
3.3 Scheduling Algorithm
3.3.1 Isolation Levels
3.3.2 Correctness
3.4 Conclusion
4 Multi-Site Gargamel 
4.1 Introduction
4.2 System Architecture
4.3 Distributed Scheduling and Collision Resolution
4.3.1 Collisions
4.3.2 Collision and Synchronisation Protocol
4.3.3 Determinism and Duplicate Work
4.4 Fault Tolerance
4.4.1 Scheduler Failure
4.4.2 Node Failure
4.5 Conclusion
5 Simulation 
5.1 Simulation Model
5.1.1 TPC-C
5.1.2 TPC-E
5.1.3 Round-Robin, Centralised-Writes and Tashkent+ Simulation
5.2 Simulation Results
5.2.1 Single-Site Performance
5.2.2 Single-Site Resource Utilisation, BoundedWorkers
5.2.3 Single-Site Resource Utilisation, UnboundedWorkers
5.2.4 Multi-Site Gargamel
5.2.5 Adaptation to The Number Of Workers
5.3 Conclusion
6 Gargamel Implementation 
6.1 TPC-C Client Implementation
6.2 Scheduler Implementation
6.3 Node Implementation
6.3.1 Certification Algorithm
6.4 Classifier Implementation
7 Gargamel Evaluation 
7.1 Single Site Evaluation
7.1.1 Benefits of the Pessimistic Approach
7.1.2 Benefits of the Passive Replication Approach
7.1.3 Resource Utilisation
7.2 Multi-Site Evaluation
7.2.1 Benefits of the Multi-Site Deployment
7.2.2 Impact of Database Performance
7.2.3 Impact of collisions
7.3 Conclusion
8 Conclusion 
8.1 Summary
8.2 Perspectives
Part II – Summary of the Thesis (in French) 
9 Summary of the Thesis 
9.1 Introduction
9.1.1 Problem Statement
9.1.2 Contributions: Gargamel
9.2 Single-Site Gargamel
9.2.1 System Architecture
9.2.2 System Model
9.2.3 Conclusion
9.3 Multi-Site Gargamel
9.3.1 Introduction
9.3.2 System Architecture
9.3.3 Conclusion
9.4 Simulation
9.5 Gargamel Implementation
9.6 Conclusion
Bibliography 
