Causally-Consistent Object Database for Client-Side Applications

Get Complete Project Material File(s) Now! »

System Model and Object Implementation

In this section, we formalize an implementation model of Replicated Data Type. We use the model of Burckhardt et al. [34] that can express existing implementations [16, 32, 85, 92]. consider a replicated database that hosts one or more named objects on a set of replicas. We assume that the database implements each object independently. In other words, a database consists of a single object, without loss of generality.4 The type signature of an object defines its interface. An object implementation defines its behavior, including its replication protocol.
An application has multiple processes that run client sessions. Each session issues a sequence of operations to a replica by calling methods; for simplicity, we assume that every session matches exactly one logical replica. A database replica delegates method calls to the corresponding object implementation. It responds by performing a local computation step. Local processing ensures that all operations are highly available and responsive. Replicas communi- cate in the background via a message passing network layer to exchange updates. An object implementation includes functions that produce and apply messages.

Specification of Intended Behavior

The primary expectation behind any replicated object is convergence. Category-specific convergence conditions can be successfully applied to implementations [92], including the examples from the previous section. However, convergence alone is not sufficient. Towards what value the object converges, and what are the intermediate states and their values, is also important. For instance, an increment-only counter that provably converges towards 0, or occasionally returns value 0 before it converges, does not provide the intended behavior.
A precise specification of type semantics can act as a contract between an application and a data type implementation. There can be multiple implementations of the same type, including different optimizations or models (e.g., Algorithm 2.2 vs. Algorithm 2.3). It would be natural to express that these implementations behave in the same way. Implementation is often not an effective way of expressing semantics, since it requires considering low-level details of implementation and execution (consider, for example, Algorithm 2.5, or an execution from Figure 2.4).
In this section, we define and illustrate a declarative RDT specification model, after Burckhardt et al. [34]. This form of specification offers a unified and concise representation of RDT semantics, that embodies the convergence condition.

Implementation Correctness

The semantics of an implementation can be characterized by all of its executions.
Definition 2.7 (Implementation Semantics). The semantics of an implementation Dø, noted ÇDøÉ, is the set of all its concrete executions that start in the empty configuration (R0,M0). Some implementations (or categories of implementations) may not provide a meaningful behavior under all executions, but only when certain network layer conditions are met. For example, the op-based counter from Algorithm 2.2 requires operations to be delivered exactly once, rather than an arbitrary number of times. Similarly, the LWW register may require a condition on the order of supplied timestamps. We express such restrictions as a network layer specification, noted T, a set of allowed concrete executions defined by a condition on concrete executions.9 Therefore, when considering the correctness of an implementation, we will reason about ÇDøÉ\T, i.e., the semantics of an implementation Dø under network specification T.
In order to state a correctness condition, we would like to relate each concrete execution with a correct witness abstract execution, i.e., to find a correct representation of a concrete execution in the specification domain. Both types of executions, given by Definitions 2.2 and 2.6, share a similar structure, and most components of a witness execution can be directly extracted from the concrete execution. Thus, we define a witness execution for concrete execution C 2 ÇDøÉ\T as: (2.12) abs(C,V) = (C.E|do, E.replica|do, E.op|do, E.rval|do, ro(C)|do, V(C), ar(C)).

Prior Implementations and Unsuccessful Optimizations

A number of other important data types exist, for which we did not manage to find a correct optimization. Some of them provide a similar behavior to add-wins set and multi-value register, and could act as their replacement. We report on their complexity here, which motivates some of the questions regarding optimality that we address in Chapter 4.

Remove-Wins Set

Given that implementations of the add-wins set have noticeable overhead, it is natural to consider other semantics of sets. Algorithm 3.5 is an implementation of remove-wins set semantics FRWSet (Equation 2.10) by Bieniusa et al. [22].
The implementation is somewhat similar to the naive add-wins set, as it keeps track of every add and rem operations in two sets: a set of add instances A with pairs (a, t), where a is an element and t is a timestamp, and a set of remove instances T with the same type of pairs. The two sets are disjoint. The read method returns an element a if there is an add(a) instance and no rem(a) instance. The rem method simply creates a new remove instance. The add operation turns any remove instances of the element into add instances, or, if there is none, creates a new add instance. The replication protocol is similar to the naive add-wins set.
To our surprise, the overhead of the remove-wins implementation is even higher than that of the naive add-wins implementation. Recall that the naive add-wins implementation stored only timestamps of removed elements, not their value. The implementation of remove-wins semantics must store also the value of removed elements, because removes must dominate concurrent adds unaware of their timestamps. Since removed elements are of variable, unbounded size, there is no upper bound w.r.t. m or n.4 Formally, we can give only an underestimated lower bound.

Background Compaction of Stable Metadata

Our definition of maximum metadata overhead (Definition 3.1) concerns the worst situation in an execution. This worst case may involve relatively short-lived peaks in metadata and only a subset of replicas. This perspective is important as it models what could happen under high concurrency or failure scenarios. For instance, the case where a set of replicas is partitioned away, or a replica becomes unresponsive. It shows the capacity that the system should be planned for. Nevertheless, there exist RDT implementations that can compact, or specifically garbage collect (GC), metadata that is not useful anymore. In particular, stable updates [56, 93] can often be collected. An update is stable if it is known to be replicated at all replicas. For example, in a set implementation, a stable tombstone can be discarded. When all replicas have received a given tombstone instance, then it is guaranteed that no further message will include the tombstone, except for a possibly delayed message. If furthermore, the implementation protects itself from delayed messages, a replica can discard the stable tombstone safely. Johnson and Thomas [56] use this protocol to discard timestamps of removed entries in a LWW map, whereas Wuu and Bernstein [102] use it for a map where elements are guaranteed to be added no more than once. A list implementation by Roh et al. [85] also uses stability to collect removed list elements.
A stability protocol computes stable updates that can be compacted [51, 56, 102]. The stability protocol maintains information about the set of updates that each replica has received. Such information is typically encoded as a one-dimensional vector or a two-dimensional matrix. Stability-based metadata compaction has its own drawbacks. The data structures of the stability protocol incur an overhead on their own [56, 102]. It takes time until stability is detected by the protocol. Finally, stability protocol is not live in the presence of failures, since it requires to communicate with every replica. Our model assumes a static or grow-only set of replicas and does not explicitly consider their failures. However, even under a model that explicitly considers dynamic set of replicas and their failures, it requires a perfect failure detector to safely eliminate an unresponsive replica from the set of correct replicas that must acknowledge an update.1 In contrast, with our metadata design from Section 3.2.1, an implementation discards information right away, independent of network conditions, and does not require additional GC. Nevertheless, for some data types (e.g., the remove-wins set or the list), our optimizations are not applicable and the GC approach is the only way to decrease metadata size.


Table of contents :

List of Tables
List of Figures
I Preliminaries 
1 Introduction 
1.1 Contributions
1.1.1 Optimality of Replicated Data Types
1.1.2 Causally-Consistent Object Database for Client-Side Applications
1.2 Organization
1.3 Authorship and Published Results
2 Replicated Data Types 
2.1 Motivation
2.2 System Model and Object Implementation
2.2.1 Object Implementation Model
2.2.2 Replication
2.2.3 Examples Counter Implementations Register Implementations
2.3 Specification of Intended Behavior
2.3.1 Specification Model
2.3.2 Examples Counter Specification Register Specifications Set Specifications
2.4 Execution Model and Correctness
2.4.1 Execution Model
2.4.2 Implementation Correctness
2.4.3 Implementation Categories Network Specifications Visibility Witnesses Main Categories
II Optimality of Replicated Data Types 
3 Metadata Space Complexity Problem 
3.1 Problem Statement
3.2 Optimizing Implementations
3.2.1 Successful Optimizations Add-Wins Set Multi-Value Register
3.2.2 Prior Implementations and Unsuccessful Optimizations Last-Writer-Wins Register Remove-Wins Set Last-Writer-Wins Set Counter
3.3 Summary
4 Lower Bounds on Complexity and Implementation Optimality 
4.1 Proof Technique
4.1.1 Experiment Family
4.1.2 Driver Programs
4.2 Lower Bounds
4.2.1 Counter
4.2.2 Add-Wins Set
4.2.3 Remove-Wins Set
4.2.4 Last-Writer-Wins Set
4.2.5 Multi-Value Register
4.2.6 Last-Writer-Wins Register
4.3 Summary
5 Related Work and Discussion 
5.1 Other Data Types
5.2 State-Based Optimizations Beyond Our Metric
5.2.1 Background Compaction of Stable Metadata
5.2.2 Finer-Grained Optimizations
5.2.3 Custom Timestamps
5.3 Other Implementation Categories
5.3.1 State-Based Implementations With Smaller Messages
5.3.2 Optimizations Based on Topology Restrictions and Delayed Visibility
5.3.3 Replicated File Systems
5.4 Lower Bound Proofs in Distributed Computing
III Causally-Consistent Object Database for Client-Side Applications 
6 Problem Overview 
6.1 System Model and Basic Requirements
6.2 Consistency with Convergence
6.2.1 Causal Consistency
6.2.2 Convergence with Replicated Data Types
6.3 Application Programming Interface
6.4 Challenge
6.4.1 Metadata Design
6.4.2 Causal Consistency with Partial Replication is Hard
7 The SwiftCloud Approach 
7.1 Design
7.1.1 Causal Consistency at Full Data Center Replicas
7.1.2 Causal Consistency at Partial Client Replicas
7.1.3 Failing Over: The Issue with Transitive Causal Dependency Conservative Read: Possibly Stale, But Safe Discussion
7.2 Implementation
7.2.1 Timestamps, Vectors and Log Merge
7.2.2 Protocols State Client-Side Execution Transfer Protocol: Client to Data Center Geo-replication Protocol: Data Center to Data Center Notification Protocol: Data Center to Client
7.2.3 Object Checkpoints and Log Pruning Log Pruning in the Data Center Pruning the Client’s Log
8 Experimental Evaluation 
8.1 Prototype and Applications
8.2 Experimental Setup
8.3 Experimental Results
8.3.1 Response Time and Throughput
8.3.2 Scalability
8.3.3 Tolerating Client Churn
8.3.4 Tolerating Data Center Failures
8.3.5 Staleness Cost
9 Related Work 
9.1 Consistency Models for High Availability
9.2 Relevant Systems
9.2.1 Replicated Databases for Client-Side Applications Systems that Support Inter-Object Consistency Systems that Support Intra-Object Consistency Only Session Guarantees
9.2.2 Geo-replicated Databases for Server-Side Applications Approaches Comparison and Applicability to Client-Side Replication
10 Conclusion 
10.1 Summary
10.2 Limitations and Perspectives
IV Appendix 
A Additional Material on Replicated Data Types 
A.1 Formal Network Layer Specifications
A.2 Optimized Op-Based Implementations
B Metadata Overhead Proofs 
B.1 Standard Encoding
B.2 Metadata Overhead of Specific Implementations
B.3 Lower Bound Proofs
B.3.1 Add-Wins Set
B.3.2 Remove-Wins Set
B.3.3 Last-Writer-Wins Set
B.3.4 Multi-Value Register
B.3.5 Last-Writer-Wins Register
C Résumé de la thèse 
C.1 L’optimalité des types les données répliquées
C.2 Une base de données causalement cohérente pour les applications coté client
C.2.1 Présentation du problème
C.2.1.1 La cohérence et la convergence
C.2.1.2 La conception des métadonnées
C.2.1.3 La cohérence causale avec une réplication partielle est dur
C.2.2 L’approche SwiftCloud
C.2.2.1 Cohèrence causale dans les répliques complètes des Centre de Données
C.2.2.2 Cohèrence causale dans les répliques client partielles
C.2.2.3 Le basculement sur erreur: Le problème avec la dépendance causale transitive
C.2.2.4 Protocoles avec les métadonnées découplées et délimitées .
C.2.3 La mise en oeuvre et l’évaluation


Related Posts