Causally-Consistent Object Database for Client-Side Applications 

System Model and Object Implementation

In this section, we formalize an implementation model of Replicated Data Types. We use the model of Burckhardt et al. [34], which can express existing implementations [16, 32, 85, 92].
Figure 2.2 gives an informal overview of the system components involved in our model. We consider a replicated database that hosts one or more named objects on a set of replicas. We assume that the database implements each object independently; in other words, without loss of generality, a database consists of a single object. The type signature of an object defines its interface. An object implementation defines its behavior, including its replication protocol. An application has multiple processes that run client sessions. Each session issues a sequence of operations to a replica by calling methods; for simplicity, we assume that every session matches exactly one logical replica. A database replica delegates method calls to the corresponding object implementation, which responds by performing a local computation step. Local processing ensures that all operations are highly available and responsive. Replicas communicate in the background via a message-passing network layer to exchange updates. An object implementation includes functions that produce and apply messages.
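To fix intuitions before the formal definitions, the following minimal sketch (in Python, with invented names such as initialize, do, send, and receive) shows the shape such an object implementation might take: a local computation step for method calls, plus the functions that produce and apply replication messages. It is an illustration of the model, not the definition given below.

```python
# Sketch of the shape of an object implementation in this model.
# Names (initialize, do, send, receive) are illustrative; the formal
# definition of an implementation class follows in the text.

from typing import Any, Protocol, Tuple

class ObjectImplementation(Protocol):
    def initialize(self, replica_id: str) -> Any:
        """Initial local state of a fresh replica."""
        ...

    def do(self, state: Any, op: Any, timestamp: int) -> Tuple[Any, Any]:
        """Local computation step for a method call: returns
        (return value, updated state); no communication involved."""
        ...

    def send(self, state: Any) -> Tuple[Any, Any]:
        """Produce a message for the network layer: (message, updated state)."""
        ...

    def receive(self, state: Any, message: Any) -> Any:
        """Apply a message received from another replica."""
        ...
```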

Object Implementation Model

The type τ of an object comprises a type signature tuple (Op_τ, Val_τ), which defines the client interface as a set of methods Op_τ (ranged over by o) and a set of return values Val_τ (ranged over by v). For simplicity, all methods share the same return value domain, with a special value ⊥ ∈ Val_τ for methods that do not return any value. Without loss of generality, we assume that every type has a single side-effect-free query method read ∈ Op_τ, which returns the value of the object. The remaining methods Op_τ \ {read} are called update methods and are assumed to have side effects. For example, a counter type, noted Ctr, has a signature (Op_Ctr, Val_Ctr). The methods Op_Ctr = {inc, read}, respectively, increment and read the counter value; read returns a natural number, Val_Ctr = ℕ ∪ {⊥}. Another example is an integer register type, noted LWWReg, with methods Op_LWWReg = {write(a) | a ∈ ℤ} ∪ {read}. The latter returns integer values, Val_LWWReg = ℤ ∪ {⊥}. Note that, for simplicity, we model the arguments of a method as part of the method name.

An object is instantiated from an implementation class of some type, to be defined shortly. An object is hosted on a set of database replicas ReplicaID (ranged over by r). We assume that the database can provide the object implementation with unique timestamps from a domain Timestamp (ranged over by t), totally ordered by <. Scalar clocks, such as Lamport’s logical clock [64], or a real-time clock concatenated with a unique replica ID, can serve this purpose. For simplicity, we often assume integer timestamps.
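As a concrete illustration of the Ctr signature under the interface sketched above, here is a minimal state-based counter using the standard vector-of-increments construction; it is a sketch under our own naming, not one of the algorithms presented later. inc returns ⊥ (rendered as None) and read returns a natural number; timestamps are ignored, since the counter does not use arbitration.

```python
# A minimal state-based counter illustrating the Ctr signature:
# Op_Ctr = {inc, read}, Val_Ctr = N ∪ {⊥} (⊥ rendered as None).
# Standard vector-of-increments construction; a sketch only.

class StateBasedCounter:
    def initialize(self, replica_id):
        # Per-replica increment counts; the replica's own ID is kept
        # so that inc knows which entry to bump.
        return {"id": replica_id, "counts": {}}

    def do(self, state, op, timestamp):
        counts = state["counts"]
        if op == "inc":
            counts[state["id"]] = counts.get(state["id"], 0) + 1
            return None, state               # inc returns ⊥
        elif op == "read":
            return sum(counts.values()), state
        raise ValueError(f"unknown method {op}")

    def send(self, state):
        # Ship a copy of the whole state (state-based replication).
        return {"counts": dict(state["counts"])}, state

    def receive(self, state, message):
        # Merge = pointwise maximum of per-replica increment counts.
        for r, c in message["counts"].items():
            state["counts"][r] = max(state["counts"].get(r, 0), c)
        return state
```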

Specification of Intended Behavior

The primary expectation behind any replicated object is convergence. Category-specific convergence conditions can be successfully applied to implementations [92], including the examples from the previous section. However, convergence alone is not sufficient. Towards what value the object converges, and what the intermediate states and their values are, also matters. For instance, an increment-only counter that provably converges towards 0, or that occasionally returns 0 before it converges, does not provide the intended behavior.
A precise specification of type semantics can act as a contract between an application and a data type implementation. There can be multiple implementations of the same type, with different optimizations or models (e.g., Algorithm 2.2 vs. Algorithm 2.3). It would be natural to express that these implementations behave in the same way. An implementation itself is often not an effective way of expressing semantics, since understanding it requires considering low-level details of implementation and execution (consider, for example, Algorithm 2.5, or the execution from Figure 2.4).
In this section, we define and illustrate a declarative RDT specification model, after Burckhardt et al. [34]. This form of specification offers a unified and concise representation of RDT semantics, which embodies the convergence condition.

Specification Model

To introduce the specification model, we rely on the reader’s intuitive understanding of the execution model from the previous section. At a high level, our goal is to define a minimal abstract description of each concrete execution that expresses the behavior of an implementation and justifies its correctness. Figure 2.7 gives an overview of the components involved in the semantic description model and of the definition of the correctness condition. On the left side, we illustrate sets of concrete executions in the low-level implementation domain. The specification domain, on the right, is made of abstract executions: traces of client operations extended with additional relations. A data type specification is defined on the abstract execution structure, and indirectly identifies the subset of abstract executions that satisfy the specification.
Our ultimate goal is to define a correctness condition for a data type implementation that relates each concrete execution to a witness abstract execution satisfying the specification. The client-observable behavior of an execution can be characterized by a trace that records the operations performed by each client session, the order in which each session issued them, and their return values. The intended behavior concerns the return values in a trace. However, a trace alone is insufficient to specify the intended behavior, because it lacks information about the relations between operations performed in different client sessions (at different replicas). For example, it is impossible to determine whether a counter implementation behaves correctly if the trace does not specify which increments were delivered to the replica of the client that performed a read. A trace must therefore be extended with additional information.
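For instance, once a read is paired with the set of operations visible to it, a counter specification in the style of Burckhardt et al. [34] can be written as a function of that context. The following is a sketch of the idea, with the notation made precise in Section 2.3.2:

```latex
% Sketch: the counter specification as a function of an operation context.
% V is the set of operations visible to the read; vis and ar are the
% visibility and arbitration relations (the counter ignores ar).
F_{\mathsf{Ctr}}\big(\mathsf{read}, (V, \mathsf{vis}, \mathsf{ar})\big)
  = \big|\{\, e \in V \mid \mathsf{op}(e) = \mathsf{inc} \,\}\big|
```

The read returns the number of visible increments, which is exactly the information a bare trace lacks.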

Execution Model and Correctness

In this section, we formalize a concrete execution model (Section 2.4.1), and relate concrete executions to abstract executions with an implementation correctness definition (Section 2.4.2), after Burckhardt et al. [34]. The correctness condition is parameterized by a pair of conditions, which leads us to a formal definition of several implementation categories (Section 2.4.3). The reader interested only in Part III of the thesis may proceed directly to Section 2.4.3.

Prior Implementations and Unsuccessful Optimizations

A number of other important data types exist for which we did not manage to find a correct optimization. Some of them provide behavior similar to the add-wins set and the multi-value register, and could act as their replacements. We report on their complexity here, which motivates some of the questions regarding optimality that we address in Chapter 4.

Last-Writer-Wins Register

Given the nonnegligible overhead of the multi-value register implementation, it is natural to investigate the complexity of the arbitration-based last-writer-wins register. The LWW register is specified as F_LWWReg in Equation 2.6; we presented its implementation in Section 2.2.3 (Algorithm 2.4).
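A typical LWW register implementation stores just the winning write: a single (value, timestamp) pair, overwritten only by a write with a higher timestamp. The following minimal sketch, under the same hypothetical interface as before, conveys the idea; details of Algorithm 2.4 may differ.

```python
# Minimal LWW register sketch: the state is the (value, timestamp) pair
# of the currently winning write; the higher timestamp wins everywhere.

BOTTOM = None  # ⊥: the initial value, and the return value of write

class LWWRegister:
    def initialize(self, replica_id):
        return {"value": BOTTOM, "timestamp": -1}

    def do(self, state, op, timestamp):
        if op == "read":
            return state["value"], state
        # write(a): the argument is part of the method name in the model;
        # we render it here as a tuple ("write", a).
        _, a = op
        if timestamp > state["timestamp"]:
            state = {"value": a, "timestamp": timestamp}
        return BOTTOM, state

    def send(self, state):
        return dict(state), state

    def receive(self, state, message):
        # Keep the pair with the higher timestamp (last writer wins).
        return message if message["timestamp"] > state["timestamp"] else state
```

Note that the state in this sketch is a single value-timestamp pair regardless of the history length, which is why the complexity question here reduces to the cost of the timestamp itself.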

Remove-Wins Set

Given that implementations of the add-wins set have noticeable overhead, it is natural to consider other set semantics. Algorithm 3.5 is an implementation of the remove-wins set semantics F_RWSet (Equation 2.10) by Bieniusa et al. [22].
The implementation is somewhat similar to the naive add-wins set, as it keeps track of every add and rem operation using two sets: a set of add instances A containing pairs (a, t), where a is an element and t is a timestamp, and a set of remove instances T containing the same type of pairs. The two sets are disjoint. The read method returns an element a if there is an add(a) instance and no rem(a) instance. The rem method simply creates a new remove instance. The add method turns any remove instances of the element into add instances or, if there are none, creates a new add instance. The replication protocol is similar to that of the naive add-wins set.
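A minimal state-based sketch consistent with this description follows (same hypothetical interface style as before). The merge rule shown, where add instances dominate because conversion from remove to add happens causally after the remove, is our reading of the scheme; Algorithm 3.5 may differ in detail.

```python
# Sketch of a state-based remove-wins set, following the description above.
# Element instances are (element, timestamp) pairs: A holds add instances,
# T remove instances; the two sets stay disjoint locally.

class RemoveWinsSet:
    def __init__(self):
        self.A = set()   # add instances: (element, timestamp)
        self.T = set()   # remove instances: (element, timestamp)

    def read(self):
        # An element is present if it has an add instance and no
        # remove instance: any surviving remove instance wins.
        removed = {a for (a, _) in self.T}
        return {a for (a, _) in self.A} - removed

    def add(self, a, t):
        # Turn existing remove instances of a into add instances;
        # if there are none, create a fresh add instance.
        instances = {(e, ts) for (e, ts) in self.T if e == a}
        if instances:
            self.T -= instances
            self.A |= instances
        else:
            self.A.add((a, t))

    def rem(self, a, t):
        # Simply record a new remove instance (the value must be stored).
        self.T.add((a, t))

    def merge(self, other):
        # Add instances dominate per instance: the remove-to-add conversion
        # happens causally after the remove, so the add polarity is newer.
        # A remove concurrent with an add uses a fresh timestamp, so it
        # survives the merge and dominates the add (remove wins).
        self.A |= other.A
        self.T = (self.T | other.T) - self.A
```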
To our surprise, the overhead of the remove-wins implementation is even higher than that of the naive add-wins implementation. Recall that the naive add-wins implementation stores only the timestamps of removed elements, not their values. An implementation of remove-wins semantics must also store the values of removed elements, because removes must dominate concurrent adds that are unaware of their timestamps. Since removed elements are of variable, unbounded size, there is no upper bound on the overhead w.r.t. m or n. Formally, we can give only an underestimated lower bound.

Driver Programs

To simplify the presentation, we avoid specifying an experiment family directly as a sequence of concrete execution steps; instead, we define it indirectly using a driver program that generates these execution steps (e.g., see Table 4.1). A driver program is written in imperative pseudocode with three types of instructions, each of which triggers a uniquely-determined configuration transition:
• v ← do_r(o, t): perform operation o at replica r with timestamp t, and assign the return value to v.
• send_r(mid): send a message with identifier mid at replica r.
• deliver_r(mid): deliver the message with identifier mid at replica r.
We structure code into procedures, and define a program as a sequence of procedure calls. When a driver program terminates, it may produce a return value from the last procedure. For a program P, an implementation D_τ, and a configuration (R, N), we let exec(D_τ, (R, N), P) be the concrete execution of the data type implementation D_τ starting in (R, N) that results from running P; we define result(D_τ, (R, N), P) as the return value of P in this run.
Programs explicitly supply timestamps for the do instruction and message identifiers for the send and deliver instructions. We require that they do so correctly, i.e., that they respect the uniqueness of timestamps and message identifiers, and deliver only previously-sent messages. This is the case for all our driver programs. For most of our programs, timestamps (arbitration) are irrelevant, as they concern visibility-based data types, but we include them for completeness and make use of them for arbitration-based data types.
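As an illustration, a driver program might be rendered in Python as follows, with a configuration object holding the replica states R and the in-flight messages N; a tiny state-based counter (the same construction sketched in Chapter 2) is included just to make the program executable. All names are illustrative, not the thesis's formal notation.

```python
# Hypothetical Python rendering of a driver program. A configuration
# (R, N) is modeled as per-replica states R plus a pool of in-flight
# messages N; do/send/deliver mirror the three instruction types.

class GCounter:
    """Tiny state-based counter, included only to make the program run."""
    def initialize(self, r):
        return {"id": r, "counts": {}}

    def do(self, s, op, t):
        if op == "inc":
            s["counts"][s["id"]] = s["counts"].get(s["id"], 0) + 1
            return None, s
        return sum(s["counts"].values()), s  # read

    def send(self, s):
        return dict(s["counts"]), s

    def receive(self, s, m):
        for r, c in m.items():
            s["counts"][r] = max(s["counts"].get(r, 0), c)
        return s

class Configuration:
    def __init__(self, impl, replica_ids):
        self.impl = impl
        self.states = {r: impl.initialize(r) for r in replica_ids}  # R
        self.messages = {}                                          # N

    def do(self, r, op, t):
        v, self.states[r] = self.impl.do(self.states[r], op, t)
        return v

    def send(self, r, mid):
        assert mid not in self.messages, "message identifiers are unique"
        m, self.states[r] = self.impl.send(self.states[r])
        self.messages[mid] = m

    def deliver(self, r, mid):
        # Only previously-sent messages may be delivered.
        self.states[r] = self.impl.receive(self.states[r], self.messages[mid])

def program(cfg):
    """Increment at r1, ship the update to r2, and read there."""
    cfg.do("r1", "inc", t=1)
    cfg.send("r1", mid="m1")
    cfg.deliver("r2", mid="m1")
    return cfg.do("r2", "read", t=2)

# result(D_tau, (R, N), P): the return value of P in this run.
assert program(Configuration(GCounter(), ["r1", "r2"])) == 1
```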

Table of Contents

List of Tables
List of Figures
I Preliminaries 
1 Introduction 
1.1 Contributions
1.1.1 Optimality of Replicated Data Types
1.1.2 Causally-Consistent Object Database for Client-Side Applications
1.2 Organization
1.3 Authorship and Published Results
2 Replicated Data Types 
2.1 Motivation
2.2 System Model and Object Implementation
2.2.1 Object Implementation Model
2.2.2 Replication
2.2.3 Examples
2.2.3.1 Counter Implementations
2.2.3.2 Register Implementations
2.3 Specification of Intended Behavior
2.3.1 Specification Model
2.3.2 Examples
2.3.2.1 Counter Specification
2.3.2.2 Register Specifications
2.3.2.3 Set Specifications
2.4 Execution Model and Correctness
2.4.1 Execution Model
2.4.2 Implementation Correctness
2.4.3 Implementation Categories
2.4.3.1 Network Specifications
2.4.3.2 Visibility Witnesses
2.4.3.3 Main Categories
II Optimality of Replicated Data Types 
3 Metadata Space Complexity Problem 
3.1 Problem Statement
3.2 Optimizing Implementations
3.2.1 Successful Optimizations
3.2.1.1 Add-Wins Set
3.2.1.2 Multi-Value Register
3.2.2 Prior Implementations and Unsuccessful Optimizations
3.2.2.1 Last-Writer-Wins Register
3.2.2.2 Remove-Wins Set
3.2.2.3 Last-Writer-Wins Set
3.2.2.4 Counter
3.3 Summary
4 Lower Bounds on Complexity and Implementation Optimality 
4.1 Proof Technique
4.1.1 Experiment Family
4.1.2 Driver Programs
4.2 Lower Bounds
4.2.1 Counter
4.2.2 Add-Wins Set
4.2.3 Remove-Wins Set
4.2.4 Last-Writer-Wins Set
4.2.5 Multi-Value Register
4.2.6 Last-Writer-Wins Register
4.3 Summary
5 Related Work and Discussion 
5.1 Other Data Types
5.2 State-Based Optimizations Beyond Our Metric
5.2.1 Background Compaction of Stable Metadata
5.2.2 Finer-Grained Optimizations
5.2.3 Custom Timestamps
5.3 Other Implementation Categories
5.3.1 State-Based Implementations With Smaller Messages
5.3.2 Optimizations Based on Topology Restrictions and Delayed Visibility
5.3.3 Replicated File Systems
5.4 Lower Bound Proofs in Distributed Computing
III Causally-Consistent Object Database for Client-Side Applications 
6 Problem Overview 
6.1 System Model and Basic Requirements
6.2 Consistency with Convergence
6.2.1 Causal Consistency
6.2.2 Convergence with Replicated Data Types
6.3 Application Programming Interface
6.4 Challenge
6.4.1 Metadata Design
6.4.2 Causal Consistency with Partial Replication is Hard
7 The SwiftCloud Approach 
7.1 Design
7.1.1 Causal Consistency at Full Data Center Replicas
7.1.2 Causal Consistency at Partial Client Replicas
7.1.3 Failing Over: The Issue with Transitive Causal Dependency
7.1.3.1 Conservative Read: Possibly Stale, But Safe
7.1.3.2 Discussion
7.2 Implementation
7.2.1 Timestamps, Vectors and Log Merge
7.2.2 Protocols
7.2.2.1 State
7.2.2.2 Client-Side Execution
7.2.2.3 Transfer Protocol: Client to Data Center
7.2.2.4 Geo-replication Protocol: Data Center to Data Center
7.2.2.5 Notification Protocol: Data Center to Client
7.2.3 Object Checkpoints and Log Pruning
7.2.3.1 Log Pruning in the Data Center
7.2.3.2 Pruning the Client’s Log
8 Experimental Evaluation 
8.1 Prototype and Applications
8.2 Experimental Setup
8.3 Experimental Results
8.3.1 Response Time and Throughput
8.3.2 Scalability
8.3.3 Tolerating Client Churn
8.3.4 Tolerating Data Center Failures
8.3.5 Staleness Cost
9 Related Work 
9.1 Consistency Models for High Availability
9.2 Relevant Systems
9.2.1 Replicated Databases for Client-Side Applications
9.2.1.1 Systems that Support Inter-Object Consistency
9.2.1.2 Systems that Support Intra-Object Consistency Only
9.2.1.3 Session Guarantees
9.2.2 Geo-replicated Databases for Server-Side Applications
9.2.2.1 Approaches
9.2.2.2 Comparison and Applicability to Client-Side Replication
10 Conclusion 
10.1 Summary
10.2 Limitations and Perspectives
IV Appendix
A Additional Material on Replicated Data Types 
A.1 Formal Network Layer Specifications
A.2 Optimized Op-Based Implementations
B Metadata Overhead Proofs 
B.1 Standard Encoding
B.2 Metadata Overhead of Specific Implementations
B.3 Lower Bound Proofs
B.3.1 Add-Wins Set
B.3.2 Remove-Wins Set
B.3.3 Last-Writer-Wins Set
B.3.4 Multi-Value Register
B.3.5 Last-Writer-Wins Register
C Résumé de la thèse (Summary of the Thesis, in French) 
C.1 Optimality of Replicated Data Types
C.2 Causally-Consistent Object Database for Client-Side Applications
C.2.1 Problem Overview
C.2.1.1 Consistency and Convergence
C.2.1.2 Metadata Design
C.2.1.3 Causal Consistency with Partial Replication is Hard
C.2.2 The SwiftCloud Approach
C.2.2.1 Causal Consistency at Full Data Center Replicas
C.2.2.2 Causal Consistency at Partial Client Replicas
C.2.2.3 Failing Over: The Issue with Transitive Causal Dependency
C.2.2.4 Protocols with Decoupled and Bounded Metadata
C.2.3 Implementation and Evaluation
Bibliography 
