Lecture 13: Chain replication

General topic: Fault tolerance
  Multiple replicas
  Service even if one or more replicas fail

Earlier lectures: Crash recovery
  Recover from failure
  No service during recovery

Last lecture: State machine replication
  N-way replication
  Service even while one or more replicas recover
  Fail-stop assumption (no need for consensus)

Today's lecture: t-way replication for storage service
  Fail-stop assumption (no need for consensus)
  Availability while failed replicas recover
  Strong consistency

Goal: Storage service with query (read) and update (write) operations on
/individual objects/

Simultaneously attain:
  High throughput
  Availability despite failure
  Strong consistency

In the fail-stop model, any two of these three are easy to attain.

High throughput + availability: Google File System (GFS)
  - Updates go to a primary, but readers take no lock.
  - With failures, writes may be applied to only some nodes.
  - Readers have to be careful.

Availability + strong consistency:
  Updates go to a primary, which holds a lock during the update.
  The primary replies only after all replicas apply the changes.
  On failure of the primary, make another replica the primary.

Strong consistency + high throughput:
  Don't replicate. Shard to scale throughput.

How to attain all three?

----------------
Chain replication
----------------

Key building block: The requirement is strong consistency, not strict
consistency. Read/write requests can be ordered almost arbitrarily.
Assume that it is okay to drop some requests occasionally (the client will
retry).

Precise property attained:
  - Concurrent query and update operations on each object are totally ordered
    by the service.
  - Effects of update operations are necessarily reflected in query operations
    later in this order.
  (The above do not imply sequential consistency, since they do not prevent
  requests from the same client from being reordered, but the actual design
  does provide sequential consistency.)

Basic idea:
  For each object, t replicas are chosen (t is the replication factor, e.g.,
  1-10). Denote the replicas R(1) ... R(t).
  Replicas are organized in a chain: R(1) -> R(2) -> ... -> R(t)
  R(1) is called the "head"; R(t) is called the "tail".

Update requests:
  All update requests on an object go to R(1) of the object.
  R(1) receives the update request.
  R(1) computes the object changes, i.e., a delta.
  R(1) applies the change and pushes the delta to R(2).
  R(2) applies the change and pushes the delta to R(3).
  ...
  R(t) applies the change.
  R(t) responds to the client.

  Updates "ripple" along the chain from R(1) to R(t).
  Multiple updates are pushed in order at each step.
  With many updates, there is a pipeline effect.
  Throughput = peak throughput of a single server (the slowest is usually
  R(1)).

Key invariant: At any point, for i < j, R(j) has applied a prefix of the
updates that R(i) has applied.

Query requests:
  All query requests on an object go to R(t) of the object.
  R(t) serves the request from its current local state.
  R(t) may not have seen all updates that have been received by R(1).
  However, this is okay for the strong consistency property we want.
  If R(t) gets a read request before an earlier update request received by
  R(1) has propagated to it, then the two requests are concurrent (since only
  R(t) could have responded to the update request). So it is okay to reorder
  them.

Q: Who establishes the total order here?
A: The total order is established at R(t). Updates themselves are totally
   ordered at R(1).

Q: Why high throughput?
A: Queries are handled locally at R(t): same query throughput as a single
   server. Updates are received at R(1), but R(1) does not wait for an update
   to propagate before taking the next one. Hence, updates take advantage of
   the pipeline. (A sketch of the update and query paths appears below.)
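The update/query split can be made concrete with a minimal in-process sketch,
assuming one object per chain and synchronous calls in place of asynchronous
network messages; the class and function names here are hypothetical, not
from the paper.

    # Minimal sketch of a chain for a single object's key/value state.
    # Each handle_update hop stands in for a network message; a real
    # implementation forwards deltas asynchronously to get pipelining.

    class Replica:
        def __init__(self):
            self.state = {}        # the object's key -> value state
            self.successor = None  # next replica in the chain; None at the tail

        def handle_update(self, key, value):
            self.state[key] = value                # apply the delta locally
            if self.successor is not None:
                return self.successor.handle_update(key, value)  # ripple on
            return "ok"                            # tail replies to the client

        def handle_query(self, key):
            return self.state.get(key)             # served from tail-local state

    def make_chain(t):
        replicas = [Replica() for _ in range(t)]
        for pred, succ in zip(replicas, replicas[1:]):
            pred.successor = succ
        return replicas[0], replicas[-1]           # (head R(1), tail R(t))

    head, tail = make_chain(3)
    head.handle_update("x", 42)    # update enters at the head
    print(tail.handle_query("x"))  # query at the tail sees it: prints 42

Because forwarding here is synchronous, the prefix invariant holds trivially;
the asynchronous pipeline is what makes maintaining it (and the ack protocol
later in these notes) interesting.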
Q: What are the negatives for performance?
A: Update /latency/ is high: an update must wait for propagation from R(1) to
   R(t). In a primary/backup setup, all backups can apply an update in
   parallel.

Q: The paper claims (page 6, end of column 1) that the sharing of
   responsibility between R(1) and R(t) enables lower-latency and
   lower-overhead processing for queries as compared to the primary/backup
   approach. How true is this?
A: This is true if queries don't need to read previous updates. When that
   isn't the case, queries must suffer the full latency of the previous
   updates. In practice the assumption may often be true.

Q: Suppose we want to scale query (read) throughput beyond one server. Can we
   use both R(t-1) and R(t) to serve queries?
A: This gives only weak consistency. Consider the case where an update has
   propagated to R(t-1) but not to R(t). A thread issues query q1, which goes
   to R(t-1) and returns with the update. The thread next issues query q2,
   which goes to R(t) and returns without the update. This breaks strong
   consistency: q1 must be ordered before q2 (q1 and q2 are sequential
   queries: q1 returned before q2 was issued), but q2 does not see an update
   that q1 did. Sequential consistency can be regained if all requests from a
   single client go to a single replica; however, it is unclear what to do in
   case of failure.

The protocol's behavior can be described abstractly in terms of a simple
state machine (see Fig. 1 of the paper).

Performance (Section 5.1):
  Figure 4 shows throughput as a function of the number of replicas and of
  the percentage of update requests (relative to total update and query
  requests). Three configurations: chain replication, a primary/backup setup,
  and chain replication where queries go to any node in the chain (with weak
  consistency), called "weak-chain" or "weak".

Q: For chain replication and primary/backup, why are the throughput vs.
   update-% curves identical for all replication factors?
A: Replication has no impact on throughput, due to pipelining, in both
   settings.

Q: Why does throughput decrease as the percentage of update operations
   increases?
A: Because updates are assumed to be more expensive than queries in these
   simulations.

Q: Why does weak-chain perform much better than primary/backup and chain
   only when the update-% is low?
A: Weak-chain only parallelizes query requests.

Failures:
  There is a separate master, assumed not to fail in principle; in practice,
  it is implemented via replication and consensus (next lecture).

  Fail-stop assumption on replicas:
  - On failure, replicas just stop.
  - Failures can be detected by the master.

  Failure recovery depends on what fails.

Head fails (R(1) fails):
  The master designates R(2) as the new head by sending it a message, and
  broadcasts the new head to all clients.
  Any updates pending at R(1) are lost.
  Assumption: Clients can handle silent failures of updates (timeouts).

Tail fails (R(t) fails):
  The master designates R(t-1) as the new tail by sending it a message, and
  broadcasts the new tail to all clients.
  Any queries or responses pending at R(t) are lost.
  Assumption: Clients can handle silent failures of all requests (timeouts).

Neither of these operations affects the chain invariant or strong
consistency. (A sketch of the master's logic for both cases follows.)
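As a sketch of the two endpoint cases, assuming the master tracks the chain
as an ordered list of replica ids and informs clients through a hypothetical
notify_clients callback:

    # Hypothetical master-side handling of head and tail failures.
    # `chain` is the master's view: an ordered list of replica ids.

    def handle_endpoint_failure(chain, failed, notify_clients):
        if failed == chain[0]:                 # head R(1) failed
            chain.pop(0)                       # R(2) becomes the new head
            notify_clients(head=chain[0])      # pending updates at R(1) are lost
        elif failed == chain[-1]:              # tail R(t) failed
            chain.pop()                        # R(t-1) becomes the new tail
            notify_clients(tail=chain[-1])     # pending replies at R(t) are lost

    chain = ["R1", "R2", "R3"]
    handle_endpoint_failure(chain, "R1", lambda **new: print("new endpoints:", new))
    # prints: new endpoints: {'head': 'R2'}; chain is now ["R2", "R3"]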
Failure in the middle of the chain (R(i) fails, where i <> 1 and i <> t):
  The master removes R(i) from the chain by sending messages to R(i-1) and
  R(i+1).
  Problem: What if R(i) had seen updates that it hadn't forwarded to R(i+1)?
  R(i-1) can't forward any more updates to R(i+1) without also forwarding
  those missing updates. How does R(i-1) know what the missing updates are?

Solution (strawman): Each replica remembers all updates it has sent forward,
and the last update it has seen. Then R(i+1) can tell R(i-1) what last update
it has seen, and R(i-1) can send it all the missing updates.
  Problem: Needs unbounded history!

Solution: When the tail R(t) has applied update k, it sends ack(k) back to
R(t-1), which sends it to R(t-2), and so on. When R(i) sees ack(k), it
deletes update k from its history, since every node downstream has seen the
update.

Protocol for R(i) failure, when i <> 1 and i <> t (see the sketch at the end
of these notes):
  Master detects that R(i) has failed
  Master -> R(i+1): New predecessor is R(i-1); what last update have you seen?
  R(i+1) -> Master: Last update = k
  Master -> R(i-1): New successor is R(i+1); send updates after k
  R(i-1) -> R(i+1): All updates after k
  The chain works normally after this.

Chain extension:
  When the chain gets too short, extend it with new servers.
  It is easiest to extend the chain by adding a new tail:
  A new tail R(t+1) is added. R(t) sends its current state to R(t+1) and
  starts forwarding updates to R(t+1). Clients are informed of the change.

Q: When can R(t+1) start handling queries?
A: After it has seen the last update that has been followed by a completed
   query.
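Finally, a minimal sketch of the bounded-history mechanism and the mid-chain
recovery handshake, assuming in-process objects in place of the master's
messages; ChainNode and recover_middle are hypothetical names:

    # Each replica keeps only the updates not yet acked by the tail; when a
    # middle node fails, the master asks the successor for its last update k
    # and has the predecessor resend everything after k.

    class ChainNode:
        def __init__(self):
            self.last_seen = 0   # sequence number of the last update applied
            self.history = {}    # seq -> update, pending (un-acked) updates

        def apply_update(self, seq, update):
            self.last_seen = seq
            self.history[seq] = update    # kept until ack(seq) propagates back

        def handle_ack(self, seq):
            self.history.pop(seq, None)   # all downstream nodes have seq

    def recover_middle(pred, succ):
        """Master splices out a failed middle node between pred and succ."""
        k = succ.last_seen                      # "what last update have you seen?"
        for seq, update in sorted(pred.history.items()):
            if seq > k:                         # "send updates after k"
                succ.apply_update(seq, update)

    r1, r2, r3 = ChainNode(), ChainNode(), ChainNode()
    for seq in (1, 2, 3):
        for node in (r1, r2, r3):
            node.apply_update(seq, "u%d" % seq)
    r1.apply_update(4, "u4"); r2.apply_update(4, "u4")  # update 4 stuck at R(2)
    recover_middle(r1, r3)                              # R(2) fails; splice chain
    print(r3.last_seen)                                 # prints 4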