Lecture 13: Chain replication

General topic: Fault tolerance
  Multiple replicas
  Service even if one or more replicas fail

Earlier lectures: Crash recovery
  Recover from failure
  No service during recovery

Last lecture: State machine replication
  N-way replication
  Service even while one or more replicas recover
  Fail-stop assumption (no need for consensus)

Today's lecture: t-way replication for storage service
  Fail-stop assumption (no need for consensus)
  Availability while failed replicas recover
  Strong consistency

Goal: Storage service with query (read) and update (write) operations on
/individual objects/

Simultaneously attain:
  High throughput
  Availability despite failure
  Strong consistency

In the fail-stop model, any two of these three are easy to attain.

High throughput + availability: Google File System (GFS)
  - Updates go to a primary, but readers take no lock.
  - With failures, writes may be applied to only some nodes.
  - Readers have to be careful.

Availability + strong consistency:
  Updates go to a primary, which holds a lock during the update.
  The primary replies only after all replicas apply the changes.
  On failure of the primary, make another replica the primary.

Strong consistency + high throughput:
  Don't replicate. Shard to scale throughput.

How to attain all three?

----------------
Chain replication
----------------

Key building block: The requirement is strong consistency, not strict
consistency. Read/write requests can be ordered almost arbitrarily.
Assume that it is okay to drop some requests occasionally (the client will
retry).

Precise property attained:
  - Concurrent query and update operations on each object are totally ordered
    by the service.
  - Effects of update operations are necessarily reflected in query operations
    later in this order.
  (The above do not imply sequential consistency, since they do not prevent
  requests from the same client from being reordered, but the actual design
  does provide sequential consistency.)

Basic idea:
  For each object, t replicas are chosen (t is the replication factor, e.g.,
  1-10). Denote the replicas R(1) ... R(t).
  Replicas are organized in a chain: R(1) -> R(2) -> ... -> R(t)
  R(1) is called the "head"; R(t) is called the "tail".

Update requests:
  All update requests on an object go to R(1) of the object.
  R(1) receives the update request.
  R(1) computes the object changes, i.e., a delta.
  R(1) applies the change and pushes the delta to R(2).
  R(2) applies the change and pushes the delta to R(3).
  ...
  R(t) applies the change.
  R(t) responds to the client.

  Updates "ripple" along the chain from R(1) to R(t).
  Multiple updates are pushed in order at each step.
  With many updates, there is a pipeline effect.
  Throughput = peak throughput of a single server (the slowest is usually
  R(1)).

Key invariant: At any point, for i < j, R(j) has applied a prefix of the
updates that R(i) has applied.

Query requests:
  All query requests on an object go to R(t) of the object.
  R(t) serves the request from its current local state.
  R(t) may not have seen all updates that have been received by R(1).
  However, this is okay for the strong consistency property we want.
  If R(t) gets a read request before an earlier update request received by
  R(1) has propagated to it, then the two requests are concurrent (since only
  R(t) could have responded to the update request). So it is okay to reorder
  them.

Q: Who establishes the total order here?
A: The total order is established at R(t). Updates themselves are totally
   ordered at R(1).

Q: Why high throughput?
A: Queries are handled locally at R(t): same query throughput as a single
   server. Updates are received at R(1), but R(1) does not wait for an update
   to propagate before taking the next one. Hence, updates take advantage of
   the pipeline. (A sketch of the update and query paths appears below.)
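The update/query split can be made concrete with a minimal in-process sketch,
assuming one object per chain and synchronous calls in place of asynchronous
network messages; the class and function names here are hypothetical, not
from the paper.

    # Minimal sketch of a chain for a single object's key/value state.
    # Each handle_update hop stands in for a network message; a real
    # implementation forwards deltas asynchronously to get pipelining.

    class Replica:
        def __init__(self):
            self.state = {}        # the object's key -> value state
            self.successor = None  # next replica in the chain; None at the tail

        def handle_update(self, key, value):
            self.state[key] = value                # apply the delta locally
            if self.successor is not None:
                return self.successor.handle_update(key, value)  # ripple on
            return "ok"                            # tail replies to the client

        def handle_query(self, key):
            return self.state.get(key)             # served from tail-local state

    def make_chain(t):
        replicas = [Replica() for _ in range(t)]
        for pred, succ in zip(replicas, replicas[1:]):
            pred.successor = succ
        return replicas[0], replicas[-1]           # (head R(1), tail R(t))

    head, tail = make_chain(3)
    head.handle_update("x", 42)    # update enters at the head
    print(tail.handle_query("x"))  # query at the tail sees it: prints 42

Because forwarding here is synchronous, the prefix invariant holds trivially;
the asynchronous pipeline is what makes maintaining it (and the ack protocol
later in these notes) interesting.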
Q: What are the negatives for performance?
A: Update /latency/ is high: an update must wait for propagation from R(1) to
   R(t). In a primary/backup setup, all backups can apply an update in
   parallel.

Q: The paper claims (page 6, end of column 1) that the sharing of
   responsibility between R(1) and R(t) enables lower-latency and
   lower-overhead processing for queries as compared to the primary/backup
   approach. How true is this?
A: This is true if queries don't need to read previous updates. When that
   isn't the case, queries must suffer the full latency of the previous
   updates. In practice the assumption may often be true.

Q: Suppose we want to scale query (read) throughput beyond one server. Can we
   use both R(t-1) and R(t) to serve queries?
A: This gives only weak consistency. Consider the case where an update has
   propagated to R(t-1) but not to R(t). A thread issues query q1, which goes
   to R(t-1) and returns with the update. The thread next issues query q2,
   which goes to R(t) and returns without the update. This breaks strong
   consistency: q1 must be ordered before q2 (q1 and q2 are sequential
   queries: q1 returned before q2 was issued), but q2 does not see an update
   that q1 did. Sequential consistency can be regained if all requests from a
   single client go to a single replica; however, it is unclear what to do in
   case of failure.

The protocol's behavior can be described abstractly in terms of a simple
state machine (see Fig. 1 of the paper).

Performance (Section 5.1):
  Figure 4 shows throughput as a function of the number of replicas and of
  the percentage of update requests (relative to total update and query
  requests). Three configurations: chain replication, a primary/backup setup,
  and chain replication where queries go to any node in the chain (with weak
  consistency), called "weak-chain" or "weak".

Q: For chain replication and primary/backup, why are the throughput vs.
   update-% curves identical for all replication factors?
A: Replication has no impact on throughput, due to pipelining, in both
   settings.

Q: Why does throughput decrease as the percentage of update operations
   increases?
A: Because updates are assumed to be more expensive than queries in these
   simulations.

Q: Why does weak-chain perform much better than primary/backup and chain
   only when the update-% is low?
A: Weak-chain only parallelizes query requests.

Failures:
  There is a separate master, assumed not to fail in principle; in practice,
  it is implemented via replication and consensus (next lecture).

  Fail-stop assumption on replicas:
  - On failure, replicas just stop.
  - Failures can be detected by the master.

  Failure recovery depends on what fails.

Head fails (R(1) fails):
  The master designates R(2) as the new head by sending it a message, and
  broadcasts the new head to all clients.
  Any updates pending at R(1) are lost.
  Assumption: Clients can handle silent failures of updates (timeouts).

Tail fails (R(t) fails):
  The master designates R(t-1) as the new tail by sending it a message, and
  broadcasts the new tail to all clients.
  Any queries or responses pending at R(t) are lost.
  Assumption: Clients can handle silent failures of all requests (timeouts).

Neither of these operations affects the chain invariant or strong
consistency. (A sketch of the master's logic for both cases follows.)
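As a sketch of the two endpoint cases, assuming the master tracks the chain
as an ordered list of replica ids and informs clients through a hypothetical
notify_clients callback:

    # Hypothetical master-side handling of head and tail failures.
    # `chain` is the master's view: an ordered list of replica ids.

    def handle_endpoint_failure(chain, failed, notify_clients):
        if failed == chain[0]:                 # head R(1) failed
            chain.pop(0)                       # R(2) becomes the new head
            notify_clients(head=chain[0])      # pending updates at R(1) are lost
        elif failed == chain[-1]:              # tail R(t) failed
            chain.pop()                        # R(t-1) becomes the new tail
            notify_clients(tail=chain[-1])     # pending replies at R(t) are lost

    chain = ["R1", "R2", "R3"]
    handle_endpoint_failure(chain, "R1", lambda **new: print("new endpoints:", new))
    # prints: new endpoints: {'head': 'R2'}; chain is now ["R2", "R3"]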
Failure in the middle of the chain (R(i) fails, where i <> 1 and i <> t):
  The master removes R(i) from the chain by sending messages to R(i-1) and
  R(i+1).
  Problem: What if R(i) had seen updates that it hadn't forwarded to R(i+1)?
  R(i-1) can't forward any more updates to R(i+1) without also forwarding
  those missing updates. How does R(i-1) know what the missing updates are?

Solution (strawman): Each replica remembers all updates it has sent forward,
and the last update it has seen. Then R(i+1) can tell R(i-1) what last update
it has seen, and R(i-1) can send it all the missing updates.
  Problem: Needs unbounded history!

Solution: When the tail R(t) has applied update k, it sends ack(k) back to
R(t-1), which sends it to R(t-2), and so on. When R(i) sees ack(k), it
deletes update k from its history, since every node downstream has seen the
update.

Protocol for R(i) failure, when i <> 1 and i <> t (see the sketch at the end
of these notes):
  Master detects that R(i) has failed
  Master -> R(i+1): New predecessor is R(i-1); what last update have you seen?
  R(i+1) -> Master: Last update = k
  Master -> R(i-1): New successor is R(i+1); send updates after k
  R(i-1) -> R(i+1): All updates after k
  The chain works normally after this.

Chain extension:
  When the chain gets too short, extend it with new servers.
  It is easiest to extend the chain by adding a new tail:
  A new tail R(t+1) is added. R(t) sends its current state to R(t+1) and
  starts forwarding updates to R(t+1). Clients are informed of the change.

Q: When can R(t+1) start handling queries?
A: After it has seen the last update that has been followed by a completed
   query.
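Finally, a minimal sketch of the bounded-history mechanism and the mid-chain
recovery handshake, assuming in-process objects in place of the master's
messages; ChainNode and recover_middle are hypothetical names:

    # Each replica keeps only the updates not yet acked by the tail; when a
    # middle node fails, the master asks the successor for its last update k
    # and has the predecessor resend everything after k.

    class ChainNode:
        def __init__(self):
            self.last_seen = 0   # sequence number of the last update applied
            self.history = {}    # seq -> update, pending (un-acked) updates

        def apply_update(self, seq, update):
            self.last_seen = seq
            self.history[seq] = update    # kept until ack(seq) propagates back

        def handle_ack(self, seq):
            self.history.pop(seq, None)   # all downstream nodes have seq

    def recover_middle(pred, succ):
        """Master splices out a failed middle node between pred and succ."""
        k = succ.last_seen                      # "what last update have you seen?"
        for seq, update in sorted(pred.history.items()):
            if seq > k:                         # "send updates after k"
                succ.apply_update(seq, update)

    r1, r2, r3 = ChainNode(), ChainNode(), ChainNode()
    for seq in (1, 2, 3):
        for node in (r1, r2, r3):
            node.apply_update(seq, "u%d" % seq)
    r1.apply_update(4, "u4"); r2.apply_update(4, "u4")  # update 4 stuck at R(2)
    recover_middle(r1, r3)                              # R(2) fails; splice chain
    print(r3.last_seen)                                 # prints 4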