Lecture 9: Eventual Consistency

Background: Logical clocks

Problem: How to order events in a distributed system when real-time
clocks are not synchronized?

We first need to introduce a notion of ordering before we can
order anything. The *happened-before relation* on the set of events in a
distributed system is the smallest relation satisfying:

1) If a and b are two events in the same process, and a comes before b, then a -> b
   "a happened before b"
2) If a denotes the sending of a message and b the receipt of that message, 
   then a -> b
3) If a -> b and b -> c, then a -> c.

Note: This relation defines a partial ordering of events in a system with
concurrently executing processes.

Logical Clocks (a.k.a. Lamport Clocks)

Problem: How do we maintain a global view of the system's behavior
that is consistent with the happened-before relationship?

Solution: attach a timestamp C(e) to each event e, satisfying the
following properties:

P1: If a and b are two events in the same process and a -> b, then C(a) < C(b)
P2: If a corresponds to sending a message m, and b to the receipt of m, 
    then C(a) < C(b)

Problem: How to attach a timestamp to an event when there is no global
clock? => maintain a consistent set of logical clocks, one for each
node.

Each node Ni maintains a local counter Ci, and adjusts this counter
according to the following rules:

(1) Between any two successive events within Ni, increment Ci by one.
(2) When a message m is sent by Ni, the message receives a timestamp Tm <- Ci.
(3) When a message m is received by node Nj, Cj is adjusted as Cj <- max{Cj, Tm+1}.

Property P1 is satisfied by (1); property P2 by (2) and (3).

      
How can we achieve a total order of events with logical clocks?

It is possible that two events happen at the same (logical)
time. Avoid this by attaching a node number to an event (Node id is
used as a tie-breaker):

Pi timestamps each event e with Ci(e).i 
  (Ci(e) concatenated with i's unique node identifier)

Then: Ci(a).i < Cj(b).j ("." indicates concatenation) iff

- Ci(a) < Cj(a), or
- Ci(a) = Cj(b) and i < j


---------------------

Managing Update Conflicts in Bayou,
a Weakly Connected Replicated Storage System
Terry, Theimer, Petersen, Demers, Spreitzer, Hauser

[note that ordering the updates is described in a different paper]

Let's build a calendar system to help understand ordering and conflicts.
 Each entry has a time and a set of participants.
 We want everyone to end up seeing the same set of entries.

Traditional approach: one calendar server.
 Ordering: only one copy, server picks the order.
 Conflict resolution: server checks for conflicts before accepting update.
   Returns error to user, who can handle it in any way desired.

Why aren't we satisfied with central server?
 I want my calendar on my PDA/smart phone.
   I.e. database replicated in every node.
   No master copy.
 Periodic connectivity to net.
 Periodic Infrared/BlueTooth contact with other calendar users.

Straw man 1: swap complete DBs.
 Similar to Palm Pilot sync.
 Might require lots of network b/w.
 What if there's a conflict? I.e., two meetings at same time.
   Palm just schedules them both!
   But we want automatic conflict resolution.
 We can't just view DB items as bits.
   Since then we can't resolve conflicts.
 We want intelligent DB items that know how to resolve conflicts.
   They are more like updates: read DB, think, make a change.
   But we have to make sure that all nodes resolve conflicts the same way.
   How?

Insight:
 Maintain an ordered list of updates at each node.
 Make sure every node has the same updates.
 Make sure every node applies the updates in the same order.
 Make sure the updates are deterministic functions of DB contents.
 Then a "sync" is really a merge of two ordered lists: easy.
   Not a DB merge at all.

What are the write log entries?
 Why not just "10am meeting with Robert and Frans"?
   Because we can't resolve conflicts.
 Instead, "1-hour meeting w/ Robert and Frans at 9, otherwise 10, otherwise 11."
   Along with a unique ID: <local-time-stamp, originating-node>
   This is really instructions for doing a write, not the written data.
   So the write log is really an instruction in the distributed calendar program.
     We want all nodes to run same instructions in same order. Eventually.

Example:
 <701,A>: Node A asks for meeting M1 to occur at 10am, otherwise 11am.
 <770,B>: Node B asks for meeting M2 to occur at 10am, otherwise 11am.
 Let's agree to sort by write ID (e.g. <701,A>).
 As "writes" spread, nodes may initially apply updates in different order.
 Each newly seen write is merged into the log,
   and the log is replayed,
   which may cause the calendar displayed to the user to change.
   I.e. all entries are really "tentative", nothing is stable.
 But when everybody has seen all the writes, everybody will agree.

Global time sync is not possible.
 Does that make this particular scheme impossible?
 No. We're just using time stamps to allow agreement on order.
   Doesn't matter if node clocks are wrong.
   As long as users don't depend on reasoning about real time.
 But this scheme arbitrarily constrains order.
   You never know if there's some write from the past you haven't seen.
   So all entries must be tentative forever.
   And you have to keep the log around forever.

How can we allow a notion of committing a tentative entry?
 So we can have the meetings! And trim the logs.
 For an entry X to be committed, everyone must agree on:
   The total order of all previous committed entries.
   The fact that X is the next in the total order.
   The fact that all uncommitted entries are "after" X.

How does Bayou agree on total order of committed writes?
 One designated "primary replica".
 It marks each write it receives with a permanent CSN.
   Commit Sequence Number.
   That write is committed.
   So a complete time stamp is <CSN, local-TS, node-id>
 CSN notifications are exchanged between nodes.
 The CSNs define a total order for committed writes.
   All nodes will eventually agree on it.
   Uncommitted writes come after all committed writes.

Do we now know enough to show user "committed" flag with entries?
 Not quite.
 Entire log up to that committed entry must be stable.
   Otherwise there might be earlier committed write we don't know about.
   And we'd have to re-run conflict resolution.
 So a committed write isn't stable unless we've seen all prior committed writes.
 Bayou update propagation protocol maintains this.
   Propagates writes in order.

Now DB entries can be shown to user as committed.
 Which means everyone does (or will) agree on them.
 And a slow or disconnected node cannot prevent commits.
 Nodes may still disagree on the meaning of uncommitted writes, though.
   Even if they have synced with each other!
     *** well, only if one then sees a CSN that the other doesn't see.
   Only arrival of CSNs will resolve.
   EXAMPLE
     Roll-back, or start at stable DB as of highest CSN seen?

Now nodes can discard log entries with CSNs.
 As long as they've seen every CSN up to that point.
   (Turns out to be guaranteed by propagation protocol.)
 Instead, keep a copy of the database as of the highest CSN.
 That *is* the offical committed database.
   Everybody does (or will) agree on it.
   Its entries will never need to have conflict resolution.
 So you don't need to keep years of calendar write log entries.

Is is OK if primary replica can choose *any* order to commit?
 Suppose I schedule an event, then want to delete it?
   Or change attendee list?
 The create must precede the delete in the CSN order!
   And in every node's view of uncommitted part of log too.
 Total order must preserve order of writes originated at each node.
   But not necessarily order among different nodes' writes.
 How can primary replica be sure it commits each node's writes in order?
 1) Nodes actually use Lamport logical time-stamps for local TS.
 2) Everybody sends updates in order.
 So primary sees updates in causal order, and commits them in that order.

How do I propagate if I've discarded part of my log?
 Suppose I've discarded *all* writes with CSNs.
 I keep a copy of the stable DB reflecting just discarded entries.
 First, I know I cannot receive writes that conflict.
   Could only happen if write has a CSN < one discarded.
   But I already saw it, in the right order, so can ignore.
 When I propagate to node X:
   If node X's highest CSN is less than mine,
     I can send him my stable DB reflecting just committed writes.
     Node X can use my DB as starting point.
     And X can discard all CSN log entries.
     Then play his tentative writes into that DB.
   If node X's highest CSN is greater than mine,
     X can ignore my DB.

Evaluation
 Seems much more functional than Palm Pilot calendar.
 *Not* transparent to applications!
   Requires very strange programming practices.
   Every "write" is actually a bunch of code, not the new bits.
   Check conflicts (meeting already scheduled for any attendee?).
   Resolve conflicts (choose a different unused meeting time?).
 Not every application has appropriate semantics.
   Might actually work for a bank account!
   But conflicts can't always be automatically resolved.
     For example, changes to source code from multiple programmers.