Lecture 9: Eventual Consistency Background: Logical clocks Problem: How to order events in a distributed system when real-time clocks are not synchronized? We first need to introduce a notion of ordering before we can order anything. The *happened-before relation* on the set of events in a distributed system is the smallest relation satisfying: 1) If a and b are two events in the same process, and a comes before b, then a -> b "a happened before b" 2) If a denotes the sending of a message and b the receipt of that message, then a -> b 3) If a -> b and b -> c, then a -> c. Note: This relation defines a partial ordering of events in a system with concurrently executing processes. Logical Clocks (a.k.a. Lamport Clocks) Problem: How do we maintain a global view of the system's behavior that is consistent with the happened-before relationship? Solution: attach a timestamp C(e) to each event e, satisfying the following properties: P1: If a and b are two events in the same process and a -> b, then C(a) < C(b) P2: If a corresponds to sending a message m, and b to the receipt of m, then C(a) < C(b) Problem: How to attach a timestamp to an event when there is no global clock? => maintain a consistent set of logical clocks, one for each node. Each node Ni maintains a local counter Ci, and adjusts this counter according to the following rules: (1) Between any two successive events within Ni, increment Ci by one. (2) When a message m is sent by Ni, the message receives a timestamp Tm <- Ci. (3) When a message m is received by node Nj, Cj is adjusted as Cj <- max{Cj, Tm+1}. Property P1 is satisfied by (1); property P2 by (2) and (3). How can we achieve a total order of events with logical clocks? It is possible that two events happen at the same (logical) time. Avoid this by attaching a node number to an event (Node id is used as a tie-breaker): Pi timestamps each event e with Ci(e).i (Ci(e) concatenated with i's unique node identifier) Then: Ci(a).i < Cj(b).j ("." indicates concatenation) iff - Ci(a) < Cj(a), or - Ci(a) = Cj(b) and i < j --------------------- Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System Terry, Theimer, Petersen, Demers, Spreitzer, Hauser The commit protocol is described in: Flexible Update Propagation for Weakly Consistent Replication Petersen, Spreitzer, Terry, Theimer, Demers Example: A calendar system. Every user has a calendar app on their mobile device. Each entry has a time and a set of participants. We want everyone to end up seeing the same set of entries *eventually*. Don't worry about security/access control for now (everyone may see all entries) Traditional approach: A central server has the only copy. Every user read/write goes to it. Server orders updates from different users. Conflicts are possible; immediately detected and reported by server to user. What's missing from this approach? May not have continuous connectivity. Want to be able to work offline. => No master copy. => All user's phones must sync updates, either directly or via server. Straw man 1: When two phones meet, swap full databases. What to do if there's a conflict? Two meetings at the same time? Can't resolve conflicts at granularity of database. Need more information (which meeting came first?) Insight 1: Maintain an ordered list of updates at each node (called a write log). Swap updates when nodes connect to each other. Apply them. Find and resolve conflicts as they arise. Problem: Nodes may see updates in different orders and/or resolve them in different ways, and end up in different states. Example: A: Schedule meeting X at time t A -> B (B sees X) C: Schedule meeting Y at time t C -> D (D sees Y) B -> C (C sees both X and Y; puts Y first since it already applied it) D -> A (A sees both X and Y; puts X first since it already applied it) A and C think different meetings scheduled at time t! Insight 2: (a) The list order must be total: Given a collection of events, they must get ordered the same way. (b) Conflict resolution must be deterministic. (a), the total order, can be established by timestamping updates in the write log with Lamport clocks. Example revisited: A: <1, A> Schedule meeting X at time t A -> B: B sees <1, A> C: <1, C> Schedule meeting Y at time t C -> D: D sees <1, C> B -> C: C sees <1, C>, <1, A>. These get ordered <1, A>, <1, C>. Since C has already applied <1, C>, it must *undo* that. Then replay <1, A>, <1, C>. D -> A: A sees <1, A>, <1, C>. These get ordered <1, A>, <1, C>. Since A has already applied <1, A>, it just applied <1, C> now. On both A and C, meeting Y is scheduled at time t. Revised requirements. Each node must have: A totally ordered "write log" of events. The current database (obtained by applying a prefix of the write log). An *undo log* to rollback state to any event in the write log. Protocol (summary): On write: Add to the write log. Create undo log entry by recording current state. Make change to database (can be deferred) On sync message received from another node: Merge log entries of other node (merge of two sorted lists; quite easy) Rollback database up to the first newly seen entry (use undo log) Replay write log starting from first new entry, creating undo entries and changing database. On read, read from current database Properties: Unless network is fragmented, every node will *eventually* apply all updates in the same total order. (In other words, assuming no updates for sufficiently long and enough connectivity, all replicas will converge to the same state.) --> Called "eventual consistency" Total order respects order of updates originating at each server node even if clients talk to multiple servers. Reads may see data which - is out of date (not all updates may have been seen by server yet) - may change later due to rollbacks --> Applications/users have to tolerate this (works in many cases). (b), conflict detection and resolution, can be done in several ways: Use vector timestamps (VTs): Coarse-grained dependence information. Must rely on application to propagate dependence. E.g. application reads file f with VT v computes a value based on read to write to file g g's new timestamp must be larger than v but application must track and provide this information Bayou's method: When write applied (write protocol above), allow application-provided conflict detection/resolution code. Advantages: Handle commuting updates more easily E.g., debit amount X, then amount Y is same as debit amount Y, then X. No need to declare conflict on seeing "debit Y" if "debit X" already exists. Check data dependencies without requiring application tracking Conflict resolution must be deterministic so that same updates are applied on all nodes. ---- The design so far requires maintaining an indefinite write log! Idea: If a node has a complete prefix of the log, apply the prefix to the database, call it "committed", and drop it from the log. How does a node determine it has a complete prefix? Straw man: Run some protocol to synchronize all nodes to make them all up to date. Problem: Requires frequent connectivity among all nodes and prevents updates during protocol. Any disconnected node can hold up log truncations on other nodes. What if a node disappears for several days? Bayou's solution: One node is primary. Decides what to commit. At any point, primary chooses to commit any prefix of its write log, and change the total order. Every committed entry given a commit sequence number (CSN) in write-log order. Total order is revised: - All committed updates come first in CSN order. - Then all uncommitted (remaining) updates in clock order. Committed updates are never rolled back, since no update seen later can come before them (by definition of the revised order). In concept, there is no need to keep the committed prefix in the write log or the undo log. --> Any prefix of the committed prefix can be truncated (after applying it) Example: Node C has been disconnected for a while. In the meantime, the primary has committed <2, A>. C now comes back up and presents update <1, C>. In clock order, <1, C> comes before <2, A>, but since <2, A> has been committed, in the revised order, <1, C> comes after <2, A>. Hence, <2, A> is never rolled back and <1, C> must be applied after it (possibly after resolving conflicts due to <2, A>). Revised requirements. Each node maintains: A totally ordered "write log" of events. This has two parts: 1) The committed part which comes first in the order. Represented by a single number, say M, the highest CSN that this node is aware of. Committed updates are applied to the database. There are no entries in the write log or undo log for the committed part, but --> on the side, the node stores a continuous suffix of some (k) committed write entries with CSNs M, M-1, ..., M-k. These are not needed locally, but are transferred to other nodes that haven't seen the commits during syncs. 2) Uncommitted updates in clock order. The current database, all committed updates and any prefix of the uncommitted write log have been applied. An undo log to rollback any of the uncommitted applied updates. Protocol (simplified, abstracted): On write: Add to the uncommitted part of write log. Create undo log entry by recording current state. Make change to database (can be deferred) On sync (getting updates from node A): 1) Is A's M larger than my M? Yes => Ask A for missing committed updates. If A can't supply them all, get A's database snapshot. No => Skip this step 2) Get new uncommitted updates. Rollback up to first newly seen entry using undo log (or start from A's snapshot) Remove write log entries that are now known to be committed (due to (1)). Update M if A's M was larger. Replay write log starting from first new entry, creating undo entries and changing database. On read: Read from current database. A reader may also query whether the read is from a committed update or not. Properties: Same as before. ---- Evaluation *Not* transparent to applications! Requires very strange programming practices. Every "write" is actually a bunch of code, not the new bits. Check conflicts (meeting already scheduled for any attendee?). Resolve conflicts (choose a different unused meeting time?). Works well for humans in many cases. What did we learn? Ordered update log is the ground (committed) truth, not the DB. Logical clocks can be used for causal consistency.