Lecture 14: Paxos
From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  how to build a fault-tolerant service? e.g. lock server
    want it to look to clients like a single server
  replicated state machine, like hypervisor paper
    [many clients, two servers, state, operations]
    if all servers see same sequence of ops
      they will remain replicas, so one can take over from another
    works for deterministic local services: lock, block storage, files
  how to ensure all replicas see same sequence of ops?
    there are many possibilities

primary + backup
  clients send all operations to current primary
  primary chooses order, sends to backup, replies to client
  what if the primary fails?
  how about: backup pings primary, takes over if gets no response from primary
    doesn't work:
      pings lost => two primaries
      pings delayed => two primaries
      partition => two primaries
  basic problem: broken network looks exactly like a dead primary
    but must be treated differently!
  can't afford to have two primaries!
    won't see same operations, won't have same state
    might give the same lock to two different clients!

we'll use Paxos to agree on what nodes are alive
  basic idea: nodes vote, at most one majority of floor(n/2)+1
  can use set of live nodes to agree on who is the primary
  agreed set of live nodes is a "view"

protocol stack:
  RSM
  config service (view change)
  agreement (Paxos)

config service provides "views" and "view change"
  system goes through a sequence of views
    view: view# and set of participants
    ensure agreement on unique successor of each view
  example:
    0, S1, S2, S3
    1, S2, S3
    2, S2, S3, S4
    3, S1, S2, S3, S4
  two uses:
    current view implies current primary
    previous view tells new primary who to get state from

what does config service need from Paxos?
  one or more nodes will see change in set of pingable nodes
  call Paxos("i+1 S1 S2")
    maybe just one such call, maybe more than one
    all same i+1
    maybe all same srv list, maybe different
  eventually Paxos says "decided i+1 S1 S2"
  config srv creates separate Paxos instance for each view change
  so from now on in lecture, agreeing on just a single new view

Paxos
  fault-tolerant agreement protocol
  general-purpose algorithm
  2f+1 servers can survive up to f failures
  one or more servers each propose a value
  Paxos will not ever "agree" on more than one distinct value
  might never agree on anything (if too many failures)
  if net+servers stable long enough, and enough live,
    Paxos will eventually choose one proposed value as agreed value

general Paxos approach
  one (or more) nodes decide to be the proposer
  all participating nodes are acceptors
  proposer contacts acceptors, tries to assemble a majority
  if a majority respond, we're done

why agreement is hard
  two proposers?
  partition (two proposers)?
  proposer crashes?
  proposer crashes just after majority?
    new proposer shouldn't choose different value...

Paxos has two phases
  proposer can't just send "do you promise to commit to this value?"
    can't promise: maybe everyone promises to different value
    have to be able to change mind
  so: prepare, and accept

definition: majority
  floor(n/2)+1 nodes
definition: "chosen"
  a majority has accepted the value, i.e. sent accept_ok
  majority means at most one value can be chosen

exchange:
  proposer                    acceptors
    prepare(n)        ->
                      <-      prepare_ok(n_a, v_a) from majority
    accept(n, v')     ->
                      <-      accept_ok(n) from majority
    decided(v')       ->
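a minimal sketch of these five messages as Python types -- our own illustration,
not from the paper; field names n, n_a, v_a follow the notation used below:

  # message types in the exchange above
  from dataclasses import dataclass
  from typing import Any, Optional

  @dataclass
  class Prepare:            # proposer -> acceptors, phase 1
      n: Any                # proposal number: unique, totally ordered (see "why n?" below)

  @dataclass
  class PrepareOk:          # acceptor -> proposer
      n_a: Optional[Any]    # highest proposal number this acceptor has accepted, if any
      v_a: Optional[Any]    # value it accepted along with n_a, if any

  @dataclass
  class Accept:             # proposer -> acceptors, phase 2
      n: Any
      v: Any                # v': the value the proposer asks acceptors to accept

  @dataclass
  class AcceptOk:           # acceptor -> proposer
      n: Any

  @dataclass
  class Decided:            # proposer -> everyone, once a majority has accepted
      v: Any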
why n?
  may need multiple rounds, e.g. if a proposer crashes
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique and roughly follow time
    e.g. n = <time, ID>, where ID can be server's IP address
  "round" is the same as "proposal"
    but completely different from "instance"
    round/proposal numbers are WITHIN a particular instance

the crucial property:
  if a value was chosen, any subsequent choice must be the same value
  i.e. protocol must not change its mind
  maybe a different proposer, but same value!
  tricky b/c "chosen" is system-wide property
    e.g. majority accepts, then proposer crashes
    no node can tell locally that agreement was reached
  so:
    proposer doesn't send out value with prepare
    acceptors send back any value they have already accepted
    if there is one, proposer proposes that
      to avoid changing an existing choice
    if no value already accepted, proposer can propose any value
      (e.g. a client request)
    proposer must get prepare_ok from majority
      to guarantee intersection with majority formed by existing choice
      to guarantee proposer hears of any previously chosen value

now the protocol

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n_a, v_a) from majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state (per instance):
  must persist across reboots
  n_p (# of highest prepare seen)
  n_a, v_a (# of highest accept seen, and associated value)

acceptor prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor accept(n, v) handler:
  if n >= n_p
    n_p = n
    n_a = n
    v_a = v
    reply accept_ok(n)
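the handlers above as a minimal single-machine Python sketch -- our own, not the
paper's; acceptors are local objects, "sending" is a method call, and nothing is
written to disk, though a real acceptor must persist n_p, n_a, v_a before replying:

  class Acceptor:
      def __init__(self):
          self.n_p = None               # highest prepare(n) seen
          self.n_a = None               # highest accept(n, v) seen
          self.v_a = None               # value accepted along with n_a

      def prepare(self, n):
          if self.n_p is None or n > self.n_p:
              self.n_p = n
              return ("prepare_ok", self.n_a, self.v_a)
          return None                   # reject (or just stay silent)

      def accept(self, n, v):
          if self.n_p is None or n >= self.n_p:
              self.n_p = n
              self.n_a = n
              self.v_a = v
              return ("accept_ok", n)
          return None                   # reject

  def run_round(acceptors, n, v):
      """one proposer round; returns the decided value, or None if the round failed"""
      majority = len(acceptors) // 2 + 1

      # phase 1: send prepare(n) to all, collect prepare_ok replies
      oks = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
      if len(oks) < majority:
          return None

      # if any acceptor already accepted a value, re-use the one with highest n_a
      accepted = [(n_a, v_a) for (_, n_a, v_a) in oks if n_a is not None]
      v_prime = max(accepted, key=lambda p: p[0])[1] if accepted else v

      # phase 2: send accept(n, v') to all, count accept_ok replies
      if sum(1 for a in acceptors if a.accept(n, v_prime)) < majority:
          return None
      return v_prime                    # chosen; proposer would now send decided(v')

  if __name__ == "__main__":
      servers = [Acceptor() for _ in range(3)]
      print(run_round(servers, n=1, v="A"))   # A  (example 1 below: normal operation)
      print(run_round(servers, n=2, v="B"))   # A  (must re-use the already-chosen value)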
example 1 (normal operation):
  S1, S2, S3, but S3 is dead or slow
  S1 starts proposal, n=1 v=A
  S1: p1 a1vA dA
  S2: p1 a1vA dA
  S3: dead...
  "p1" means Sx receives prepare(n=1)
  "a1vA" means Sx receives accept(n=1, v=A)
  "dA" means Sx receives decided(v=A)
  these diagrams are not specific about who the proposer is
    it doesn't really matter
    the proposers are logically separate from the acceptors
    we only care about what acceptors saw and replied
  Note Paxos only requires a majority of the servers
    so we can continue even though S3 was down
    proposer must not wait forever for any acceptor's response
  What would happen if network partition?
    I.e. S3 was alive and had a proposed value B
    S3's prepare would not assemble a majority

Question: How does Paxos ensure that the following sequence of events
  can't happen?  What actually happens, and which value is ultimately chosen?
    proposer 1 crashes after sending two accepts
    proposer 2 has a different value in mind
  A: p1 a1foo
  B: p1       p2 a2bar
  C: p1 a1foo p2 a2bar
  C's prepare_ok to B included "foo"
    thus it really would be a2foo, and so no problem
  the point:
    if the system has already reached agreement, majority will know value
    any new majority of prepares will intersect that majority
    so subsequent proposer will learn of already-agreed-on value
      and send it in accept msgs

example 2 (concurrent proposers):
  A1 starts proposing n=10
  A1 sends out just one accept v=10
  A3 starts proposing n=11
    but A1 does not receive its prepare
    A3 only has to wait for prepare_ok from a majority
  A1: p10 a10v10
  A2: p10 p11
  A3: p10 p11 a11v11
  A1 and A3 have accepted different values!
  what will happen?
  what will A2 do if it gets a10v10 accept msg from A1?
    -> ignore it (n=10 < n_p=11)
  what will A1 do if it gets a11v11 accept msg from A3?
    -> accept it (n=11 >= n_p=10), updating n_a/v_a
  what if A3 were to crash at this point (and not restart)?
  how about this:
  A1: p10 a10v10        p12
  A2: p10 p11    a11v11
  A3: p10 p11           p12 a12v10
  has the system agreed to a value at this point?

what's the commit point?
  i.e. exactly when has agreement been reached?
  i.e. at what point would changing the value be a disaster?
  after a majority has the same v_a?
    no -- why not?  above counterexample
  after a majority has the same v_a/n_a?
    yes -- why sufficient?
    sketch:
      suppose majority has same v_a/n_a
      acceptors will reject accept() with lower n
      for any higher n: prepare's must have seen our majority v_a/n_a (overlap)
    what if overlap servers saw prepare(n) before accept(v_a, n_a)?
      they would reject the accept(v_a, n_a)
      thus that majority of v_a/n_a wouldn't have formed yet
      and the proposer might be free to choose v != v_a

why does the proposer need to pick v_a with highest n_a?
  A1: p10 a10vA            p12
  A2: p10       p11 a11vB
  A3: p10       p11 a11vB  p12 a12v??
  n=11 already agreed on vB
  n=12 sees both vA and vB, but must choose vB
  why: two cases:
    1. there was a majority before n=11
       n=11's prepares would have seen value and re-used it
       so it's safe for n=12 to re-use n=11's value
    2. there was not a majority before n=11
       n=11 might have obtained a majority
       so it's required for n=12 to re-use n=11's value

why does prepare handler check that n > n_p?
  it's taking the max of concurrent n's, for the accept handler to use
  responding to all prepare() with prepare_ok() would also be fine,
    but proposers with n < n_p would be ignored by accept() anyway

why does accept handler check n >= n_p?
  required to ensure agreement
  there's a unique highest n active; everyone favors the highest n
  w/o the n >= n_p check, you could get this bad scenario:
  A1: p1 p2 a1vA
  A2: p1 p2 a1vA a2vB
  A3: p1 p2      a2vB

why does accept handler update n_p = n?
  required to prevent earlier n's from being accepted
  node can get accept(n,v) even though it never saw prepare(n)
  without n_p = n, can get this bad scenario:
  A1: p1 a2vB a1vA p3 a3vA
  A2: p1 p2        p3 a3vA
  A3:    p2 a2vB

what if new proposer chooses n < old proposer?
  i.e. if clocks are not synced
  cannot make progress, though no correctness problem

what if an acceptor crashes after receiving accept?
  A1: p1 a1v1
  A2: p1 a1v1 reboot p2 a2v?
  A3: p1             p2 a2v?
  A2 must remember v_a/n_a across reboot!  on disk
    A2 might be the only intersection with the new proposer's majority
    and thus the only evidence that v1 was already agreed on

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  if n_p not remembered, this could happen:
  S1: p10 a10v10
  S2: p10 p11 reboot a10v10 a11v11
  S3:     p11               a11v11
  11's proposer did not see value 10, so 11 proposed its own value
  but just before 11's value was accepted, 10 had been chosen!
    b/c S2 forgot n_p=11 and so failed to ignore a10v10

can Paxos get stuck?
  yes, if there is not a majority that can communicate
  how about if a majority is available?
    possible to livelock: dueling proposers, keep prepare'ing higher n's
    one reason to try electing a leader: reduce chance of dueling proposers
    with single proposer and reachable majority, should reach consensus
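a rough sketch of one simple mitigation, short of full leader election -- our own
illustration, not part of Paxos itself: a would-be proposer retries with a higher
n after a randomized delay, so competing proposers tend to fall out of lockstep.
one_round(n, v) stands in for a single prepare/accept round (e.g. run_round above):

  # proposer driver: retry with a higher n after a random delay, to reduce
  # the chance of dueling proposers (this retry policy is not part of Paxos)
  import random
  import time

  def propose(one_round, v, server_id, max_tries=10):
      counter = 0     # a real proposer would start above the highest n it has seen
      for _ in range(max_tries):
          counter += 1
          n = (counter, server_id)            # unique because server ids are unique
          decided = one_round(n, v)
          if decided is not None:
              return decided                  # agreed (possibly on someone else's value)
          time.sleep(random.uniform(0.05, 0.5))   # back off before competing again
      return None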