Lecture 14: Paxos
From Paxos Made Simple, by Leslie Lamport, 2001

introduction
  how to build a fault-tolerant service? e.g. lock server
    want it to look to clients like a single server
  replicated state machine, like hypervisor paper
    [many clients, two servers, state, operations]
    if all servers see same sequence of ops
      they will remain replicas, so one can take over from another
    works for deterministic local services: lock, block storage, files
  how to ensure all replicas see same sequence of ops?
    there are many possibilities

primary + backup
  clients send all operations to current primary
  primary chooses order, sends to backup, replies to client
  what if the primary fails?
  how about: backup pings primary, takes over if gets no response from primary
    doesn't work:
      pings lost => two primaries
      pings delayed => two primaries
      partition => two primaries
  basic problem: broken network looks exactly like a dead primary
    but must be treated differently!
  can't afford to have two primaries!
    won't see same operations, won't have same state
    might give the same lock to two different clients!

we'll use Paxos to agree on what nodes are alive
  basic idea: nodes vote, at most one majority of floor(n/2)+1
  can use set of live nodes to agree on who is the primary
  agreed set of live nodes is a "view"

protocol stack:
  RSM
  config service (view change)
  agreement (Paxos)

config service provides "views" and "view change"
  system goes through a sequence of views
    view: view# and set of participants
    ensure agreement on unique successor of each view
  example:
    0, S1, S2, S3
    1, S2, S3
    2, S2, S3, S4
    3, S1, S2, S3, S4
  two uses:
    current view implies current primary
    previous view tells new primary who to get state from

what does config service need from Paxos?
  one or more nodes will see change in set of pingable nodes
  call Paxos("i+1 S1 S2")
    maybe just one such call, maybe more than one
    all same i+1
    maybe all same srv list, maybe different
  eventually Paxos says "decided i+1 S1 S2"
  config srv creates separate Paxos instance for each view change
  so from now on in lecture, agreeing on just a single new view

Paxos
  fault-tolerant agreement protocol
  general-purpose algorithm
  2f+1 servers can survive up to f failures
  one or more servers each propose a value
  Paxos will not ever "agree" on more than one distinct value
  might never agree on anything (if too many failures)
  if net+servers stable long enough, and enough live,
    Paxos will eventually choose one proposed value as agreed value

general Paxos approach
  one (or more) nodes decide to be the proposer
  all participating nodes are acceptors
  proposer contacts acceptors, tries to assemble a majority
  if a majority respond, we're done

why agreement is hard
  two proposers?
  partition (two proposers)?
  proposer crashes?
  proposer crashes just after majority?
    new proposer shouldn't choose different value...

Paxos has two phases
  proposer can't just send "do you promise to commit to this value?"
    can't promise: maybe everyone promises to different value
    have to be able to change mind
  so: prepare, and accept

definition: majority
  floor(n/2)+1 nodes
definition: "chosen"
  a majority has accepted the value, i.e. sent accept_ok
  majority means at most one value can be chosen

exchange:
  proposer                    acceptors
    prepare(n)        ->
                      <-      prepare_ok(n_a, v_a) from majority
    accept(n, v')     ->
                      <-      accept_ok(n) from majority
    decided(v')       ->
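a minimal sketch of these five messages as Python types -- our own illustration,
not from the paper; field names n, n_a, v_a follow the notation used below:

  # message types in the exchange above
  from dataclasses import dataclass
  from typing import Any, Optional

  @dataclass
  class Prepare:            # proposer -> acceptors, phase 1
      n: Any                # proposal number: unique, totally ordered (see "why n?" below)

  @dataclass
  class PrepareOk:          # acceptor -> proposer
      n_a: Optional[Any]    # highest proposal number this acceptor has accepted, if any
      v_a: Optional[Any]    # value it accepted along with n_a, if any

  @dataclass
  class Accept:             # proposer -> acceptors, phase 2
      n: Any
      v: Any                # v': the value the proposer asks acceptors to accept

  @dataclass
  class AcceptOk:           # acceptor -> proposer
      n: Any

  @dataclass
  class Decided:            # proposer -> everyone, once a majority has accepted
      v: Any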
why n?
  may need multiple rounds, e.g. if a proposer crashes
  want later rounds to supersede earlier ones
  numbers allow us to compare early/late
  n values must be unique and roughly follow time
    e.g. n = <time, ID>, where ID can be server's IP address
  "round" is the same as "proposal"
    but completely different from "instance"
    round/proposal numbers are WITHIN a particular instance

the crucial property:
  if a value was chosen, any subsequent choice must be the same value
  i.e. protocol must not change its mind
  maybe a different proposer, but same value!
  tricky b/c "chosen" is system-wide property
    e.g. majority accepts, then proposer crashes
    no node can tell locally that agreement was reached
  so:
    proposer doesn't send out value with prepare
    acceptors send back any value they have already accepted
    if there is one, proposer proposes that
      to avoid changing an existing choice
    if no value already accepted, proposer can propose any value
      (e.g. a client request)
    proposer must get prepare_ok from majority
      to guarantee intersection with majority formed by existing choice
      to guarantee proposer hears of any previously chosen value

now the protocol

proposer(v):
  choose n, unique and higher than any n seen so far
  send prepare(n) to all servers including self
  if prepare_ok(n_a, v_a) from majority:
    v' = v_a with highest n_a; choose own v otherwise
    send accept(n, v') to all
    if accept_ok(n) from majority:
      send decided(v') to all

acceptor state (per instance):
  must persist across reboots
  n_p (# of highest prepare seen)
  n_a, v_a (# of highest accept seen, and associated value)

acceptor prepare(n) handler:
  if n > n_p
    n_p = n
    reply prepare_ok(n_a, v_a)

acceptor accept(n, v) handler:
  if n >= n_p
    n_p = n
    n_a = n
    v_a = v
    reply accept_ok(n)
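the handlers above as a minimal single-machine Python sketch -- our own, not the
paper's; acceptors are local objects, "sending" is a method call, and nothing is
written to disk, though a real acceptor must persist n_p, n_a, v_a before replying:

  class Acceptor:
      def __init__(self):
          self.n_p = None               # highest prepare(n) seen
          self.n_a = None               # highest accept(n, v) seen
          self.v_a = None               # value accepted along with n_a

      def prepare(self, n):
          if self.n_p is None or n > self.n_p:
              self.n_p = n
              return ("prepare_ok", self.n_a, self.v_a)
          return None                   # reject (or just stay silent)

      def accept(self, n, v):
          if self.n_p is None or n >= self.n_p:
              self.n_p = n
              self.n_a = n
              self.v_a = v
              return ("accept_ok", n)
          return None                   # reject

  def run_round(acceptors, n, v):
      """one proposer round; returns the decided value, or None if the round failed"""
      majority = len(acceptors) // 2 + 1

      # phase 1: send prepare(n) to all, collect prepare_ok replies
      oks = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
      if len(oks) < majority:
          return None

      # if any acceptor already accepted a value, re-use the one with highest n_a
      accepted = [(n_a, v_a) for (_, n_a, v_a) in oks if n_a is not None]
      v_prime = max(accepted, key=lambda p: p[0])[1] if accepted else v

      # phase 2: send accept(n, v') to all, count accept_ok replies
      if sum(1 for a in acceptors if a.accept(n, v_prime)) < majority:
          return None
      return v_prime                    # chosen; proposer would now send decided(v')

  if __name__ == "__main__":
      servers = [Acceptor() for _ in range(3)]
      print(run_round(servers, n=1, v="A"))   # A  (example 1 below: normal operation)
      print(run_round(servers, n=2, v="B"))   # A  (must re-use the already-chosen value)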
example 1 (normal operation):
  S1, S2, S3, but S3 is dead or slow
  S1 starts proposal, n=1 v=A
  S1: p1 a1vA dA
  S2: p1 a1vA dA
  S3: dead...
  "p1" means Sx receives prepare(n=1)
  "a1vA" means Sx receives accept(n=1, v=A)
  "dA" means Sx receives decided(v=A)
  these diagrams are not specific about who the proposer is
    it doesn't really matter
    the proposers are logically separate from the acceptors
    we only care about what acceptors saw and replied
  Note Paxos only requires a majority of the servers
    so we can continue even though S3 was down
    proposer must not wait forever for any acceptor's response
  What would happen if network partition?
    I.e. S3 was alive and had a proposed value B
    S3's prepare would not assemble a majority

Question: How does Paxos ensure that the following sequence of events
  can't happen?  What actually happens, and which value is ultimately chosen?
    proposer 1 crashes after sending two accepts
    proposer 2 has a different value in mind
  A: p1 a1foo
  B: p1       p2 a2bar
  C: p1 a1foo p2 a2bar
  C's prepare_ok to B included "foo"
    thus it really would be a2foo, and so no problem
  the point:
    if the system has already reached agreement, majority will know value
    any new majority of prepares will intersect that majority
    so subsequent proposer will learn of already-agreed-on value
      and send it in accept msgs

example 2 (concurrent proposers):
  A1 starts proposing n=10
  A1 sends out just one accept v=10
  A3 starts proposing n=11
    but A1 does not receive its prepare
    A3 only has to wait for prepare_ok from a majority
  A1: p10 a10v10
  A2: p10 p11
  A3: p10 p11 a11v11
  A1 and A3 have accepted different values!
  what will happen?
  what will A2 do if it gets a10v10 accept msg from A1?
    -> ignore it (n=10 < n_p=11)
  what will A1 do if it gets a11v11 accept msg from A3?
    -> accept it (n=11 >= n_p=10), updating n_a/v_a
  what if A3 were to crash at this point (and not restart)?
  how about this:
  A1: p10 a10v10        p12
  A2: p10 p11    a11v11
  A3: p10 p11           p12 a12v10
  has the system agreed to a value at this point?

what's the commit point?
  i.e. exactly when has agreement been reached?
  i.e. at what point would changing the value be a disaster?
  after a majority has the same v_a?
    no -- why not?  above counterexample
  after a majority has the same v_a/n_a?
    yes -- why sufficient?
    sketch:
      suppose majority has same v_a/n_a
      acceptors will reject accept() with lower n
      for any higher n: prepare's must have seen our majority v_a/n_a (overlap)
    what if overlap servers saw prepare(n) before accept(v_a, n_a)?
      they would reject the accept(v_a, n_a)
      thus that majority of v_a/n_a wouldn't have formed yet
      and the proposer might be free to choose v != v_a

why does the proposer need to pick v_a with highest n_a?
  A1: p10 a10vA            p12
  A2: p10       p11 a11vB
  A3: p10       p11 a11vB  p12 a12v??
  n=11 already agreed on vB
  n=12 sees both vA and vB, but must choose vB
  why: two cases:
    1. there was a majority before n=11
       n=11's prepares would have seen value and re-used it
       so it's safe for n=12 to re-use n=11's value
    2. there was not a majority before n=11
       n=11 might have obtained a majority
       so it's required for n=12 to re-use n=11's value

why does prepare handler check that n > n_p?
  it's taking the max of concurrent n's, for the accept handler to use
  responding to all prepare() with prepare_ok() would also be fine,
    but proposers with n < n_p would be ignored by accept() anyway

why does accept handler check n >= n_p?
  required to ensure agreement
  there's a unique highest n active; everyone favors the highest n
  w/o the n >= n_p check, you could get this bad scenario:
  A1: p1 p2 a1vA
  A2: p1 p2 a1vA a2vB
  A3: p1 p2      a2vB

why does accept handler update n_p = n?
  required to prevent earlier n's from being accepted
  node can get accept(n,v) even though it never saw prepare(n)
  without n_p = n, can get this bad scenario:
  A1: p1 a2vB a1vA p3 a3vA
  A2: p1 p2        p3 a3vA
  A3:    p2 a2vB

what if new proposer chooses n < old proposer?
  i.e. if clocks are not synced
  cannot make progress, though no correctness problem

what if an acceptor crashes after receiving accept?
  A1: p1 a1v1
  A2: p1 a1v1 reboot p2 a2v?
  A3: p1             p2 a2v?
  A2 must remember v_a/n_a across reboot!  on disk
    A2 might be the only intersection with the new proposer's majority
    and thus the only evidence that v1 was already agreed on

what if an acceptor reboots after sending prepare_ok?
  does it have to remember n_p on disk?
  if n_p not remembered, this could happen:
  S1: p10 a10v10
  S2: p10 p11 reboot a10v10 a11v11
  S3:     p11               a11v11
  11's proposer did not see value 10, so 11 proposed its own value
  but just before 11's value was accepted, 10 had been chosen!
    b/c S2 forgot n_p=11 and so failed to ignore a10v10

can Paxos get stuck?
  yes, if there is not a majority that can communicate
  how about if a majority is available?
    possible to livelock: dueling proposers, keep prepare'ing higher n's
    one reason to try electing a leader: reduce chance of dueling proposers
    with single proposer and reachable majority, should reach consensus
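a rough sketch of one simple mitigation, short of full leader election -- our own
illustration, not part of Paxos itself: a would-be proposer retries with a higher
n after a randomized delay, so competing proposers tend to fall out of lockstep.
one_round(n, v) stands in for a single prepare/accept round (e.g. run_round above):

  # proposer driver: retry with a higher n after a random delay, to reduce
  # the chance of dueling proposers (this retry policy is not part of Paxos)
  import random
  import time

  def propose(one_round, v, server_id, max_tries=10):
      counter = 0     # a real proposer would start above the highest n it has seen
      for _ in range(max_tries):
          counter += 1
          n = (counter, server_id)            # unique because server ids are unique
          decided = one_round(n, v)
          if decided is not None:
              return decided                  # agreed (possibly on someone else's value)
          time.sleep(random.uniform(0.05, 0.5))   # back off before competing again
      return None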