In labs 7 and
8, you will replicate the lock service using the replicated state machine
approach. See Schneider's RSM paper for a good (but not required) reference.
In the replicated state machine approach, one
machine is the master; the master receives requests from clients and executes
them on all replicas in the same order.
When the master
fails, any of the replicas can take over its job, because they all should have
the same state as the failed master. One of the key challenges is ensuring that
everyone agrees on which replica is the master and which of the slaves are alive,
despite arbitrary sequences of crashes or network partitions. We use Paxos to reach such an agreement.
In this lab, you
will implement Paxos and use it to agree to a sequence of membership changes (i.e.,
view changes). We will implement the replicated lock server in lab 8. We
have modified lock_smain.cc in this lab to start the
RSM instead of the lock server; however, we will not actually replicate
locks until lab 8.
When you have completed
this lab and the next, you will have a replicated state machine that manages a
group of lock servers. You should be able to start new lock servers, which will
contact the master and ask to join the replica group. Nodes can also be removed
from the replica group when they fail. The set of nodes in the group at a
particular time is a view, and each time the view changes, you will run
Paxos to agree on the new view.
The design we have
given you consists of three layered modules. The RSM and config layers make
downcalls to tell the layers below them what to do. The config and Paxos
modules also make upcalls to the layers above them to inform them of
significant events (e.g., Paxos agreed to a value, or a node became
unreachable).
RSM module
The RSM module is in
charge of replication. When a node joins or fails, the RSM module directs the
config module to add or remove the node. The RSM module also runs a recovery
operation on joining nodes to ensure that their state matches the other
replicas. In this lab, the only state to recover is the sequence of Paxos
operations that have been executed. In lab 8, you will extend the RSM
module to replicate the lock service.
config module
The config module is in
charge of view management. When the RSM module asks it to add or remove nodes
from the current view, the config module invokes Paxos to agree on a new view.
The config module also sends periodic heartbeats to ensure that other nodes are
alive, and does an upcall to inform the RSM module if it can't contact some of
the members of the current view.
Paxos module
The Paxos module is in
charge of running Paxos to agree on a value. In principle the value could be
anything, although in this system it is the list of nodes constituting the next
view.
The focus in this lab is on the paxos
module, but you will implement a skeleton of the RSM module to manage nodes
joining and leaving (e.g., because they crashed). Replicating the lock
server will happen in the next lab.
Each module has
threads and internal locks. As described above, a thread may call down through
the layers. For instance, the RSM could tell the config module to add a node,
and the config module tells Paxos to agree to a new view. When Paxos finishes,
a thread will invoke an upcall to inform higher layers of the completion. To
avoid deadlock, we suggest that you use the rule that a module releases its
internal locks before it upcalls, but can keep its locks when calling down.
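As a concrete illustration, the following minimal sketch shows the rule; the names here (cfg_mutex, on_config_event, rsm_upcall) are placeholders for illustration, not the actual identifiers in config.cc or rsm.cc.
#include <pthread.h>

// Placeholder for a handler implemented by the layer above.
void rsm_upcall() { /* e.g., the RSM reacts to a completed view change */ }

pthread_mutex_t cfg_mutex = PTHREAD_MUTEX_INITIALIZER;

void on_config_event()
{
  pthread_mutex_lock(&cfg_mutex);
  // ... update this module's internal state while holding the lock ...
  pthread_mutex_unlock(&cfg_mutex);  // drop our own lock first ...
  rsm_upcall();                      // ... then make the upcall with no locks held
}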
Begin by
initializing your Lab 7 branch with your implementation from Lab 6.
% cd ~/lab
% git commit -am 'my solution to lab6'
Created commit ...
% git pull
remote: Generating pack...
...
% git checkout -b lab7 origin/lab7
Branch lab7 set up to track remote branch refs/remotes/origin/lab7.
Switched to a new branch "lab7"
% git merge lab6
This will add new files, paxos_protocol.h, paxos.{cc,h}, log.{cc,h}, rsm_tester.pl, config.{cc,h}, rsm.{cc,h}, and rsm_protocol.h to your lab/ directory and update the makefile from your
previous lab. It will also incorporate minor changes into your lock_smain.cc to initialize the RSM
module when the lock server starts. Note that since the RSM and the lock server both bind on the same
port, this will actually disable your lock server until lab 8, unless you
change the relevant line in lock_smain.cc back.
The lock server will now take two command-line arguments: the port that the
master binds to, and the port that the lock server you are starting should bind to.
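For example, assuming the master's lock server listens on port 3772 (the port numbers here are only an illustration), you might start the master and then a second replica like this:
% ./lock_server 3772 3772
% ./lock_server 3772 3773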
In rsm.{cc,h}, we have provided you with
code to set up the appropriate RPC handlers and manage recovery in this lab.
You will need to write the code to handle nodes joining and leaving.
In files paxos.{cc,h}, you will find a sketch
implementation of the paxos class that will use the
Paxos protocol to agree on view changes. The file paxos_protocol.h defines the suggested RPC
protocol between instances of Paxos running on different replicas, including
structures for arguments and return types, and marshall code for those
structures. The lion's share of the work in this lab is implementing Paxos.
The files log.{cc,h} provide a full
implementation of a log class, which should be used
by your paxos class to log important
Paxos events to disk. Then, if the node fails and later re-joins, it has some
memory about past views of the system. Please do not make any changes to this
class, as we will use our own original versions of these files during testing.
config.cc maintains views using
Paxos. You will need to understand how it interacts with the Paxos and RSM
layers, but you should not need to make any changes to it for this lab. (You
may do so if you wish, however.)
In the next lab we will test
if the replicated lock service maintains the state of replicated locks
correctly, but in this lab we will just test whether view changes happen correctly.
The lab tester rsm_tester.pl will automatically start
several lock servers, kill and restart some of them, and check that you have
implemented the Paxos protocol and its use correctly.
There are two classes that
together implement the Paxos protocol: acceptor and proposer. Each replica runs both
classes. The proposer class leads the Paxos
protocol by proposing new values and sending requests to all replicas. The acceptor class processes the
requests from the proposer and sends responses.
The config module performs view changes among the set of
participating nodes. The first view of the system is specified manually.
Subsequent view changes rely on Paxos to agree on a unique next view to
displace the current view.
When the system starts from
scratch, the first node creates view 1 containing itself only, i.e. view_1={1}.
When node 2 joins, its RSM module contacts node 1, fetches the latest view
(view 1, with node 1 as the only member), and asks the config module to add node 2 to view 1. The config
module will then use Paxos to propose to the nodes in view_1={1} a new view_2
containing nodes 1 and 2. When Paxos succeeds, view_2 is formed, i.e.,
view_2={1,2}. When node 3 joins, its RSM module will download the last view
from the first node (view 2) and it will then attempt to propose to nodes in
view 2 a new view_3={1,2,3}. And so on.
The config module will also
initiate view changes when it discovers that some nodes in the current view are
not responding. In particular, the node with the smallest id periodically sends
heartbeat RPCs to all others (and all other servers periodically send
heartbeats to the node with the smallest id). If a heartbeat RPC times out, the
config module invokes the proposer to start a new Paxos round that agrees on a view excluding the unresponsive node.
The paxos module keeps track of whether the current view is
stable (using the paxos::stable variable). If the current
view is stable, there are no ongoing Paxos view-change attempts by this node
or others. When the current view is not stable, the node is either initiating the
Paxos protocol or participating in a round initiated by another node.
The paxos module logs
to disk important Paxos events as well as a complete history of all agreed
values. At any time a node can reboot, and when it re-joins, it may be many views
behind. Unless the node brings itself up-to-date on the current view, it won't
be able to participate in Paxos. By remembering all views, the other nodes can
bring this re-joined node up to date.
The Paxos Made Simple paper describes a
protocol that agrees on a value and then terminates. Since we want to run
another instance of Paxos every time there is a view change, we need to ensure
that messages from different instances are not confused. We do this by adding
instance numbers (which are not the same as proposal numbers) to all messages.
Since we are using Paxos to agree on view changes, the instance numbers in our
use of Paxos are the same as the view numbers in the config module.
Paxos can't guarantee that
every node learns the chosen value right away; some of them may be partitioned
or crashed. Therefore, some nodes may be behind, stuck in an old instance of
Paxos while the rest of the system has moved on to a new instance. If a node
gets an RPC request for an old instance, it should ignore the request. A
special RPC response (set oldinstance to true) can inform the
caller that it is behind and tell it what value was chosen for that instance.
Below is the pseudocode for
Paxos. The paxos skeleton class and protocol
contain member variables, RPCs, and RPC handlers corresponding to this code.
Except for the additions to handle instances as described above, it mirrors the
protocol described in the paper.
state:
  n_a, v_a: highest proposal # and its corresponding value this node has accepted
  n_h: highest proposal # seen in a prepare
  my_n: the last proposal # the node has used in this round of Paxos
  instance_h: highest instance we have accepted
  values: a map of past instances to values
  stable: "false" when running Paxos, "true" when this instance completes

on each view change, initialize state:
  n_a = 0
  n_h = 0
  my_n = 0
  v_a = ()  // empty list

run_paxos(nodes, value):
  stable = false
  c_nodes = nodes
  c_value = value
  proceed to Paxos Phase 1

Paxos Phase 1
  a node (or perhaps several nodes) decides to be leader (i.e., manager):
    instance = instance_h+1
    my_n = max(n_h, my_n)+1, append node ID  // unique proposal number
    send prepare(instance, my_n) to all nodes in c_nodes
  if node receives prepare(instance, n):
    if instance <= instance_h:
      return oldinstance(instance, values[instance])
    else if n > n_h:
      n_h = n
      loghigh(n_h)
      return prepareres(n_a, v_a)
    else:
      return reject()

Paxos Phase 2
  if leader gets oldinstance(instance, v):
    values[instance] = v
    instance_h = instance
    stable = true
    paxos_commit(instance, v)  // this instance is done
  else if leader gets reject():
    delay and restart Paxos
  else if leader gets prepareres from a majority of nodes in c_nodes:
    if any prepareres(n_i, v_i) exists such that v_i is not empty:
      v = non-empty value v_i corresponding to highest n_i received
    else leader gets to choose a value:
      v = c_value
    send accept(instance, my_n, v) to all responders
  else:
    stable = true
    paxos_abort()  // this instance is done; app should recover
  if node gets accept(instance, n, v):
    if instance <= instance_h:
      return oldinstance(instance, values[instance])
    else if n >= n_h:
      n_a = n
      v_a = v
      logproposal(instance, n, v)
      return acceptres()
    else:
      return reject()

Paxos Phase 3
  if leader gets acceptres from a majority of c_nodes:
    values[instance] = v
    instance_h = instance
    logvalue(instance, v)
    stable = true
    paxos_commit(instance, v)  // let the invoker know we are done
    send decide(instance, v) to acceptors
  else:
    stable = true
    paxos_abort()
  if node gets decide(instance, v):
    if instance <= instance_h:
      ignore the message  // or reply with oldinstance, but it won't matter
    else:
      values[instance] = v
      instance_h = instance
      logvalue(instance, v)
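To make the acceptor's prepare branch of the pseudocode concrete, here is a rough C++ sketch of it. All type and member names below (the prop_t fields, prepare_reply, acceptor_sketch) are illustrative assumptions and will not match paxos.{cc,h} or paxos_protocol.h exactly; only the control flow mirrors the pseudocode.
#include <map>
#include <string>

// Illustrative proposal number: a counter plus the node's id for tie-breaking.
struct prop_t { unsigned n; std::string m; };
bool operator>(const prop_t &a, const prop_t &b)
{ return a.n > b.n || (a.n == b.n && a.m > b.m); }

// Combined reply: prepare_ok, oldinstance, or reject in one structure.
struct prepare_reply {
  bool oldinstance, accept;
  prop_t n_a;
  std::string v_a;  // also carries the already-chosen value when oldinstance is set
};

class acceptor_sketch {
  unsigned instance_h;                     // highest instance accepted
  prop_t n_h, n_a;                         // highest prepare seen, highest proposal accepted
  std::string v_a;                         // value accepted with n_a
  std::map<unsigned, std::string> values;  // chosen values of past instances
 public:
  acceptor_sketch() : instance_h(0) { n_h.n = 0; n_a.n = 0; }
  prepare_reply prepare(unsigned instance, prop_t n) {
    prepare_reply r = { false, false, n_a, v_a };
    if (instance <= instance_h) {          // caller is behind: report the outcome
      r.oldinstance = true;
      r.v_a = values[instance];
    } else if (n > n_h) {                  // promise not to accept anything below n
      n_h = n;
      // loghigh(n_h) must hit the disk here, before the reply is sent
      r.accept = true;
      r.n_a = n_a;
      r.v_a = v_a;
    }                                      // otherwise both flags stay false: reject
    return r;
  }
};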
For a given instance of
Paxos, potentially many nodes can make proposals, and each of these proposals
has a unique proposal number. When comparing different proposals, the highest
proposal number wins. To ensure that each proposal number is unique, each
proposal consists of a number and the node's identifier. We provide you with a
struct prop_t in paxos_protocol.h that you should use for
proposal numbers; we also provide the > and >= operators for the class.
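For instance, building on the sketched prop_t above (names still illustrative), a proposer might form its next unique proposal number like this:
// Pick a number strictly higher than anything this node has seen or used,
// and tag it with the node's own id so proposals from different nodes never tie.
prop_t next_proposal(const prop_t &n_h, const prop_t &my_n, const std::string &me)
{
  prop_t p;
  p.n = (n_h.n > my_n.n ? n_h.n : my_n.n) + 1;
  p.m = me;
  return p;
}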
At any time a node can
decide it wants to start a view change, and start Paxos off. If nothing goes
wrong, and there are no concurrent proposals for the next view, Paxos clearly
reaches agreement. However, many nodes can become leaders at the same time,
creating conflicts that prevent an agreement from being reached. Thus, we would
like to ensure with good probability that there is only one leader at a time.
To achieve this, each leader delays a random amount of time before beginning
phase 1; furthermore, if a leader learns of a competing proposal with a higher
proposal number for the same view, it will delay for a random amount of time
and then attempt to lead another proposal. In this way, the
system will eventually have only one active leader with high probability.
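A minimal sketch of such a randomized delay (the bounds below are arbitrary choices, not values prescribed by the lab):
#include <stdlib.h>
#include <unistd.h>

// Sleep for a random 100-1000 ms before (re)trying to lead a Paxos round,
// so that competing leaders are unlikely to keep colliding.
void backoff_before_leading()
{
  int ms = 100 + rand() % 900;
  usleep(ms * 1000);
}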
Each replica must log
certain changes to its Paxos state (in particular the n_a, v_a, and n_h fields), as well as log
every agreed value. The provided log class does this for you; please use it without modification, as
the test program depends on its output in a particular format.
In an ideal implementation
of Paxos, the leader would multicast its messages to all the members of the
current view at the same time. To simplify your implementation and make
debugging easier, it's acceptable to send RPCs one at a time. Make sure you add
the extra parameter rpcc::to(1000) to the end of your RPC
calls, or the RPC library will spend a long time attempting to contact crashed
nodes.
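As a sketch, a one-at-a-time prepare round might look roughly like the function below. How you obtain an rpcc* for each node, and the exact RPC number, argument, and result field names, should follow the provided skeleton and paxos_protocol.h; get_client_for() and the field names used here are assumptions for illustration.
// Assumes the lab's rpc library and paxos_protocol.h are included.
static bool
send_prepares(const std::vector<std::string> &nodes, const std::string &me,
              const paxos_protocol::preparearg &arg)
{
  unsigned acks = 0;
  for (unsigned i = 0; i < nodes.size(); i++) {
    rpcc *cl = get_client_for(nodes[i]);  // placeholder: bind an RPC client to nodes[i]
    if (!cl) continue;
    paxos_protocol::prepareres res;
    int ret = cl->call(paxos_protocol::preparereq, me, arg, res, rpcc::to(1000));
    if (ret == paxos_protocol::OK && res.accept)
      acks++;                             // count prepare_ok replies
  }
  return acks > nodes.size() / 2;         // strict majority of c_nodes
}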
The measure of success for this lab is to pass tests 0-7
of rsm_tester.pl. (The remaining tests are reserved for the next lab.) The tester
starts 3 or 4 configuration servers, kills some of them, restarts some of them,
and checks that all servers indeed go through a unique sequence of view changes
by examining their on-disk logs.
% ./rsm_tester.pl 0 1 2 3 4 5 6 7
test1...
...
test2...
...
test3...
...
test4...
...
test5...
...
test6...
...
test7...
...
test8...
...
tests done OK
Important: If rsm_tester.pl fails
in the middle of a test, the remaining lock_server processes are not killed and the log files are not
cleaned up (so you can debug the cause). Make sure you run 'killall lock_server;
rm -f *.log' to clean up the lingering processes and log files before running rsm_tester.pl
again.
We guide you through a
series of steps to get this lab working incrementally.
Implement the Paxos protocol
listed in the pseudocode, log view changes to disk, but do not worry about
failures yet. Also implement the RSM code needed for nodes to join: commit_change() and joinreq(). At the end of this step,
you need only be able to run 'rsm_tester.pl 0'. This test starts three configuration servers one after another
and checks that all servers go through the same three views.
When starting from scratch
(with blank on-disk logs), the first node initializes view 1 to itself (without
going through Paxos) and logs view 1 to disk. When the second node starts, it
also initializes view 1 to the first node (as specified in the constructor's
argument) and logs view 1 to disk. However, since the second node does not find
itself in view 1, it will kick the Paxos thread into action to propose view 2.
And so on for the third node.
Next, fill in the Paxos
implementation. Try to follow the pseudocode provided above, and use the RPC
protocol we provide in paxos_protocol.h. Note that though the
pseudocode shows different types of responses to each kind of RPC, our protocol
combines these responses into one type of return structure. For example, the prepareres struct can act as a prepareres, an oldinstance, or a reject message, depending on the situation.
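For example, after each prepare RPC returns, the proposer inspects the single reply structure to decide which of the three cases it represents; a rough sketch (the field names and the commit_chosen()/record_prepare_ok() helpers are placeholders, not identifiers from the skeleton):
// res is the prepareres returned by one acceptor.
if (res.oldinstance) {
  // This node is behind: adopt the value already chosen for this instance
  // and finish the round immediately.
  commit_chosen(instance, res.v_a);
} else if (res.accept) {
  // A genuine prepare_ok: remember (res.n_a, res.v_a) so Phase 2 can pick
  // the value attached to the highest n_a seen.
  record_prepare_ok(res.n_a, res.v_a);
} else {
  // A reject: someone has seen a higher proposal number; back off and retry later.
}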
Whenever Paxos has
successfully agreed on the new view, log the new view to disk. We have provided
the logview() method (in log.cc) for you to do this.
The log class writes its content to a file in the
current directory called paxos-[port].log. Note that rsm_tester.pl will remove these logs when
a test finishes successfully, unless you comment out the second line of the cleanup() subroutine in the script. rsm_tester.pl also re-directs the stdout
and stderr of your configuration server to lock_server-[arg1]-[arg2].log. You might find these logs
useful for debugging.
Upon completing this step,
you should be able to pass 'rsm_tester.pl 0'.
Next you should handle the
simple failure cases of a single configuration server failing. Recall that when
a failed node is detected, start_paxos() is called to kick the Paxos thread
into action.
Once this works, you should
be able to run 'rsm_tester.pl
0 1 2'.
In addition to logging new
views, modify your Paxos implementation to use the log class to log changes to n_h, and to n_a and v_a, whenever they are updated.
Make sure you understand why these three values must be logged to disk if we want to
restart a previously crashed node correctly. We have provided the code to write
and read logs in log.cc (see log::loghigh() and log::logprop()), so you just have to make
sure to call the appropriate methods at the right times.
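As a rough guide to where those disk writes belong (the exact method signatures are whatever log.{cc,h} declares; the point is only that the write happens before the reply leaves the node):
// In the prepare handler, after raising n_h but before sending the reply:
n_h = n;
l->loghigh(n_h);

// In the accept handler, after accepting (n, v) but before sending the reply:
n_a = n;
v_a = v;
l->logprop(n_a, v_a);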
Now you can run tests that
involve restarting a node after it fails. In particular, you should be able to
pass 'rsm_tester.pl 3 4'. In test 4, rsm_tester.pl
starts three servers, kills the third server (the remaining two nodes should be
able to agree on a new view), kills the second server (the remaining node
tries to run Paxos, but cannot succeed since no majority of nodes is present
in the current view), restarts the third server (which does not help with the
agreement since the third server is not in the current view), kills the third
server, restarts the second server (now agreement can be reached), and finally
restarts the third server.
Finally, you need to verify
that your code handles some of the tricky corner cases that Paxos is supposed
to deal with. Our test scripts do not test all possible corner cases, so you
could still have a buggy Paxos implementation after this step, but you will
have a good feel for the protocol.
In paxos.cc, we provide two methods: breakpoint1() and breakpoint2(). Your Paxos code must call breakpoint1() just after completing Phase
1, but before starting Phase 2. Similarly it must call breakpoint2() in between Phases 2 and 3.
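In other words, inside your proposer's main routine the two calls sit between the phases, roughly like this (the surrounding structure is whatever your own Paxos code looks like):
// ... Phase 1: send prepare() to c_nodes and collect a majority of prepareres ...
breakpoint1();  // the tester may kill this process right here
// ... Phase 2: send accept() and collect a majority of acceptres ...
breakpoint2();  // or right here
// ... Phase 3: send decide() to the acceptors ...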
The tester sends SIGUSR1 or SIGUSR2 to a configuration server to cause it to
exit at the respective breakpoint. (You can try this manually on the command
line with a command like 'kill
-USR1 [pid]',
but rsm_tester.pl also tests the following
cases automatically).
· Test 5: This test starts three nodes and kills the third node. The first
node will become the leader to initiate Paxos, but the test will cause it to
crash at breakpoint 1 (at the end of Phase 1). Then the test will restart the
killed third node, which together with the remaining node should be able to
finish Paxos (ignoring the failed first node) and complete the view change
successfully. The script will verify that the Paxos logs show the correct view
changes.
· Test 6: This test starts four nodes one by one and kills the fourth
node. The first node initiates Paxos as a leader, but the test causes it to
fail at breakpoint 2 (after Phase 2). When the fourth node re-joins the system,
the rest of the nodes should finish agreeing on the view originally proposed by
the first node, before making a new view of their own.
· Test 7: This test is identical to test 6, except that it kills all
the remaining nodes after the first node exits. Then it restarts all slaves and
checks that they first agree on the first node's proposed view before making a
new view of their own.
By now, your code should
reliably pass all required tests, i.e. 'rsm_tester.pl 0 1 2 3 4 5 6 7'.
You will need to email your completed code (without binaries) as a
gzipped tar file to ds-assignment@mpi-sws.org by the deadline stated at the top of the page. To do this, switch
to the source directory and execute these commands:
% tar czvf MATR1-MATR2-lab7.tgz lab/
That should produce a file called
[MATR1-MATR2]-lab7.tgz in that directory, where
MATR1 and MATR2 are the matriculation numbers
of the team members. Attach that file to an email and send it to the address
above with the subject "Assignment 7 - LastName1 LastName2".
You will receive full credit
if your software passes the same tests we gave you when we run your software on
our machines.
Questions or comments
regarding this course? Please use the general course mailing list or
the teaching staff mailing list.