In this series of labs, you will implement a fully functional distributed file server with the architecture described in the overview.
To work correctly, the yfs servers need a locking
service to coordinate updates to the file system structures. In this lab,
you'll implement the lock service.
The core logic of the lock service is quite simple and consists of two modules, the lock client and the lock server, which communicate via RPCs. A client requests a specific lock from the lock server by sending an acquire request. The lock server grants the requested lock to one client at a time. When a client is done with the granted lock, it sends a release request to the server so the server can grant the lock to another client that is waiting for it.
In addition to implementing the lock service, you'll also augment the provided RPC library to ensure at-most-once execution by eliminating duplicate RPC requests. Duplicates arise because the RPC system must re-transmit requests over lossy network connections; such re-transmissions lead to duplicate RPC delivery when the original request turns out not to have been lost, or when the server reboots.
Duplicate RPC delivery, when not handled properly, often violates application semantics. Here's an example of duplicate RPCs causing incorrect lock server behavior. A client sends an acquire request for lock x; the server grants the lock; the client releases it with a release request; a duplicate of the original acquire request then arrives at the server, and the server grants the lock again. Since the second acquire was just a duplicate, the client will never send another release, so the lock stays held forever. Such behavior is clearly incorrect.
The files you will
need for this and subsequent lab assignments in this course are distributed
using the Git version control system. To
learn more about Git, take a look at the Git
user's manual, or, if you are already familiar with other version control
systems, you may find this CS-oriented
overview of Git useful. To install the files, you need to clone the course repository by running the commands below. You must have the Git client installed in order to access the repository.
linux% git clone https://gitlab.mpi-sws.org/ds-ws16/yfs-lab.git lab
Cloning into 'lab'...
Checking connectivity... done.
linux% cd lab
linux%
Git allows you to keep track
of the changes you make to the code. For example, if you are finished with one
of the exercises, and want to checkpoint your progress, you can
linux% git commit -am 'my solution for lab1 exercise9'
Created commit 60d2135: my solution for lab1 exercise9
1 files changed, 1 insertions(+), 0 deletions(-)
linux%
You can keep track of your
changes by using the git diff command. Running git
diff will display the changes to your code since your last commit,
and git diff origin/lab1 will display the changes relative to the initial code supplied
for this lab. Here, origin/lab1 is the name of the git
branch with the initial code you downloaded from our server for this
assignment.
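For example, to see your uncommitted changes and to compare your code against the handed-out version:
linux% git diff
linux% git diff origin/lab1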
In Lab 1, we provide you with a skeleton RPC-based lock server, a
lock client interface, a sample application that uses the lock client
interface, and a tester. Now compile and start up the lock server, giving it a port number on which to listen for RPC requests. You'll need to choose a port
number that other programs aren't using. For example:
% cd lab
% make
% ./lock_server 3772
Now open a second terminal on the same machine and run lock_demo, giving it the port number on which the server is listening:
% cd lab
% ./lock_demo 3772
stat returned 0
The server process in the first terminal should now output
messages similar to this:
stat request from clt 1450783179
lock_demo asks the server for the
number of times a given lock has been acquired, using the stat RPC that we have provided. In the skeleton
code, this will always return 0. You can use it as an example of how to add
RPCs. You don't need to fix stat to report the actual number of acquisitions of the given lock in
this lab, but you may if you wish.
The lock client skeleton
does not do anything yet for the acquire and release operations; similarly, the
lock server does not implement any form of lock granting or releasing. Your job
in this lab is to fill in the client and server functions and the RPC protocol
between the two processes.
Your first job is to
implement a correct lock server assuming a perfect underlying network. In the context of a lock
service, correctness means obeying this invariant: at any point in time,
there is at most one client holding a lock of a given name.
We will use the program lock_tester to check the correctness
invariant, i.e. whether the server grants each lock just once at any given
time, under a variety of conditions. You run lock_tester with the same arguments as lock_demo. A successful run of lock_tester (with a correct lock
server) will look like this:
% ./lock_tester 3772
simple lock client
acquire a release a acquire a release a
acquire a acquire b release b release a
test2: client 0 acquire a release a
test2: client 2 acquire a release a
. . .
./lock_tester: passed all tests successfully
If your lock server isn't correct, lock_tester will print an error message. For example, if lock_tester complains "error: server granted a twice!", the problem
is probably that lock_tester sent two simultaneous requests for the same lock, and the server
granted the lock twice (once for each request). A correct server would have
sent one grant, waited for a release, and only then sent a second grant.
Your second job is to
augment the RPC library to guarantee at-most-once execution. We simulate
lossy networks on a local machine by setting the environment variable RPC_LOSSY. If you can pass both the
RPC system tester and the lock_tester, you are done. Here's a
successful run of both testers:
% export RPC_LOSSY=0
% ./rpc/rpctest
simple test
. . .
rpctest OK
% killall lock_server
% export RPC_LOSSY=5
% ./lock_server 3772 &
% ./lock_tester 3772
simple lock client
acquire a release a acquire a release a
. . .
./lock_tester: passed all tests successfully
For this lab, your lock server and RPC augmentation must pass both rpctest and lock_tester; you should make sure they pass several times in a row to catch rare bugs. You should only modify the files rpc.{cc,h}, lock_client.{cc,h}, lock_server.{cc,h} and lock_smain.cc. We will test your code with our own copy of the rest of the source files and testers. You are free to add new files to the directory as long as the Makefile compiles them appropriately, but you should not need to.
For this lab, you will not
have to worry about server failures or client failures. You also need not be
concerned about security issues such as malicious clients releasing locks that they
don't hold.
In principle, you can implement whatever design you like as long
as your implementation satisfies all requirements in the "Your Job"
section and passes the tester. To be nice, we provide detailed guidance and
tips on a recommended implementation plan. You do not have to follow our
recommendations, although doing so makes your life easier and allows maximal
design/code re-use in later labs. Since this is your first lab, you should also read the general programming tips on the lab overview page.
You are encouraged to follow the implementation directions, because in later labs you will add caching of locks at the client, and the recommended design lends itself to supporting caching easily.
First, you should get the lock_server running correctly without
worrying about duplicate RPCs under lossy networks.
The RPC library's source code is in the subdirectory rpc/. To use it, the lock_server creates an RPC server object (rpcs) listening on a port and registers various RPC handlers (see an example in lock_smain.cc). The lock_client creates an RPC client object (rpcc), binds it to the lock_server's address (127.0.0.1) and port, and invokes RPC calls (see an example in lock_client.cc).
Each RPC procedure is
identified by a unique procedure number. We have defined the acquire and release RPC numbers you will need in lock_protocol.h. Other RPC numbers defined there are for use in later labs. Note that you must still
register handlers for these RPCs with the RPC server object.
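As a rough sketch, registering the new handlers in lock_smain.cc might look like the following. This is modeled on the existing stat registration; the exact rpcs constructor arguments and reg() signature should be checked against rpc/rpc.h and the provided code.

// Sketch only: mirrors the existing stat registration in lock_smain.cc.
lock_server ls;
rpcs server(atoi(argv[1]));   // listen on the port given on the command line
server.reg(lock_protocol::stat, &ls, &lock_server::stat);
server.reg(lock_protocol::acquire, &ls, &lock_server::acquire);  // to be added by you
server.reg(lock_protocol::release, &ls, &lock_server::release);  // to be added by you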
You can learn how to use the RPC system by studying the given stat call implementation across lock_client and lock_server. All RPC procedures have a standard interface with x+1 arguments (x must be less than 6) and an integer return value (see the lock_server::stat function for an example). The last argument, a reference to an arbitrary type, is always there so that an RPC handler can use it to return results (e.g. lock_server::stat returns the number of acquires for a lock). The RPC handler also returns an integer status code; the convention is to return zero for success and positive numbers for various errors. If the RPC fails in the RPC library (e.g. timeouts), the RPC client gets a negative return value instead. The various reasons for RPC failures in the RPC library are defined in rpc.h under rpc_const.
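For instance, the client side of an acquire RPC might look roughly like the sketch below. The argument list simply mirrors the provided stat example in lock_client.cc (whether acquire needs a client id is up to your protocol), so treat the details as assumptions rather than a required interface.

// Sketch of a client-side RPC call, mirroring lock_client::stat.
int r;   // dummy result slot; acquire doesn't need a meaningful return value
lock_protocol::status ret = cl->call(lock_protocol::acquire, cl->id(), lid, r);
if (ret < 0) {
  // failure inside the RPC library (see rpc_const in rpc.h), e.g. a timeout
} else if (ret != lock_protocol::OK) {
  // application-level error code returned by the server's handler
}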
The RPC system must know how to marshall arbitrary objects into a stream of bytes to transmit over the network and unmarshall them at the other end. The RPC library already provides marshall/unmarshall methods for standard C++ objects such as std::string, int, and char (see file rpc.cc). If your RPC call includes arguments of other types, you must provide your own marshalling methods. You should be able to complete this lab with the existing marshall/unmarshall methods. Beware that marshalling is done manually and without any compile-time type checking; therefore, you need to be extra careful to check by hand that the client-side RPC call's interface matches the corresponding server-side RPC handler's interface.
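For illustration only, here is what the pattern looks like if you ever did need to marshall a type of your own; the struct is hypothetical, and the operator-overloading style is copied from the built-in types in rpc/rpc.cc.

// Hypothetical argument type; not part of the handout.
struct lock_request {
  int clt;            // client id
  std::string owner;  // made-up extra payload
};

// Overloads follow the pattern used for the standard types in rpc/rpc.cc.
// The field order must be identical on the marshalling and unmarshalling sides.
marshall &operator<<(marshall &m, const lock_request &req) {
  m << req.clt << req.owner;
  return m;
}

unmarshall &operator>>(unmarshall &u, lock_request &req) {
  u >> req.clt >> req.owner;
  return u;
}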
The lock server can manage many distinct locks. Each lock is identified by an integer of type lock_protocol::lockid_t. The set of locks is open-ended: if a client asks for a lock that the server has never seen before, the server should create the lock and grant it to the client. When multiple clients request the same lock simultaneously, the lock server must grant the lock to the clients one at a time.
You will need to modify the
lock server skeleton implementation in files lock_server.{cc,h} to accept acquire/release RPCs from the lock client,
and to keep track of the state of the locks. Here is our suggested
implementation plan.
On the server, a lock can be in one of two states: free (no client holds the lock) or locked (some client holds the lock).
The RPC handler for acquire first checks if the lock is locked, and if so,
the handler blocks until the lock is free. When the lock is free, acquire changes its state to locked, then returns to the client, which indicates
that the client now has the lock. The value r returned by acquire doesn't matter.
The handler for release changes the lock state to free, and notifies
any threads that are waiting for the lock.
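Putting this plan together, a minimal server-side sketch could look like the code below. The handler signatures are modeled on the provided stat handler, the state table and synchronization variables would really be members of lock_server, and all names are illustrative rather than required.

// Illustrative sketch of the suggested plan; the real code goes in lock_server.{cc,h}.
enum lock_state { FREE = 0, LOCKED };            // unseen lock ids default to FREE

std::map<lock_protocol::lockid_t, lock_state> locks;
pthread_mutex_t lock_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  lock_cond  = PTHREAD_COND_INITIALIZER;

lock_protocol::status
lock_server::acquire(int clt, lock_protocol::lockid_t lid, int &r)
{
  pthread_mutex_lock(&lock_mutex);
  while (locks[lid] == LOCKED)                   // re-check the predicate after every wake-up
    pthread_cond_wait(&lock_cond, &lock_mutex);
  locks[lid] = LOCKED;                           // the caller now holds the lock
  pthread_mutex_unlock(&lock_mutex);
  r = 0;                                         // the value of r doesn't matter
  return lock_protocol::OK;
}

lock_protocol::status
lock_server::release(int clt, lock_protocol::lockid_t lid, int &r)
{
  pthread_mutex_lock(&lock_mutex);
  locks[lid] = FREE;
  pthread_cond_broadcast(&lock_cond);            // wake threads blocked in acquire
  pthread_mutex_unlock(&lock_mutex);
  r = 0;
  return lock_protocol::OK;
}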
The class lock_client is a
client-side interface to the lock server (found in files lock_client.{cc,h}). The interface provides acquire() and release() functions that are supposed
to take care of sending and receiving RPCs. Multiple threads in the client
program can use the same lock_client object and request the same
lock name. See lock_demo.cc for an example of how an
application uses the interface. Note that a basic requirement of the client
interface is that lock_client::acquire must not return until it
has acquired the requested lock.
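On the client side, under the Step One assumption of a reliable network, the wrapper can simply forward to the server and rely on the blocking RPC: the call does not return until the server's handler has granted the lock. Again, the exact signature below just mirrors the provided stat wrapper and is only a sketch.

// Sketch of lock_client::acquire: block inside the RPC until the lock is granted.
lock_protocol::status
lock_client::acquire(lock_protocol::lockid_t lid)
{
  int r;
  return cl->call(lock_protocol::acquire, cl->id(), lid, r);
}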
Both lock_client and lock_server's functions will be invoked
by multiple threads concurrently. In particular, the RPC library always
launches a new thread to invoke the RPC handler at the RPC server. Many
different threads might also call lock_client's acquire() and release() functions simultaneously.
To protect access to shared
data in the lock_client and lock_server, you need to use pthread mutexes.
Please refer to the general tips for programming
using threads. As seen from the suggested implementation plan, you also need to
use pthread condition variables to synchronize the actions among multiple
threads. Condition variables go hand-in-hand with mutexes; please see here for more details on programming with pthreads.
For robustness, when using condition variables, it is recommended that when a thread that waited on a condition variable wakes up, it checks a boolean predicate associated with the wake-up condition. This protects against spurious wake-ups from the pthread_cond_wait() and pthread_cond_timedwait() functions. For example, the suggested logic described above lends itself to such an implementation (note how a thread that wakes up re-checks the state of the lock before proceeding).
In this and later labs, we
try to adhere to a simple (coarse-grained) locking convention: we acquire the
subsystem/protocol lock at the beginning of a function and release it before
returning. This convention works because we don't require atomicity across
functions, and we don't share data structures between different
subsystems/protocols. You will have an easier life by sticking to this
convention.
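One convenient way to follow this convention in C++ is a small RAII guard that locks a mutex in its constructor and unlocks it in its destructor, so every return path releases the lock automatically. The helper below is purely illustrative (it assumes the same lock_mutex name as the earlier server sketch); calling pthread_mutex_lock/unlock by hand works just as well.

// Illustrative RAII helper; not part of the handout.
class scoped_lock {
  pthread_mutex_t *m_;
 public:
  scoped_lock(pthread_mutex_t *m) : m_(m) { pthread_mutex_lock(m_); }
  ~scoped_lock() { pthread_mutex_unlock(m_); }
};

lock_protocol::status
lock_server::stat(int clt, lock_protocol::lockid_t lid, int &r)
{
  scoped_lock guard(&lock_mutex);   // held for the whole function body
  r = 0;                            // e.g. consult per-lock counters here
  return lock_protocol::OK;
}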
After your lock server has passed lock_tester under a perfect network, enable RPC_LOSSY by typing "export RPC_LOSSY=5", restart your
lock_server and try lock_tester again. If you implemented lock_server in the simple way described above, you will see lock_tester fail (or hang
indefinitely). Try to understand exactly why your lock_tester fails when
re-transmissions cause duplicate RPC delivery.
Read the RPC source code in rpc/rpc.{cc,h} and first try to grasp the overall structure of the RPC library as much as possible by yourself, without reading the hints below.
The rpcc class implements the RPC client's functionality. At its core lies the rpcc::call1 function, which accepts a marshalled RPC request for transmission to the RPC server. We can see that call1 attaches additional RPC fields to each marshalled request:
// add RPC fields before the RPC request data
req_header h(ca.xid, proc, clt_nonce_, srv_nonce_, xid_rep_window_.front());
req.pack_req_header(h);
What's the purpose of each field in req_header? (Hint: many of them are going to help you implement at-most-once delivery.) After call1 has finished preparing the final RPC request, it sits in a "while(1)" loop, repeatedly updating the timeout value for the next retransmission and waiting for the corresponding RPC reply or timeout to happen. Also, if the underlying (TCP) connection to the server fails, rpcc automatically re-connects to the server (in function get_refconn) in order to perform retransmissions.
The rpcs class implements the RPC server's functionality. When one of the underlying connections has received an RPC request message, the function rpcs::got_pdu is invoked to dispatch the RPC request to a thread pool. The thread pool (class ThrPool) consists of a fixed number of threads that execute the rpcs::dispatch function to dispatch an RPC request to the registered RPC handler. The dispatch function extracts various RPC fields from the request. These fields include the RPC procedure number, which is used to find the corresponding handler. They also provide sufficient information for you to ensure that the server can eliminate all duplicate requests.
Question: Our suggested
implementation of lock server uses "blocking" RPC handlers, i.e. the
server-side RPC handler functions can be blocked waiting for some external
events from the clients. With "blocking" RPC handlers, how many
concurrent "blocking" lock acquire requests can the server handle?
(Hint: our implementation of rpcs currently uses a thread pool of 10 threads).
How to ensure at-most-once
delivery? A strawman approach is to make the server remember all unique RPCs
ever received. Each unique RPC is identified by both its xid (unique across a client instance) and clt_nonce (unique across all client
instances). In addition to the RPC ids, the server must also remember the actual
values of their corresponding replies so that it can re-send the (potentially
lost) reply upon receiving a duplicate request without actually executing the
RPC handler. This strawman guarantees at-most-once, but is not ideal since the
memory holding the RPC ids and replies can grow indefinitely. A better
alternative is to use a sliding window of remembered RPCs at the server. Such
an approach requires the client to generate xids in a strict sequence, i.e. 0, 1, 2, 3, ... When can the server safely forget about a received RPC and its response, i.e. slide the window forward?
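For concreteness, the per-client bookkeeping for the sliding-window approach might be organized roughly as follows; every name and type here is illustrative, and the real state belongs inside rpcs in rpc/rpc.cc.

// Illustrative per-client state for at-most-once delivery.
struct remembered_reply {
  bool done;          // has the handler finished executing?
  std::string buf;    // the marshalled reply, re-sent if a duplicate arrives
};

// clt_nonce -> (xid -> remembered reply) for requests still inside the window.
std::map<unsigned int, std::map<unsigned int, remembered_reply> > reply_window;

// Entries can be discarded once the client indicates, via the request-header
// fields you saw in call1, that it has received the corresponding replies;
// that is what lets the window slide forward.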
Once you figure out the
basic design for at-most-once delivery, go ahead and realize your
implementation in rpc.cc (rpc.cc is the only file you should be modifying).
Hints: you need to add code in two places: rpcs::add_reply to remember the RPC reply values, and rpcs::checkduplicate_and_update to eliminate duplicate xids and update the appropriate information so that the server can safely forget about certain received RPCs.
After you are done with step
two, test your RPC implementation with ./rpc/rpctest and RPC_LOSSY set to 0 ("export RPC_LOSSY=0"). Make sure ./rpc/rpctest passes all tests. Once your
RPC implementation passes all these tests, test your lock server again in a
lossy environment by restarting your lock_server and lock_tester after setting RPC_LOSSY to 5 ("export RPC_LOSSY=5").
You will need to email your completed code (without binaries) as a
gzipped tar file to ds-assignment@mpi-sws.org by the deadline stated at the top of the page. To do this, switch to the directory containing your lab directory and execute this command:
% tar czvf MATR1-MATR2-lab1.tgz lab/
That should produce a file called MATR1-MATR2-lab1.tgz in that directory, where MATR1 and MATR2 are the
matriculation numbers of the team members. Attach that file to an email and
send it to the address above with the subject "Assignment 1 - LastName1
LastName2".
You will receive full credit
if your software passes the same tests we gave you when we run your software on
our machines.
Questions
or comments regarding this course? Please use the general course mailing list or
the teaching staff mailing list.