Lecture 1: Introduction and lab overview

COURSE STRUCTURE

All announcements and info at http://courses.mpi-sws.org/ds-ws16/
  look at the first lab, due on Nov 11
Meetings: lectures, paper discussions
Office hours/tutorials: clarifications, project help
Readings: research papers as case studies, starting with the 2nd lecture
  please read papers before class; otherwise you lack context,
  and you can't pick it up by listening
Exams: midterm, term-end, repeat exam
Project: build an increasingly sophisticated fault-tolerant file service
  (similar to Frangipani)
TAs: Viktor Erdelyi, Arpan Gujarati, Riju Sen
  office hours are shown on the Web

INTRODUCTION

What is a distributed system?
  multiple connected computers that cooperate to provide some service
  examples: DNS, network file service, MapReduce/Hadoop, Dropbox,
    Bitcoin, BitTorrent, Tor, etc.

Why distribute?
  to connect physically distributed entities and people
  to achieve security via physical isolation
  to tolerate faults via replication at separate sites
  to scale up throughput via parallel CPUs/mem/disk/net
But:
  complex, many concurrent components
  new classes of problems, e.g. partial failure, unsynchronized clocks
  tricky to realize the performance/resilience potential
Lamport: "A distributed system is one in which the failure of a computer
  you didn't even know existed can render your own computer unusable."
Advice: don't distribute if a central system will work.

Why take this course?
  interesting -- hard problems, non-obvious solutions
  active research area -- lots of progress + big unsolved problems
  most real systems are distributed -- unlike 20 years ago
    driven by the rise of big Web services and mobile/pervasive computing
  hands-on -- you'll build a real system in the labs

MAIN TOPICS

Example: a shared file system, so users can cooperate -- like Dropbox
  but this lecture isn't about Dropbox specifically
  just an example; the goal is to get a feel for distributed system problems
  lots of client computers
  [diagram: clients, network, vague set of servers]

Topic: architecture
  Choice of interfaces:
    monolithic file server?
    block server(s) -> FS logic in clients?
    separate naming + file servers?
    separate FS + block servers?
  Single machine room or unified wide-area system?
    wide-area is dramatically more difficult
  Client/server or peer-to-peer?
  Interacts w/ performance, security, fault behavior.

Topic: implementation
  How do clients/servers communicate?
    direct network communication is pretty painful
    want to hide network stuff from application logic
  Most systems organize distribution with some structuring framework(s):
    remote procedure calls (RPC), remote object invocation (RMI),
    distributed shared memory (DSM), MapReduce, etc.
    (an RPC sketch follows the next topic)

Topic: performance
  Distribution can hurt: network b/w and latency bottlenecks
    lots of tricks, e.g. caching, threaded servers
  Distribution can help: parallelism, pick a server near the client
  Idea: scalable design
    Nx servers -> Nx total performance
    need a way to divide the load by N == divide the state by N
      split by user
      split by file name
      "sharding" or "partitioning" (see the sketch below)
    rarely perfect -> only scales so far
      global operations, e.g. search
      load imbalance:
        one very active user
        one very popular file
        -> one server at 100%, added servers mostly idle
        -> Nx servers -> 1x performance
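To make the "hide network stuff" point concrete, here is a minimal sketch of
the client side of an RPC call. All names (rpc_client, call, the GET
procedure number, the server address) are made up for illustration; the
course's own RPC library, which the labs use, is covered in the next two
lectures.

    #include <iostream>
    #include <string>

    // Hypothetical client stub.  A real library would marshal the
    // arguments into a message, send it over the network, and block
    // for the reply; here the network step is faked so the sketch runs.
    class rpc_client {
    public:
        int call(int proc, const std::string &arg, std::string &ret) {
            // real version: serialize (proc, arg), send, recv, deserialize
            ret = "reply to proc " + std::to_string(proc) + "(" + arg + ")";
            return 0;  // real version: nonzero on timeout or connection loss
        }
    };

    int main() {
        rpc_client cl;  // real version: rpc_client cl("server-host", port);
        std::string data;
        // Reads like a local function call, but the work happens on the
        // server -- and unlike a local call, it can fail or time out:
        if (cl.call(/*GET*/ 1, "/grades.txt", data) != 0)
            return 1;
        std::cout << data << "\n";
    }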
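And a minimal sketch of sharding by file name (server names are
hypothetical): hashing spreads files roughly evenly, so each of the N
servers holds ~1/N of the files and sees ~1/N of the requests.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hash-partition ("shard") files across servers: the hash of the
    // file name picks which server stores and serves that file.
    std::string pick_server(const std::string &fname,
                            const std::vector<std::string> &servers) {
        size_t h = std::hash<std::string>{}(fname);
        return servers[h % servers.size()];
    }

    int main() {
        std::vector<std::string> servers = {"fs0", "fs1", "fs2"};
        std::cout << pick_server("/grades.txt", servers) << "\n";
    }

Note that this doesn't fix load imbalance: one very popular file still maps
to a single server, no matter how many servers you add.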
Topic: fault tolerance
  Dropbox: ~10,000 servers; some fail
  http://www.datacenterknowledge.com/archives/2013/10/23/how-dropbox-stores-stuff-for-200-million-users/
  Can I use my files if there's a failure
    of some part of the network, or of some set of servers?
  Maybe: replicate the data on multiple servers
    perhaps the client sends every operation to both
    maybe it only needs to wait for one reply
    (see the replication sketch after the next topic)
  Opportunity: operate from the two "replicas" independently if partitioned?
  Opportunity: can 2 servers yield 2x availability AND 2x performance?

Topic: consistency
  == a contract w/ apps/users about the meaning of operations
    e.g. "a read yields the most recently written value"
  hard due to partial failure, replication/caching, concurrency
  Problem: keeping replicas identical
    if one is down, it will miss operations
      must be brought up to date after reboot
    if the net is broken, *both* replicas may be live, and see different ops
      delete a file, it's still visible via the other replica
      "split brain" -- usually bad
  Problem: clients may see updates in different orders
    due to caching or replication
    I make grades.txt unreadable, then the TA writes grades to it
    what if the operations run in different orders on different replicas?
  Consistency often hurts performance (communication, blocking)
    many systems cut corners -- "relaxed consistency"
    shifts the burden to applications
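A sketch of the "send every operation to both replicas" idea from the
fault-tolerance topic, with a faked send() standing in for a real RPC
(replica names and function names are made up):

    #include <future>
    #include <iostream>
    #include <string>

    // Stand-in for an RPC to one replica; pretend the operation succeeded.
    bool send(std::string replica, std::string op) {
        return true;
    }

    // Send every operation to both replicas, issuing the two RPCs
    // concurrently.  A client that proceeds after the *first* ack gets
    // lower latency, but the slower replica silently falls behind; and
    // under a network partition each replica can accept different
    // operations ("split brain") -- exactly the consistency problems
    // described above.
    bool do_op(const std::string &op) {
        auto a = std::async(std::launch::async, send, "replica-A", op);
        auto b = std::async(std::launch::async, send, "replica-B", op);
        return a.get() && b.get();   // conservative: wait for both acks
    }

    int main() {
        std::cout << do_op("put /grades.txt") << "\n";
    }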
LAB: YET ANOTHER FILE SYSTEM (YFS)

The lab is inspired by Frangipani, a scalable distributed file system
(see paper). You will build a simplified version of Frangipani.

Frangipani goals:
  aggregate many disks into a single shared file system
  capacity can be added incrementally, without stopping the system
    bricks for storage
  good balance of load/users across disks
    no manual assignment
  tolerates and recovers from machine, network, and disk failures
    without operator intervention
  consistent backups while the system is running

Frangipani design:
  Each machine runs a Petal server, a Frangipani server, and a lock server
    not strictly necessary, but it simplifies thinking about the system
  Petal: aggregates disks into one big virtual disk
    interface: put and get
    replicates blocks/extents
    add Petal servers to increase storage capacity and throughput
  Frangipani: file server
    serves file system requests, using the shared Petal to store data
      inodes, directories, and data blocks are all stored on Petal
    all servers serve the same file system
    add servers to scale up Frangipani
  Lock server:
    file servers use the lock server to provide consistency
      e.g., when creating a file, lock the directory, etc.
    locks are also replicated
  No security beyond traditional file system security
    intended for a cluster, not the wide area

Contrast: NFS (or AFS)
  Clients relay file system calls to the server
    clients are simple; they don't implement a file system
    for performance, a client may implement a cache
      (or use the OS file system cache)
  Server implements the file system and exports its local disks
  How do you scale an NFS server?
    buy more disks
    partition file systems over servers
    or buy a file server "mainframe"
  How do you make an NFS server fault tolerant?
    buy a RAID system or NAS

YFS
  Same basic structure as Frangipani, but a single extent server
    i.e. you don't have to implement Petal
  Draw picture:
    yfs_client (interfaces with the OS through FUSE)
    extent_server
    lock_server
  Each server is written in C++ with pthreads, using our own RPC library
  The next two lectures cover this infrastructure in detail

Labs: build YFS incrementally
  L1: simple lock server
    RPC semantics
    programming with threads
  L2: yfs_server
    basic file server (no sharing, no storage)
  L3: yfs_client + extent_server
    reading/writing
  L4: yfs_client + extent_server + lock_server
    mkdir, unlink, and use the lock server (sharing)
  L5: caching lock_server
    protocol design
  L6: caching extent_server
    consistency using the lock_server
  L7: Paxos library
    agreement protocol
  L8: fault-tolerant lock server
    replicated state machine using Paxos

Lab 1: simple lock server
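As a preview of Lab 1, here is a minimal sketch of a lock server's core
logic in C++ with pthreads, with the RPC layer omitted. The class name,
method names, and lock-id type are guesses for illustration, not the lab
handout's actual interface.

    #include <pthread.h>
    #include <map>

    // One mutex protects the table of lock states; a condition variable
    // lets acquire() block until the requested lock becomes free.
    class lock_server {
        pthread_mutex_t m;
        pthread_cond_t  c;
        std::map<unsigned long long, bool> held;  // lock id -> held?
    public:
        lock_server() {
            pthread_mutex_init(&m, nullptr);
            pthread_cond_init(&c, nullptr);
        }
        void acquire(unsigned long long lid) {
            pthread_mutex_lock(&m);
            while (held[lid])                 // wait until the lock is free
                pthread_cond_wait(&c, &m);
            held[lid] = true;
            pthread_mutex_unlock(&m);
        }
        void release(unsigned long long lid) {
            pthread_mutex_lock(&m);
            held[lid] = false;
            pthread_cond_broadcast(&c);       // wake any blocked acquirers
            pthread_mutex_unlock(&m);
        }
    };

In the lab, acquire and release are invoked via RPC rather than called
directly, which is where the RPC semantics questions come in.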