Lecture 1: Introduction and lab overview

COURSE STRUCTURE

All announcements and info at http://courses.mpi-sws.org/ds-ws18/
  look at the first lab, due on Nov 2
Meetings: lectures, paper discussions
Office hours/tutorials: clarifications, project help
Readings: research papers as case studies, starting with the 2nd lecture
  please read the papers before class; otherwise you will lack context,
  and you won't be able to pick it up just by listening
Midterm, final, repeat exam
Project: build an increasingly sophisticated fault-tolerant file service
  (similar to Frangipani)
  deeper understanding of core techniques
  experience with distributed programming
TAs: Mohamed Alzayat, Malte Appel, Roberta de Viti
  office hours will be shown on the Web

INTRODUCTION

What is a distributed system?
  multiple cooperating computers providing a service
  examples: DNS, e-mail, scalable/distributed storage, MapReduce/Hadoop,
    Dropbox, Bitcoin, BitTorrent, Tor, sensor networks, etc.
  most critical infrastructure is distributed

Why distribute?
  to connect physically distributed entities and people
  to achieve security via physical isolation
  to tolerate faults via replication
  to scale up throughput via parallel CPUs/memory/disks/networks
But:
  complex, many concurrent components
  new classes of problems, e.g. partial failure, lack of a global clock,
    lack of a global view
  tricky to realize the performance/resilience potential
  Lamport: "A distributed system is one in which the failure of a computer
    you didn't even know existed can render your own computer unusable."

Why take this course?
  interesting -- hard problems, powerful solutions
  most real systems are distributed
    driven by the rise of big Web services and mobile/pervasive computing
  active research area -- lots of progress + big unsolved problems
  hands-on -- you'll build a real system in the labs

MAIN TOPICS

This is a course about infrastructure, to be used by applications.
It is about abstractions that seek to (partially) hide distribution from
applications. Three big kinds of resources: storage, communication,
computation.

[diagram: users/clients, network, compute servers, storage servers]

Topic: architecture
  Centralized (client or one server) or distributed control?
  Single data center or geo-distributed system? Wide-area is much more difficult.
  Client/server or peer-to-peer?
  These choices interact with performance, security, and fault behavior.

Topic: implementation
  How do clients/servers communicate?
  Direct network communication is painful.
  We want to hide the network details from the application logic.
  Most systems organize distribution with some structuring framework(s)
  (see the sketch below):
    remote procedure calls (RPC), remote method invocation (RMI),
    distributed shared memory (DSM), MapReduce, key-value stores, etc.
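To make the "structuring framework" idea concrete, here is a toy sketch of the
RPC pattern in standard C++: the server maps procedure names to handlers, and
the client's call() looks like an ordinary function call. This is not the
course's RPC library (that is covered in the next lectures); ToyRpcServer,
ToyRpcClient, and the single-string argument format are invented for
illustration, and the network transport is omitted entirely.

// Toy illustration of the RPC structuring idea (NOT the lab's RPC library).
// A server registers named handlers; a client "calls" a procedure by name.
// In a real RPC system the request/reply would be marshalled and sent over
// the network; here both ends live in one process to keep the sketch short.
#include <functional>
#include <iostream>
#include <map>
#include <string>

// A trivial "wire format": procedure name plus a single string argument.
struct Request { std::string proc; std::string arg; };
struct Reply   { std::string result; };

class ToyRpcServer {
public:
    // Register a handler under a procedure name.
    void reg(const std::string &proc,
             std::function<std::string(const std::string&)> handler) {
        handlers_[proc] = std::move(handler);
    }
    // Dispatch a request to the matching handler (the server-side core of RPC).
    Reply dispatch(const Request &req) const {
        auto it = handlers_.find(req.proc);
        if (it == handlers_.end()) return Reply{"ERROR: no such procedure"};
        return Reply{it->second(req.arg)};
    }
private:
    std::map<std::string, std::function<std::string(const std::string&)>> handlers_;
};

class ToyRpcClient {
public:
    explicit ToyRpcClient(ToyRpcServer &srv) : srv_(srv) {}
    // Looks like a local call; a real client stub would serialize the request,
    // send it over a socket, and wait for the reply.
    std::string call(const std::string &proc, const std::string &arg) {
        return srv_.dispatch(Request{proc, arg}).result;
    }
private:
    ToyRpcServer &srv_;
};

int main() {
    ToyRpcServer server;
    server.reg("echo",  [](const std::string &s) { return s; });
    server.reg("greet", [](const std::string &s) { return "hello, " + s; });

    ToyRpcClient client(server);
    std::cout << client.call("greet", "yfs") << "\n";  // prints: hello, yfs
    std::cout << client.call("echo",  "ping") << "\n"; // prints: ping
}

The point of the pattern: application code sees procedure calls, while the
framework owns marshalling, sockets, timeouts, and retries.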
Topic: performance
  Distribution can help: parallelism, pick a server near the client.
  Distribution can hurt: network bandwidth and latency bottlenecks.
  Lots of tricks, e.g. caching, threaded/event-driven servers.
  The dream: scalable throughput
    Nx servers -> Nx total throughput
    need a way to divide the load by N and divide the state by N
      split by user
      split by function
      split by data object
  Rarely perfect -> only scales so far
    non-parallelizable operations: initialization, summarization, interaction
    load imbalance
      one very active user, or one very popular file
      -> one server at 100%, added servers mostly idle
      -> Nx servers but only 1x performance
    stragglers: the slowest component determines throughput
  Some performance problems are not easily addressed by scaling,
    e.g. the latency of a single request.

Topic: fault tolerance
  1000s of servers and a complex network -> always something broken.
  We'd like to hide these failures from the application.
  We often want:
    availability -- the app can make progress despite failures
    durability -- the app will come back to life when failures are repaired
  Big idea: replicated servers.
    If one server crashes, the client can proceed using the other(s).

Topic: consistency
  General-purpose infrastructure needs well-defined behavior,
    e.g. "Get(k) yields the value from the most recent Put(k,v)."
  Achieving good behavior is hard!
    "Replica" servers are hard to keep identical:
      replicas may see updates in different orders
    clients may crash midway through a multi-step update
    servers crash at awkward moments, e.g. after executing but before replying
    a partitioned network may make live servers look dead to some clients;
      risk of "split brain"
  Consistency and performance are enemies.
    Consistency requires communication, e.g. to get the latest Put().
    "Strong consistency" often leads to slow systems.
    High performance often imposes "weak consistency" on applications.
  People have pursued many design points in this spectrum.

Topic: security
  Confidentiality, integrity, trust
  What if not all participants trust each other?
  What if some participants act selfishly?
  What if some participants are nosy?
  What if some participants try to disrupt the system for everyone?

LAB: YET ANOTHER FILE SYSTEM (YFS)

The lab is inspired by Frangipani, a scalable distributed file system (see
the paper). You will build a simplified version of Frangipani.

Frangipani goals
  aggregate many disks/servers into a single shared file system
  capacity can be added incrementally without stopping the system
    ("bricks" for storage)
  good balance of load/users across disks, with no manual balancing
  tolerates and recovers from machine, network, and disk failures
    without operator intervention
  consistent backups while the system is running

Frangipani design
  Each machine runs a Petal server, a Frangipani server, and a lock server.
    This is not strictly necessary, but it simplifies thinking about the system.
  Petal
    aggregates disks into one big virtual disk
    interface: put and get block
    replicates blocks/extents
    add Petal servers to increase storage capacity and throughput
  Frangipani file server
    serves file system requests and uses Petal to store data
    inodes, directories, and data blocks are all stored on Petal
    all servers serve the same file system
    add servers to scale up Frangipani throughput
  Lock server
    file servers use the lock service to provide consistency
    e.g., when creating a file, lock the directory, etc.
      (see the sketch after this section)
    locks are also replicated
  No security beyond traditional file system security
    intended for a cluster, not the wide area
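To make the locking pattern concrete, here is a hypothetical sketch of how a
Frangipani/YFS-style file server might create a directory entry: acquire the
directory's lock, read the directory from shared storage, modify it, write it
back, release. LockService and BlockStore below are invented in-process
stand-ins, not Petal and not the lab's lock_server/extent_server interfaces;
they exist only to show the call pattern.

// Hypothetical sketch of the lock-protected read-modify-write a shared file
// system needs when several file servers can update the same directory.
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Stand-in lock service: one mutex per lock name.
// (Not safe for concurrent registration of new names; a real lock service
//  is a separate server reached via RPC.)
class LockService {
public:
    void acquire(const std::string &name) { locks_[name].lock(); }
    void release(const std::string &name) { locks_[name].unlock(); }
private:
    std::map<std::string, std::mutex> locks_;
};

// Stand-in for Petal / an extent server: named blocks of bytes.
class BlockStore {
public:
    std::string get(const std::string &key) { return blocks_[key]; }
    void put(const std::string &key, const std::string &val) { blocks_[key] = val; }
private:
    std::map<std::string, std::string> blocks_;
};

// File-server-side logic: add an entry to a directory. The lock ensures that
// two file servers updating the same directory cannot interleave their
// read-modify-write sequences.
void create_file(LockService &ls, BlockStore &bs,
                 const std::string &dir, const std::string &name) {
    ls.acquire(dir);                       // lock the directory
    std::string contents = bs.get(dir);    // read the directory block(s)
    contents += name + "\n";               // add the new entry
    bs.put(dir, contents);                 // write the directory back
    ls.release(dir);                       // unlock
}

int main() {
    LockService ls;
    BlockStore bs;
    create_file(ls, bs, "/home", "notes.txt");
    create_file(ls, bs, "/home", "lab1.cc");
    std::cout << bs.get("/home");          // prints both directory entries
}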
Contrast: NFS (or AFS)
  Clients relay file system calls to the server.
    Clients are simple; they don't implement a file system.
    For performance, a client may implement a cache (or use the OS file
    system cache).
  The server implements the file system and exports its local disks.
  How do you scale an NFS server?
    buy more disks
    partition file systems over servers
    or buy a file server "mainframe"
  How do you make an NFS server fault tolerant?
    buy a RAID system or NAS

YFS
  Same basic structure as Frangipani, but with a single extent server,
    i.e. you don't have to implement Petal.
  Draw picture:
    yfs_client (interfaces with the OS through FUSE)
    extent_server
    lock_server
  Each server is written in C++ and pthreads, using our own RPC library.
  The next two lectures cover this infrastructure in detail.

Labs: build YFS incrementally
  L1: simple lock server
    RPC semantics
    programming with threads
  L2: yfs_server
    basic file server (no sharing, no storage)
  L3: yfs_client + extent_server
    reading/writing
  L4: yfs_client + extent_server + lock_server
    mkdir, unlink, and use of the lock server (sharing)
  L5: caching lock_server
    protocol design
  L6: caching extent_server
    consistency using the lock_server
  L7: Paxos library
    agreement protocol
  L8: fault-tolerant lock server
    replicated state machine using Paxos

Lab 1: simple lock server
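As a preview, here is a minimal sketch of the core state a simple lock server
needs, assuming the usual shape of Lab 1: acquire() blocks until the named
lock is free, release() wakes up waiters. The real lab server sits behind the
course's RPC library and is written with pthreads; this standalone version
uses std::thread and invented names (LockServer, string lock ids) purely for
illustration.

// Minimal lock-server core: a table of held locks, a mutex protecting it,
// and a condition variable to block/wake acquirers.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

class LockServer {
public:
    // Block until the named lock is free, then mark it held.
    void acquire(const std::string &lid) {
        std::unique_lock<std::mutex> ul(m_);
        cv_.wait(ul, [&] { return held_.count(lid) == 0; });
        held_.insert(lid);
    }
    // Mark the lock free and wake up waiting acquirers.
    void release(const std::string &lid) {
        std::lock_guard<std::mutex> lg(m_);
        held_.erase(lid);
        cv_.notify_all();
    }
private:
    std::mutex m_;                 // protects held_
    std::condition_variable cv_;   // signalled when a lock is released
    std::set<std::string> held_;   // names of currently held locks
};

int main() {
    LockServer ls;
    int counter = 0;

    // Two "clients" contend for the same lock; the lock serializes their
    // increments, so the final count is exactly 2000.
    auto client = [&] {
        for (int i = 0; i < 1000; i++) {
            ls.acquire("lock-a");
            counter++;             // protected by lock-a
            ls.release("lock-a");
        }
    };
    std::thread t1(client), t2(client);
    t1.join();
    t2.join();
    std::cout << "counter = " << counter << "\n";  // counter = 2000
}

In the lab, the same acquire/release logic has to behave correctly when the
requests arrive as concurrent RPCs, which is exactly the RPC-semantics and
threading practice Lab 1 is meant to give you.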