Lecture 1: Introduction and lab overview

COURSE STRUCTURE

All announcements and info at http://courses.mpi-sws.org/ds-ws16/
  look at the first lab, due on Nov 11
Meetings: lectures, paper discussions
Office hours/tutorials: clarifications, project help
Readings: research papers as case studies, starting with the 2nd lecture
  please read papers before class; otherwise you lack context,
  and you can't pick it up by listening
Exams: midterm, term-end, repeat exam
Project: build an increasingly sophisticated fault-tolerant file service
  (similar to Frangipani)
TAs: Viktor Erdelyi, Arpan Gujarati, Riju Sen
  office hours are shown on the Web

INTRODUCTION

What is a distributed system?
  multiple connected computers that cooperate to provide some service
  examples: DNS, network file service, MapReduce/Hadoop, Dropbox,
    Bitcoin, BitTorrent, Tor, etc.

Why distribute?
  to connect physically distributed entities and people
  to achieve security via physical isolation
  to tolerate faults via replication at separate sites
  to scale up throughput via parallel CPUs/mem/disk/net
But:
  complex, many concurrent components
  new classes of problems, e.g. partial failure, unsynchronized clocks
  tricky to realize the performance/resilience potential
Lamport: "A distributed system is one in which the failure of a computer
  you didn't even know existed can render your own computer unusable."
Advice: don't distribute if a central system will work.

Why take this course?
  interesting -- hard problems, non-obvious solutions
  active research area -- lots of progress + big unsolved problems
  most real systems are distributed -- unlike 20 years ago
    driven by the rise of big Web services and mobile/pervasive computing
  hands-on -- you'll build a real system in the labs

MAIN TOPICS

Example: a shared file system, so users can cooperate -- like Dropbox
  but this lecture isn't about Dropbox specifically
  just an example; the goal is to get a feel for distributed system problems
  lots of client computers
  [diagram: clients, network, vague set of servers]

Topic: architecture
  Choice of interfaces:
    monolithic file server?
    block server(s) -> FS logic in clients?
    separate naming + file servers?
    separate FS + block servers?
  Single machine room or unified wide-area system?
    wide-area is dramatically more difficult
  Client/server or peer-to-peer?
  Interacts w/ performance, security, fault behavior.

Topic: implementation
  How do clients/servers communicate?
    direct network communication is pretty painful
    want to hide network stuff from application logic
  Most systems organize distribution with some structuring framework(s):
    remote procedure calls (RPC), remote object invocation (RMI),
    distributed shared memory (DSM), MapReduce, etc.
    (an RPC sketch follows the next topic)

Topic: performance
  Distribution can hurt: network b/w and latency bottlenecks
    lots of tricks, e.g. caching, threaded servers
  Distribution can help: parallelism, pick a server near the client
  Idea: scalable design
    Nx servers -> Nx total performance
    need a way to divide the load by N == divide the state by N
      split by user
      split by file name
      "sharding" or "partitioning" (see the sketch below)
    rarely perfect -> only scales so far
      global operations, e.g. search
      load imbalance:
        one very active user
        one very popular file
        -> one server at 100%, added servers mostly idle
        -> Nx servers -> 1x performance
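To make the "hide network stuff" point concrete, here is a minimal sketch of
the client side of an RPC call. All names (rpc_client, call, the GET
procedure number, the server address) are made up for illustration; the
course's own RPC library, which the labs use, is covered in the next two
lectures.

    #include <iostream>
    #include <string>

    // Hypothetical client stub.  A real library would marshal the
    // arguments into a message, send it over the network, and block
    // for the reply; here the network step is faked so the sketch runs.
    class rpc_client {
    public:
        int call(int proc, const std::string &arg, std::string &ret) {
            // real version: serialize (proc, arg), send, recv, deserialize
            ret = "reply to proc " + std::to_string(proc) + "(" + arg + ")";
            return 0;  // real version: nonzero on timeout or connection loss
        }
    };

    int main() {
        rpc_client cl;  // real version: rpc_client cl("server-host", port);
        std::string data;
        // Reads like a local function call, but the work happens on the
        // server -- and unlike a local call, it can fail or time out:
        if (cl.call(/*GET*/ 1, "/grades.txt", data) != 0)
            return 1;
        std::cout << data << "\n";
    }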
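And a minimal sketch of sharding by file name (server names are
hypothetical): hashing spreads files roughly evenly, so each of the N
servers holds ~1/N of the files and sees ~1/N of the requests.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hash-partition ("shard") files across servers: the hash of the
    // file name picks which server stores and serves that file.
    std::string pick_server(const std::string &fname,
                            const std::vector<std::string> &servers) {
        size_t h = std::hash<std::string>{}(fname);
        return servers[h % servers.size()];
    }

    int main() {
        std::vector<std::string> servers = {"fs0", "fs1", "fs2"};
        std::cout << pick_server("/grades.txt", servers) << "\n";
    }

Note that this doesn't fix load imbalance: one very popular file still maps
to a single server, no matter how many servers you add.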
Topic: fault tolerance
  Dropbox: ~10,000 servers; some fail
  http://www.datacenterknowledge.com/archives/2013/10/23/how-dropbox-stores-stuff-for-200-million-users/
  Can I use my files if there's a failure
    of some part of the network, or of some set of servers?
  Maybe: replicate the data on multiple servers
    perhaps the client sends every operation to both
    maybe it only needs to wait for one reply
    (see the replication sketch after the next topic)
  Opportunity: operate from the two "replicas" independently if partitioned?
  Opportunity: can 2 servers yield 2x availability AND 2x performance?

Topic: consistency
  == a contract w/ apps/users about the meaning of operations
    e.g. "a read yields the most recently written value"
  hard due to partial failure, replication/caching, concurrency
  Problem: keeping replicas identical
    if one is down, it will miss operations
      must be brought up to date after reboot
    if the net is broken, *both* replicas may be live, and see different ops
      delete a file, it's still visible via the other replica
      "split brain" -- usually bad
  Problem: clients may see updates in different orders
    due to caching or replication
    I make grades.txt unreadable, then the TA writes grades to it
    what if the operations run in different orders on different replicas?
  Consistency often hurts performance (communication, blocking)
    many systems cut corners -- "relaxed consistency"
    shifts the burden to applications
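A sketch of the "send every operation to both replicas" idea from the
fault-tolerance topic, with a faked send() standing in for a real RPC
(replica names and function names are made up):

    #include <future>
    #include <iostream>
    #include <string>

    // Stand-in for an RPC to one replica; pretend the operation succeeded.
    bool send(std::string replica, std::string op) {
        return true;
    }

    // Send every operation to both replicas, issuing the two RPCs
    // concurrently.  A client that proceeds after the *first* ack gets
    // lower latency, but the slower replica silently falls behind; and
    // under a network partition each replica can accept different
    // operations ("split brain") -- exactly the consistency problems
    // described above.
    bool do_op(const std::string &op) {
        auto a = std::async(std::launch::async, send, "replica-A", op);
        auto b = std::async(std::launch::async, send, "replica-B", op);
        return a.get() && b.get();   // conservative: wait for both acks
    }

    int main() {
        std::cout << do_op("put /grades.txt") << "\n";
    }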
LAB: YET ANOTHER FILE SYSTEM (YFS)

The lab is inspired by Frangipani, a scalable distributed file system
(see paper). You will build a simplified version of Frangipani.

Frangipani goals:
  aggregate many disks into a single shared file system
  capacity can be added incrementally, without stopping the system
    bricks for storage
  good balance of load/users across disks
    no manual assignment
  tolerates and recovers from machine, network, and disk failures
    without operator intervention
  consistent backups while the system is running

Frangipani design:
  Each machine runs a Petal server, a Frangipani server, and a lock server
    not strictly necessary, but it simplifies thinking about the system
  Petal: aggregates disks into one big virtual disk
    interface: put and get
    replicates blocks/extents
    add Petal servers to increase storage capacity and throughput
  Frangipani: file server
    serves file system requests, using the shared Petal to store data
      inodes, directories, and data blocks are all stored on Petal
    all servers serve the same file system
    add servers to scale up Frangipani
  Lock server:
    file servers use the lock server to provide consistency
      e.g., when creating a file, lock the directory, etc.
    locks are also replicated
  No security beyond traditional file system security
    intended for a cluster, not the wide area

Contrast: NFS (or AFS)
  Clients relay file system calls to the server
    clients are simple; they don't implement a file system
    for performance, a client may implement a cache
      (or use the OS file system cache)
  Server implements the file system and exports its local disks
  How do you scale an NFS server?
    buy more disks
    partition file systems over servers
    or buy a file server "mainframe"
  How do you make an NFS server fault tolerant?
    buy a RAID system or NAS

YFS
  Same basic structure as Frangipani, but a single extent server
    i.e. you don't have to implement Petal
  Draw picture:
    yfs_client (interfaces with the OS through FUSE)
    extent_server
    lock_server
  Each server is written in C++ with pthreads, using our own RPC library
  The next two lectures cover this infrastructure in detail

Labs: build YFS incrementally
  L1: simple lock server
    RPC semantics
    programming with threads
  L2: yfs_server
    basic file server (no sharing, no storage)
  L3: yfs_client + extent_server
    reading/writing
  L4: yfs_client + extent_server + lock_server
    mkdir, unlink, and use the lock server (sharing)
  L5: caching lock_server
    protocol design
  L6: caching extent_server
    consistency using the lock_server
  L7: Paxos library
    agreement protocol
  L8: fault-tolerant lock server
    replicated state machine using Paxos

Lab 1: simple lock server
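As a preview of Lab 1, here is a minimal sketch of a lock server's core
logic in C++ with pthreads, with the RPC layer omitted. The class name,
method names, and lock-id type are guesses for illustration, not the lab
handout's actual interface.

    #include <pthread.h>
    #include <map>

    // One mutex protects the table of lock states; a condition variable
    // lets acquire() block until the requested lock becomes free.
    class lock_server {
        pthread_mutex_t m;
        pthread_cond_t  c;
        std::map<unsigned long long, bool> held;  // lock id -> held?
    public:
        lock_server() {
            pthread_mutex_init(&m, nullptr);
            pthread_cond_init(&c, nullptr);
        }
        void acquire(unsigned long long lid) {
            pthread_mutex_lock(&m);
            while (held[lid])                 // wait until the lock is free
                pthread_cond_wait(&c, &m);
            held[lid] = true;
            pthread_mutex_unlock(&m);
        }
        void release(unsigned long long lid) {
            pthread_mutex_lock(&m);
            held[lid] = false;
            pthread_cond_broadcast(&c);       // wake any blocked acquirers
            pthread_mutex_unlock(&m);
        }
    };

In the lab, acquire and release are invoked via RPC rather than called
directly, which is where the RPC semantics questions come in.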