Lecture 1: Introduction and lab overview

COURSE STRUCTURE

All announcements and info at http://courses.mpi-sws.org/ds-ws18/
  look at the first lab, due on Nov 2
Meetings: lectures, paper discussions
Office hours/tutorials: clarifications, project help
Readings: research papers as case studies, starting with the 2nd lecture
  please read the papers before class; otherwise you will lack context,
  and you won't be able to pick it up just by listening
Midterm, final, repeat exam
Project: build an increasingly sophisticated fault-tolerant file service
  (similar to Frangipani)
  deeper understanding of core techniques
  experience with distributed programming
TAs: Mohamed Alzayat, Malte Appel, Roberta de Viti
  office hours will be shown on the Web

INTRODUCTION

What is a distributed system?
  multiple cooperating computers providing a service
  examples: DNS, e-mail, scalable/distributed storage, MapReduce/Hadoop,
    Dropbox, Bitcoin, BitTorrent, Tor, sensor networks, etc.
  most critical infrastructure is distributed

Why distribute?
  to connect physically distributed entities and people
  to achieve security via physical isolation
  to tolerate faults via replication
  to scale up throughput via parallel CPUs/memory/disks/networks
But:
  complex, many concurrent components
  new classes of problems, e.g. partial failure, lack of a global clock,
    lack of a global view
  tricky to realize the performance/resilience potential
  Lamport: "A distributed system is one in which the failure of a computer
    you didn't even know existed can render your own computer unusable."

Why take this course?
  interesting -- hard problems, powerful solutions
  most real systems are distributed
    driven by the rise of big Web services and mobile/pervasive computing
  active research area -- lots of progress + big unsolved problems
  hands-on -- you'll build a real system in the labs

MAIN TOPICS

This is a course about infrastructure, to be used by applications.
It is about abstractions that seek to (partially) hide distribution from
applications. Three big kinds of resources: storage, communication,
computation.

[diagram: users/clients, network, compute servers, storage servers]

Topic: architecture
  Centralized (client or one server) or distributed control?
  Single data center or geo-distributed system? Wide-area is much more difficult.
  Client/server or peer-to-peer?
  These choices interact with performance, security, and fault behavior.

Topic: implementation
  How do clients/servers communicate?
  Direct network communication is painful.
  We want to hide the network details from the application logic.
  Most systems organize distribution with some structuring framework(s)
  (see the sketch below):
    remote procedure calls (RPC), remote method invocation (RMI),
    distributed shared memory (DSM), MapReduce, key-value stores, etc.
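To make the "structuring framework" idea concrete, here is a toy sketch of the
RPC pattern in standard C++: the server maps procedure names to handlers, and
the client's call() looks like an ordinary function call. This is not the
course's RPC library (that is covered in the next lectures); ToyRpcServer,
ToyRpcClient, and the single-string argument format are invented for
illustration, and the network transport is omitted entirely.

// Toy illustration of the RPC structuring idea (NOT the lab's RPC library).
// A server registers named handlers; a client "calls" a procedure by name.
// In a real RPC system the request/reply would be marshalled and sent over
// the network; here both ends live in one process to keep the sketch short.
#include <functional>
#include <iostream>
#include <map>
#include <string>

// A trivial "wire format": procedure name plus a single string argument.
struct Request { std::string proc; std::string arg; };
struct Reply   { std::string result; };

class ToyRpcServer {
public:
    // Register a handler under a procedure name.
    void reg(const std::string &proc,
             std::function<std::string(const std::string&)> handler) {
        handlers_[proc] = std::move(handler);
    }
    // Dispatch a request to the matching handler (the server-side core of RPC).
    Reply dispatch(const Request &req) const {
        auto it = handlers_.find(req.proc);
        if (it == handlers_.end()) return Reply{"ERROR: no such procedure"};
        return Reply{it->second(req.arg)};
    }
private:
    std::map<std::string, std::function<std::string(const std::string&)>> handlers_;
};

class ToyRpcClient {
public:
    explicit ToyRpcClient(ToyRpcServer &srv) : srv_(srv) {}
    // Looks like a local call; a real client stub would serialize the request,
    // send it over a socket, and wait for the reply.
    std::string call(const std::string &proc, const std::string &arg) {
        return srv_.dispatch(Request{proc, arg}).result;
    }
private:
    ToyRpcServer &srv_;
};

int main() {
    ToyRpcServer server;
    server.reg("echo",  [](const std::string &s) { return s; });
    server.reg("greet", [](const std::string &s) { return "hello, " + s; });

    ToyRpcClient client(server);
    std::cout << client.call("greet", "yfs") << "\n";  // prints: hello, yfs
    std::cout << client.call("echo",  "ping") << "\n"; // prints: ping
}

The point of the pattern: application code sees procedure calls, while the
framework owns marshalling, sockets, timeouts, and retries.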
Topic: performance
  Distribution can help: parallelism, pick a server near the client.
  Distribution can hurt: network bandwidth and latency bottlenecks.
  Lots of tricks, e.g. caching, threaded/event-driven servers.
  The dream: scalable throughput
    Nx servers -> Nx total throughput
    need a way to divide the load by N and divide the state by N
      split by user
      split by function
      split by data object
  Rarely perfect -> only scales so far
    non-parallelizable operations: initialization, summarization, interaction
    load imbalance
      one very active user, or one very popular file
      -> one server at 100%, added servers mostly idle
      -> Nx servers but only 1x performance
    stragglers: the slowest component determines throughput
  Some performance problems are not easily addressed by scaling,
    e.g. the latency of a single request.

Topic: fault tolerance
  1000s of servers and a complex network -> always something broken.
  We'd like to hide these failures from the application.
  We often want:
    availability -- the app can make progress despite failures
    durability -- the app will come back to life when failures are repaired
  Big idea: replicated servers.
    If one server crashes, the client can proceed using the other(s).

Topic: consistency
  General-purpose infrastructure needs well-defined behavior,
    e.g. "Get(k) yields the value from the most recent Put(k,v)."
  Achieving good behavior is hard!
    "Replica" servers are hard to keep identical:
      replicas may see updates in different orders
    clients may crash midway through a multi-step update
    servers crash at awkward moments, e.g. after executing but before replying
    a partitioned network may make live servers look dead to some clients;
      risk of "split brain"
  Consistency and performance are enemies.
    Consistency requires communication, e.g. to get the latest Put().
    "Strong consistency" often leads to slow systems.
    High performance often imposes "weak consistency" on applications.
  People have pursued many design points in this spectrum.

Topic: security
  Confidentiality, integrity, trust
  What if not all participants trust each other?
  What if some participants act selfishly?
  What if some participants are nosy?
  What if some participants try to disrupt the system for everyone?

LAB: YET ANOTHER FILE SYSTEM (YFS)

The lab is inspired by Frangipani, a scalable distributed file system (see
the paper). You will build a simplified version of Frangipani.

Frangipani goals
  aggregate many disks/servers into a single shared file system
  capacity can be added incrementally without stopping the system
    ("bricks" for storage)
  good balance of load/users across disks, with no manual balancing
  tolerates and recovers from machine, network, and disk failures
    without operator intervention
  consistent backups while the system is running

Frangipani design
  Each machine runs a Petal server, a Frangipani server, and a lock server.
    This is not strictly necessary, but it simplifies thinking about the system.
  Petal
    aggregates disks into one big virtual disk
    interface: put and get block
    replicates blocks/extents
    add Petal servers to increase storage capacity and throughput
  Frangipani file server
    serves file system requests and uses Petal to store data
    inodes, directories, and data blocks are all stored on Petal
    all servers serve the same file system
    add servers to scale up Frangipani throughput
  Lock server
    file servers use the lock service to provide consistency
    e.g., when creating a file, lock the directory, etc.
      (see the sketch after this section)
    locks are also replicated
  No security beyond traditional file system security
    intended for a cluster, not the wide area
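To make the locking pattern concrete, here is a hypothetical sketch of how a
Frangipani/YFS-style file server might create a directory entry: acquire the
directory's lock, read the directory from shared storage, modify it, write it
back, release. LockService and BlockStore below are invented in-process
stand-ins, not Petal and not the lab's lock_server/extent_server interfaces;
they exist only to show the call pattern.

// Hypothetical sketch of the lock-protected read-modify-write a shared file
// system needs when several file servers can update the same directory.
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Stand-in lock service: one mutex per lock name.
// (Not safe for concurrent registration of new names; a real lock service
//  is a separate server reached via RPC.)
class LockService {
public:
    void acquire(const std::string &name) { locks_[name].lock(); }
    void release(const std::string &name) { locks_[name].unlock(); }
private:
    std::map<std::string, std::mutex> locks_;
};

// Stand-in for Petal / an extent server: named blocks of bytes.
class BlockStore {
public:
    std::string get(const std::string &key) { return blocks_[key]; }
    void put(const std::string &key, const std::string &val) { blocks_[key] = val; }
private:
    std::map<std::string, std::string> blocks_;
};

// File-server-side logic: add an entry to a directory. The lock ensures that
// two file servers updating the same directory cannot interleave their
// read-modify-write sequences.
void create_file(LockService &ls, BlockStore &bs,
                 const std::string &dir, const std::string &name) {
    ls.acquire(dir);                       // lock the directory
    std::string contents = bs.get(dir);    // read the directory block(s)
    contents += name + "\n";               // add the new entry
    bs.put(dir, contents);                 // write the directory back
    ls.release(dir);                       // unlock
}

int main() {
    LockService ls;
    BlockStore bs;
    create_file(ls, bs, "/home", "notes.txt");
    create_file(ls, bs, "/home", "lab1.cc");
    std::cout << bs.get("/home");          // prints both directory entries
}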
Contrast: NFS (or AFS)
  Clients relay file system calls to the server.
    Clients are simple; they don't implement a file system.
    For performance, a client may implement a cache (or use the OS file
    system cache).
  The server implements the file system and exports its local disks.
  How do you scale an NFS server?
    buy more disks
    partition file systems over servers
    or buy a file server "mainframe"
  How do you make an NFS server fault tolerant?
    buy a RAID system or NAS

YFS
  Same basic structure as Frangipani, but with a single extent server,
    i.e. you don't have to implement Petal.
  Draw picture:
    yfs_client (interfaces with the OS through FUSE)
    extent_server
    lock_server
  Each server is written in C++ and pthreads, using our own RPC library.
  The next two lectures cover this infrastructure in detail.

Labs: build YFS incrementally
  L1: simple lock server
    RPC semantics
    programming with threads
  L2: yfs_server
    basic file server (no sharing, no storage)
  L3: yfs_client + extent_server
    reading/writing
  L4: yfs_client + extent_server + lock_server
    mkdir, unlink, and use of the lock server (sharing)
  L5: caching lock_server
    protocol design
  L6: caching extent_server
    consistency using the lock_server
  L7: Paxos library
    agreement protocol
  L8: fault-tolerant lock server
    replicated state machine using Paxos

Lab 1: simple lock server
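As a preview, here is a minimal sketch of the core state a simple lock server
needs, assuming the usual shape of Lab 1: acquire() blocks until the named
lock is free, release() wakes up waiters. The real lab server sits behind the
course's RPC library and is written with pthreads; this standalone version
uses std::thread and invented names (LockServer, string lock ids) purely for
illustration.

// Minimal lock-server core: a table of held locks, a mutex protecting it,
// and a condition variable to block/wake acquirers.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

class LockServer {
public:
    // Block until the named lock is free, then mark it held.
    void acquire(const std::string &lid) {
        std::unique_lock<std::mutex> ul(m_);
        cv_.wait(ul, [&] { return held_.count(lid) == 0; });
        held_.insert(lid);
    }
    // Mark the lock free and wake up waiting acquirers.
    void release(const std::string &lid) {
        std::lock_guard<std::mutex> lg(m_);
        held_.erase(lid);
        cv_.notify_all();
    }
private:
    std::mutex m_;                 // protects held_
    std::condition_variable cv_;   // signalled when a lock is released
    std::set<std::string> held_;   // names of currently held locks
};

int main() {
    LockServer ls;
    int counter = 0;

    // Two "clients" contend for the same lock; the lock serializes their
    // increments, so the final count is exactly 2000.
    auto client = [&] {
        for (int i = 0; i < 1000; i++) {
            ls.acquire("lock-a");
            counter++;             // protected by lock-a
            ls.release("lock-a");
        }
    };
    std::thread t1(client), t2(client);
    t1.join();
    t2.join();
    std::cout << "counter = " << counter << "\n";  // counter = 2000
}

In the lab, the same acquire/release logic has to behave correctly when the
requests arrive as concurrent RPCs, which is exactly the RPC-semantics and
threading practice Lab 1 is meant to give you.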