Lecture 8: Google File System

A distributed file system is one that distributes its data (and possibly also its metadata) across multiple hosts.

Advantages:
  Performance (higher throughput, for possibly higher latency)
  Scalability (more data can be stored than on a local disk)
  Reliability & availability (faults can be handled by replication)

Widely deployed examples: NFS, AFS

Design concerns

How to distribute data?
  Map each file to a single node, or
  Split each file into chunks, then map chunks to nodes.
  Chunk size?
    Large => less metadata, less need to fetch metadata
    Small => less contention for small, active files

How to manage distribution across machines?
  Addressing: How does a client find out where a particular file or file chunk is?
    Centralized or distributed
  Replication: Failures do happen. Replicate data.
    Centrally coordinated or ad-hoc
  Consistency: File replicas must remain consistent. Sequential consistency requires synchronization, which may be inefficient.

Application API:
  Standard POSIX API
  Standard POSIX API with extensions
  Custom, application-driven API

-------------

The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. SOSP '03.

GFS was designed with Google's specific needs in mind.
- Built from inexpensive commodity machines. These fail often; fault tolerance is essential.
- Modest number of large files. Metadata scalability is less of an issue.
- Two kinds of reads: large streaming (sequential) and small random. Typically the reader side of producer-consumer.
- Only one kind of write: large, sequential append. Other writes are uncommon. The producer side of producer-consumer, or large & wide merges (MapReduce's shuffle?).
- Concurrent appends from many clients. Consistency is important, but synchronization overhead should be minimized.
- High sustained bandwidth is more important than low latency.

-------

Design

Diagram (Figure 1 in the paper)

One master, several chunkservers. Each file is split into data chunks of 64 MB. Each chunk is replicated on multiple chunkservers (3 by default). Chunkservers store each chunk in a local file and so benefit directly from the local buffer cache. The master holds all metadata: the file namespace (directory hierarchy), the file-to-chunks mapping (a lookup table), and the addresses of the replicas of each chunk.

A client contacts the master when it opens a file to learn which chunks the file has and which replicas hold them. It then talks directly to chunkservers for reads and writes. It can go to any chunkserver (preferably the closest) for a read; writes follow a more elaborate protocol (see later).

Chunkservers periodically send Heartbeat messages to the master to say they are alive.

Questions:
  What is the disadvantage of such a large chunk size (64 MB)?
    For small files, it reduces distribution and can turn a few chunkservers into bottlenecks.
  Why has such a large chunk size been chosen?
    Because in most workloads, reads and writes are sequential. A large chunk size means less frequent interaction with the master. It also reduces the metadata held on the master, allowing it to fit in memory.

-------

Master metadata

Held entirely in main memory. Limited per-chunk and per-file information.

Organized as (1) a lookup table mapping path names to chunks, and (2) a separate map from chunks to replicas. (A small sketch of these two maps follows below.)

Updates to (1) are persisted in an /operation log/ atomically (locally + on remote replicas) at each update, before returning to the client. The operation log is periodically checkpointed in the background; only the portion of the log after the last checkpoint is relevant, and the rest is garbage-collected.

(2) is never persisted. Instead, periodic communication between the master and the chunkservers keeps it up to date.
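The two maps can be pictured with a minimal Python sketch. The names (Master, locate) and the dict-based layout are invented for illustration; the real master uses far more compact, prefix-compressed structures.

    # A toy model of the master's two in-memory maps:
    #   (1) full pathname -> ordered list of chunk handles
    #   (2) chunk handle  -> addresses of chunkservers holding a replica
    # All identifiers here are illustrative, not GFS's actual data structures.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

    class Master:
        def __init__(self):
            # (1) persisted via the operation log; here just a dict
            self.files = {}        # e.g. "/logs/web-00" -> ["c17", "c18"]
            # (2) never persisted; rebuilt from chunkserver heartbeats
            self.replicas = {}     # e.g. "c17" -> ["cs-a", "cs-b", "cs-c"]

        def locate(self, path, offset):
            """Return (chunk handle, replica addresses) for a byte offset in a file.

            This mirrors what a client asks the master on open/read: which
            chunk covers this offset, and which chunkservers hold it.
            """
            chunk_index = offset // CHUNK_SIZE
            handle = self.files[path][chunk_index]
            return handle, self.replicas[handle]

    if __name__ == "__main__":
        m = Master()
        m.files["/logs/web-00"] = ["c17", "c18"]
        m.replicas["c17"] = ["cs-a", "cs-b", "cs-c"]
        m.replicas["c18"] = ["cs-b", "cs-d", "cs-e"]
        # A read at offset 100 MB falls into the second 64 MB chunk.
        print(m.locate("/logs/web-00", 100 * 1024 * 1024))
        # -> ('c18', ['cs-b', 'cs-d', 'cs-e'])

The point of locate() is that a client turns any (path, byte offset) pair into a chunk handle with one integer division, so a single round trip to the master covers a whole 64 MB region of the file.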
GFS basically assumes that any chunkserver may fail, drop any of its chunks, or leave or join at any time. Data is lost only if all replicas of a chunk fail.

Suppose the master fails. How does the system recover?

Organization of (1): There is no per-directory metadata as in an ordinary FS. Full pathnames are mapped directly to metadata with a lookup table. Each distinct path prefix has a read lock and a write lock, used for snapshotting.

--------

Consistency model

Relaxed consistency, to allow for efficiency in the common workload (MapReduce). Files typically consist of sequences of records.

GFS primitively supports two kinds of mutations:

(Concurrent) append of records. Add a record somewhere after the logical end of the file and return its offset. Upon success, it is guaranteed that the record is written atomically /at the same offset/ on all replicas at least once (some replicas may have more than one copy, or may have padding gaps).

(Concurrent) writes on existing chunks. A client breaks a large or multi-chunk write into several single-chunk writes. These may get interleaved with the writes of other clients, in different orders on different chunks. So there isn't any strong consistency w.r.t. top-level writes.

------------

Mutation (write) protocol

At any time, every chunk has one /primary/ replica, designated by the master. This replica is said to hold the (write) /lease/ on the chunk. (Concurrent) writes on a chunk go through its primary, which establishes a total order on them.

Write protocol: Update(chunk, offset, data):
  1. Client asks the master for all replicas of the chunk and which one is the primary.
  2. Client sends the data to all replicas.
  3. Client sends the update request to the primary. The primary collects such update requests from all concurrent clients, totally orders them (any way it likes), and forwards the order to the replicas.
  4. All replicas (incl. the primary) apply the changes in the order created in the previous step. They respond to the primary as each update is applied.
  5. The primary responds to the client once all replicas confirm its update. It responds with failure if any replica fails or does not respond (the client can retry).

Append protocol: Append(file, record) --> returns the offset at which the record is appended on all replicas.
  Same as writing to the last chunk of the file if it has space. If there is not enough space, pad to the end of that chunk, add a new chunk (a metadata operation, so the master is involved), and try again on the new chunk.
  A failure from any replica causes a top-level failure, but other replicas may already have written the record. Hence, a retry will duplicate the record on those replicas. A success with a returned offset means that the record exists on all replicas at that offset.
  (A toy sketch of the append retry path appears after the snapshotting section below.)

Questions:
  What is the exact consistency guarantee?
  How do readers deal with duplicate records and inconsistent data in files?
  What happens if the primary that holds the write lease fails?
  Why is it necessary to totally order concurrent updates on each chunk?
  Why is data sent directly to all replicas, when requests are routed through the primary?
  How does GFS deal with very long writes, or writes that span chunks?

-----------

Snapshotting

Create a frozen copy of a part of the file system (a file or a directory). Usually for backup, or for working on a consistent dataset.

Protocol using efficient copy-on-write:
- The master revokes all leases on the chunks belonging to the part to be snapshotted.
- It duplicates the metadata in its b-tree, pointing at the same chunks as the original, and then re-issues leases.
- Upon a subsequent write request for a chunk, it notices that the chunk's reference count (no. of incoming pointers) is more than 1. It duplicates the chunk (on the same servers as the original) and updates the b-tree.
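The copy-on-write step above can be illustrated with a small sketch, assuming a per-chunk reference count. The names (CowMaster, chunk_for_write) are invented here, and the dict-based namespace stands in for the master's real b-tree.

    # Toy copy-on-write snapshot: duplicate metadata now, duplicate chunks
    # lazily on the first write that touches a shared chunk.
    import itertools

    class CowMaster:
        def __init__(self):
            self.files = {}       # path -> list of chunk handles
            self.refcount = {}    # chunk handle -> number of files pointing at it
            self._ids = itertools.count()

        def create(self, path, num_chunks):
            handles = [f"c{next(self._ids)}" for _ in range(num_chunks)]
            self.files[path] = handles
            for h in handles:
                self.refcount[h] = 1

        def snapshot(self, src, dst):
            # (Lease revocation on src's chunks would happen first.)
            # Duplicate only the metadata: dst points at the same chunks as src.
            self.files[dst] = list(self.files[src])
            for h in self.files[dst]:
                self.refcount[h] += 1

        def chunk_for_write(self, path, index):
            """On the first write after a snapshot: if the chunk is shared,
            copy it (on the same chunkservers) and point this file at the copy."""
            h = self.files[path][index]
            if self.refcount[h] > 1:
                new_h = f"c{next(self._ids)}"   # chunkservers copy the data locally
                self.refcount[h] -= 1
                self.refcount[new_h] = 1
                self.files[path][index] = new_h
                h = new_h
            return h

    if __name__ == "__main__":
        m = CowMaster()
        m.create("/db/table", 2)
        m.snapshot("/db/table", "/db/table.snap")
        print(m.files)                     # both paths share c0, c1
        m.chunk_for_write("/db/table", 0)  # first write after the snapshot
        print(m.files)                     # /db/table now has its own copy of chunk 0

The snapshot itself touches only metadata; the expensive chunk copies happen later, and only for chunks that are actually written.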
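Returning to the record append protocol described above: the following toy sketch (invented Replica/Primary classes, no real RPCs) shows why a successful append means "at least once, at the same offset, on every replica", and why failed attempts leave duplicates or gaps that readers must tolerate.

    # Toy model of record append with one transient replica failure.
    class Replica:
        def __init__(self, name, fail_once=False):
            self.name = name
            self.data = []              # list of records; the index plays the role of offset
            self.fail_once = fail_once

        def apply(self, offset, record):
            if self.fail_once:          # simulate one transient failure
                self.fail_once = False
                return False
            while len(self.data) < offset:   # pad so the record lands at the chosen offset
                self.data.append(None)
            self.data.append(record)
            return True

    class Primary:
        def __init__(self, replicas):
            self.replicas = replicas
            self.next_offset = 0

        def append_record(self, record):
            # The primary picks the offset and forwards it to every replica.
            offset = self.next_offset
            self.next_offset += 1
            results = [r.apply(offset, record) for r in self.replicas]
            return all(results), offset

    def client_append(primary, record, max_retries=3):
        for _ in range(max_retries):
            ok, offset = primary.append_record(record)
            if ok:
                return offset           # record now exists at this offset on ALL replicas
        raise RuntimeError("append failed")

    if __name__ == "__main__":
        reps = [Replica("cs-a"), Replica("cs-b", fail_once=True), Replica("cs-c")]
        off = client_append(Primary(reps), "record-1")
        print("returned offset:", off)            # 1 (the second attempt succeeded)
        print({r.name: r.data for r in reps})
        # cs-a and cs-c hold record-1 twice; cs-b has a gap (None) at offset 0.

The returned offset refers only to the successful attempt; the duplicates and padding left by the failed attempt are exactly the "defined but interspersed with inconsistent regions" state that GFS readers must handle (e.g. by record checksums or unique IDs at the application level).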
------------

Replica management

Done entirely by the master. Three primary goals: (1) spread disk utilization evenly, (2) spread the replicas of a chunk across racks (to cope with network and switch failures), (3) do not concentrate active chunks on the same machine.

The master takes these into account when creating a chunk, when re-replicating data (after some chunkserver leaves or dies), and when rebalancing (usually for disk space or request load).

-------------

Garbage collection

Eager => reclaim chunkserver storage on file delete
Periodic => reclaim periodically
Lazy => reclaim only when needed

Local file systems are typically eager. GFS is periodic.

The master scans its metadata periodically, identifies orphaned chunks (those not pointed to by any file), and erases their metadata. It conveys this to the chunkservers in its regular heartbeat responses; they can then reclaim the local space.

A deleted file is at first only renamed. Its metadata is removed (and its chunks orphaned) in the next periodic scan that runs 3 days after deletion.

It is possible for applications to request eager collection on specific subsets of the namespace. Needed when storage space is tight.

----------------

Stale replica detection

What to do if a replica is down during an update?

The master keeps a version number for each chunk, incremented whenever a lease is granted on it. Chunkservers persist the version numbers of their chunks (before applying the update). If a replica is down during an update, its version number does not get updated. When it comes back up, the master notices this from the chunk list the chunkserver reports. (A small sketch appears at the end of these notes.)

--------------

Handling data corruption

If data is corrupted at a replica, a client may read the wrong data. To guard against this, GFS uses checksums (on 64 KB blocks) to detect corruption. In case of corruption, the client is redirected to another replica and the chunk is re-replicated.
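The stale replica check described above can be sketched as follows, assuming the master bumps a chunk's version whenever it grants a lease and compares it against what a rejoining chunkserver reports. All names here are invented for illustration.

    # Toy version-number bookkeeping for stale replica detection.
    class VersionMaster:
        def __init__(self):
            self.version = {}          # chunk handle -> current version

        def grant_lease(self, handle, live_replicas):
            """Bump the version and tell the live replicas before the mutation."""
            self.version[handle] = self.version.get(handle, 0) + 1
            for r in live_replicas:
                r.versions[handle] = self.version[handle]

        def check_report(self, chunkserver):
            """Called when a chunkserver reports its chunk list (e.g. on restart)."""
            stale = [h for h, v in chunkserver.versions.items()
                     if v < self.version.get(h, 0)]
            return stale               # stale replicas are later garbage-collected

    class Chunkserver:
        def __init__(self, name):
            self.name = name
            self.versions = {}         # chunk handle -> version it holds (persisted)

    if __name__ == "__main__":
        master = VersionMaster()
        a, b, c = Chunkserver("cs-a"), Chunkserver("cs-b"), Chunkserver("cs-c")
        master.grant_lease("c42", [a, b, c])   # version 1 everywhere
        # cs-c is down during the next mutation:
        master.grant_lease("c42", [a, b])      # version 2 on cs-a and cs-b only
        print(master.check_report(c))          # ['c42']  -> cs-c's copy is stale

Because the version bump happens before any data is written, a replica that misses the mutation necessarily reports an old version when it rejoins.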
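A small sketch of the per-block checksum idea. The 64 KB block size is from the paper; the use of zlib.crc32 here is an illustrative choice, not GFS's actual checksum function.

    # Per-64KB-block checksums: verify only the blocks that a read overlaps.
    import zlib

    BLOCK = 64 * 1024   # 64 KB

    def checksum_blocks(data: bytes):
        """Return one CRC per 64 KB block of a chunk's data."""
        return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

    def verify_range(data: bytes, checksums, offset: int, length: int) -> bool:
        """Before serving a read, re-check the blocks that overlap the range."""
        first = offset // BLOCK
        last = (offset + length - 1) // BLOCK
        return all(zlib.crc32(data[i * BLOCK:(i + 1) * BLOCK]) == checksums[i]
                   for i in range(first, last + 1))

    if __name__ == "__main__":
        chunk = bytes(200 * 1024)                  # 200 KB of zeros -> 4 blocks
        sums = checksum_blocks(chunk)
        print(verify_range(chunk, sums, 70 * 1024, 10))               # True
        corrupted = chunk[:80 * 1024] + b"\x01" + chunk[80 * 1024 + 1:]
        print(verify_range(corrupted, sums, 70 * 1024, 10 * 1024))    # False -> read another replica

Verification is local to each chunkserver and only covers the blocks a read touches, so it adds little overhead to large sequential reads while still catching corruption before it reaches a client.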