Lecture 8: Google File System

A distributed file system is one that distributes its data (and possibly also its metadata) across multiple hosts.

Advantages:
  Performance (higher throughput, for possibly higher latency)
  Scalability (more data can be stored than on a local disk)
  Reliability & availability (faults can be handled by replication)

Widely deployed examples: NFS, AFS

Design concerns

How to distribute data?
  Map each file to a single node, or
  Split each file into chunks, then map chunks to nodes.
  Chunk size?
    Large => less metadata, less need to fetch metadata
    Small => less contention for small, active files

How to manage distribution across machines?
  Addressing: How does a client find out where a particular file or file chunk is?
    Centralized or distributed
  Replication: Failures do happen. Replicate data.
    Centrally coordinated or ad-hoc
  Consistency: File replicas must remain consistent. Sequential consistency requires synchronization, which may be inefficient.

Application API:
  Standard POSIX API
  Standard POSIX API with extensions
  Custom, application-driven API

-------------

The Google File System. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. SOSP '03.

GFS was designed with Google's specific needs in mind.
- Built from inexpensive commodity machines. These fail often; fault tolerance is essential.
- Modest number of large files. Metadata scalability is less of an issue.
- Two kinds of reads: large streaming (sequential) and small random. Typically the reader side of producer-consumer.
- Only one kind of write: large, sequential append. Other writes are uncommon. The producer side of producer-consumer, or large & wide merges (MapReduce's shuffle?).
- Concurrent appends from many clients. Consistency is important, but synchronization overhead should be minimized.
- High sustained bandwidth is more important than low latency.

-------

Design

Diagram (Figure 1 in the paper)

One master, several chunkservers. Each file is split into data chunks of 64 MB. Each chunk is replicated on multiple chunkservers (3 by default). Chunkservers store each chunk in a local file and so benefit directly from the local buffer cache. The master holds all metadata: the file namespace (directory hierarchy), the file-to-chunks mapping (a lookup table), and the addresses of the replicas of each chunk.

A client contacts the master when it opens a file to learn which chunks the file has and which replicas hold them. It then talks directly to chunkservers for reads and writes. It can go to any chunkserver (preferably the closest) for a read; writes follow a more elaborate protocol (see later).

Chunkservers periodically send Heartbeat messages to the master to say they are alive.

Questions:
  What is the disadvantage of such a large chunk size (64 MB)?
    For small files, it reduces distribution and can turn a few chunkservers into bottlenecks.
  Why has such a large chunk size been chosen?
    Because in most workloads, reads and writes are sequential. A large chunk size means less frequent interaction with the master. It also reduces the metadata held on the master, allowing it to fit in memory.

-------

Master metadata

Held entirely in main memory. Limited per-chunk and per-file information.

Organized as (1) a lookup table mapping path names to chunks, and (2) a separate map from chunks to replicas. (A small sketch of these two maps follows below.)

Updates to (1) are persisted in an /operation log/ atomically (locally + on remote replicas) at each update, before returning to the client. The operation log is periodically checkpointed in the background; only the portion of the log after the last checkpoint is relevant, and the rest is garbage-collected.

(2) is never persisted. Instead, periodic communication between the master and the chunkservers keeps it up to date.
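The two maps can be pictured with a minimal Python sketch. The names (Master, locate) and the dict-based layout are invented for illustration; the real master uses far more compact, prefix-compressed structures.

    # A toy model of the master's two in-memory maps:
    #   (1) full pathname -> ordered list of chunk handles
    #   (2) chunk handle  -> addresses of chunkservers holding a replica
    # All identifiers here are illustrative, not GFS's actual data structures.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

    class Master:
        def __init__(self):
            # (1) persisted via the operation log; here just a dict
            self.files = {}        # e.g. "/logs/web-00" -> ["c17", "c18"]
            # (2) never persisted; rebuilt from chunkserver heartbeats
            self.replicas = {}     # e.g. "c17" -> ["cs-a", "cs-b", "cs-c"]

        def locate(self, path, offset):
            """Return (chunk handle, replica addresses) for a byte offset in a file.

            This mirrors what a client asks the master on open/read: which
            chunk covers this offset, and which chunkservers hold it.
            """
            chunk_index = offset // CHUNK_SIZE
            handle = self.files[path][chunk_index]
            return handle, self.replicas[handle]

    if __name__ == "__main__":
        m = Master()
        m.files["/logs/web-00"] = ["c17", "c18"]
        m.replicas["c17"] = ["cs-a", "cs-b", "cs-c"]
        m.replicas["c18"] = ["cs-b", "cs-d", "cs-e"]
        # A read at offset 100 MB falls into the second 64 MB chunk.
        print(m.locate("/logs/web-00", 100 * 1024 * 1024))
        # -> ('c18', ['cs-b', 'cs-d', 'cs-e'])

The point of locate() is that a client turns any (path, byte offset) pair into a chunk handle with one integer division, so a single round trip to the master covers a whole 64 MB region of the file.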
GFS basically assumes that any chunkserver may fail, drop any of its chunks, or leave or join at any time. Data is lost only if all replicas of a chunk fail.

Suppose the master fails. How does the system recover?

Organization of (1): There is no per-directory metadata as in an ordinary FS. Full pathnames are mapped directly to metadata with a lookup table. Each distinct path prefix has a read lock and a write lock, used for snapshotting.

--------

Consistency model

Relaxed consistency, to allow for efficiency in the common workload (MapReduce). Files typically consist of sequences of records.

GFS primitively supports two kinds of mutations:

(Concurrent) append of records. Add a record somewhere after the logical end of the file and return its offset. Upon success, it is guaranteed that the record is written atomically /at the same offset/ on all replicas at least once (some replicas may have more than one copy, or may have padding gaps).

(Concurrent) writes on existing chunks. A client breaks a large or multi-chunk write into several single-chunk writes. These may get interleaved with the writes of other clients, in different orders on different chunks. So there isn't any strong consistency w.r.t. top-level writes.

------------

Mutation (write) protocol

At any time, every chunk has one /primary/ replica, designated by the master. This replica is said to hold the (write) /lease/ on the chunk. (Concurrent) writes on a chunk go through its primary, which establishes a total order on them.

Write protocol: Update(chunk, offset, data):
  1. Client asks the master for all replicas of the chunk and which one is the primary.
  2. Client sends the data to all replicas.
  3. Client sends the update request to the primary. The primary collects such update requests from all concurrent clients, totally orders them (any way it likes), and forwards the order to the replicas.
  4. All replicas (incl. the primary) apply the changes in the order created in the previous step. They respond to the primary as each update is applied.
  5. The primary responds to the client once all replicas confirm its update. It responds with failure if any replica fails or does not respond (the client can retry).

Append protocol: Append(file, record) --> returns the offset at which the record is appended on all replicas.
  Same as writing to the last chunk of the file if it has space. If there is not enough space, pad to the end of that chunk, add a new chunk (a metadata operation, so the master is involved), and try again on the new chunk.
  A failure from any replica causes a top-level failure, but other replicas may already have written the record. Hence, a retry will duplicate the record on those replicas. A success with a returned offset means that the record exists on all replicas at that offset.
  (A toy sketch of the append retry path appears after the snapshotting section below.)

Questions:
  What is the exact consistency guarantee?
  How do readers deal with duplicate records and inconsistent data in files?
  What happens if the primary that holds the write lease fails?
  Why is it necessary to totally order concurrent updates on each chunk?
  Why is data sent directly to all replicas, when requests are routed through the primary?
  How does GFS deal with very long writes, or writes that span chunks?

-----------

Snapshotting

Create a frozen copy of a part of the file system (a file or a directory). Usually for backup, or for working on a consistent dataset.

Protocol using efficient copy-on-write:
- The master revokes all leases on the chunks belonging to the part to be snapshotted.
- It duplicates the metadata in its b-tree, pointing at the same chunks as the original, and then re-issues leases.
- Upon a subsequent write request for a chunk, it notices that the chunk's reference count (no. of incoming pointers) is more than 1. It duplicates the chunk (on the same servers as the original) and updates the b-tree.
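The copy-on-write step above can be illustrated with a small sketch, assuming a per-chunk reference count. The names (CowMaster, chunk_for_write) are invented here, and the dict-based namespace stands in for the master's real b-tree.

    # Toy copy-on-write snapshot: duplicate metadata now, duplicate chunks
    # lazily on the first write that touches a shared chunk.
    import itertools

    class CowMaster:
        def __init__(self):
            self.files = {}       # path -> list of chunk handles
            self.refcount = {}    # chunk handle -> number of files pointing at it
            self._ids = itertools.count()

        def create(self, path, num_chunks):
            handles = [f"c{next(self._ids)}" for _ in range(num_chunks)]
            self.files[path] = handles
            for h in handles:
                self.refcount[h] = 1

        def snapshot(self, src, dst):
            # (Lease revocation on src's chunks would happen first.)
            # Duplicate only the metadata: dst points at the same chunks as src.
            self.files[dst] = list(self.files[src])
            for h in self.files[dst]:
                self.refcount[h] += 1

        def chunk_for_write(self, path, index):
            """On the first write after a snapshot: if the chunk is shared,
            copy it (on the same chunkservers) and point this file at the copy."""
            h = self.files[path][index]
            if self.refcount[h] > 1:
                new_h = f"c{next(self._ids)}"   # chunkservers copy the data locally
                self.refcount[h] -= 1
                self.refcount[new_h] = 1
                self.files[path][index] = new_h
                h = new_h
            return h

    if __name__ == "__main__":
        m = CowMaster()
        m.create("/db/table", 2)
        m.snapshot("/db/table", "/db/table.snap")
        print(m.files)                     # both paths share c0, c1
        m.chunk_for_write("/db/table", 0)  # first write after the snapshot
        print(m.files)                     # /db/table now has its own copy of chunk 0

The snapshot itself touches only metadata; the expensive chunk copies happen later, and only for chunks that are actually written.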
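Returning to the record append protocol described above: the following toy sketch (invented Replica/Primary classes, no real RPCs) shows why a successful append means "at least once, at the same offset, on every replica", and why failed attempts leave duplicates or gaps that readers must tolerate.

    # Toy model of record append with one transient replica failure.
    class Replica:
        def __init__(self, name, fail_once=False):
            self.name = name
            self.data = []              # list of records; the index plays the role of offset
            self.fail_once = fail_once

        def apply(self, offset, record):
            if self.fail_once:          # simulate one transient failure
                self.fail_once = False
                return False
            while len(self.data) < offset:   # pad so the record lands at the chosen offset
                self.data.append(None)
            self.data.append(record)
            return True

    class Primary:
        def __init__(self, replicas):
            self.replicas = replicas
            self.next_offset = 0

        def append_record(self, record):
            # The primary picks the offset and forwards it to every replica.
            offset = self.next_offset
            self.next_offset += 1
            results = [r.apply(offset, record) for r in self.replicas]
            return all(results), offset

    def client_append(primary, record, max_retries=3):
        for _ in range(max_retries):
            ok, offset = primary.append_record(record)
            if ok:
                return offset           # record now exists at this offset on ALL replicas
        raise RuntimeError("append failed")

    if __name__ == "__main__":
        reps = [Replica("cs-a"), Replica("cs-b", fail_once=True), Replica("cs-c")]
        off = client_append(Primary(reps), "record-1")
        print("returned offset:", off)            # 1 (the second attempt succeeded)
        print({r.name: r.data for r in reps})
        # cs-a and cs-c hold record-1 twice; cs-b has a gap (None) at offset 0.

The returned offset refers only to the successful attempt; the duplicates and padding left by the failed attempt are exactly the "defined but interspersed with inconsistent regions" state that GFS readers must handle (e.g. by record checksums or unique IDs at the application level).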
------------

Replica management

Done entirely by the master. Three primary goals: (1) spread disk utilization evenly, (2) spread the replicas of a chunk across racks (to cope with network and switch failures), (3) do not concentrate active chunks on the same machine.

The master takes these into account when creating a chunk, when re-replicating data (after some chunkserver leaves or dies), and when rebalancing (usually for disk space or request load).

-------------

Garbage collection

Eager => reclaim chunkserver storage on file delete
Periodic => reclaim periodically
Lazy => reclaim only when needed

Local file systems are typically eager. GFS is periodic.

The master scans its metadata periodically, identifies orphaned chunks (those not pointed to by any file), and erases their metadata. It conveys this to the chunkservers in its regular heartbeat responses; they can then reclaim the local space.

A deleted file is at first only renamed. Its metadata is removed (and its chunks orphaned) in the next periodic scan that runs 3 days after deletion.

It is possible for applications to request eager collection on specific subsets of the namespace. Needed when storage space is tight.

----------------

Stale replica detection

What to do if a replica is down during an update?

The master keeps a version number for each chunk, incremented whenever a lease is granted on it. Chunkservers persist the version numbers of their chunks (before applying the update). If a replica is down during an update, its version number does not get updated. When it comes back up, the master notices this from the chunk list the chunkserver reports. (A small sketch appears at the end of these notes.)

--------------

Handling data corruption

If data is corrupted at a replica, a client may read the wrong data. To guard against this, GFS uses checksums (on 64 KB blocks) to detect corruption. In case of corruption, the client is redirected to another replica and the chunk is re-replicated.
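The stale replica check described above can be sketched as follows, assuming the master bumps a chunk's version whenever it grants a lease and compares it against what a rejoining chunkserver reports. All names here are invented for illustration.

    # Toy version-number bookkeeping for stale replica detection.
    class VersionMaster:
        def __init__(self):
            self.version = {}          # chunk handle -> current version

        def grant_lease(self, handle, live_replicas):
            """Bump the version and tell the live replicas before the mutation."""
            self.version[handle] = self.version.get(handle, 0) + 1
            for r in live_replicas:
                r.versions[handle] = self.version[handle]

        def check_report(self, chunkserver):
            """Called when a chunkserver reports its chunk list (e.g. on restart)."""
            stale = [h for h, v in chunkserver.versions.items()
                     if v < self.version.get(h, 0)]
            return stale               # stale replicas are later garbage-collected

    class Chunkserver:
        def __init__(self, name):
            self.name = name
            self.versions = {}         # chunk handle -> version it holds (persisted)

    if __name__ == "__main__":
        master = VersionMaster()
        a, b, c = Chunkserver("cs-a"), Chunkserver("cs-b"), Chunkserver("cs-c")
        master.grant_lease("c42", [a, b, c])   # version 1 everywhere
        # cs-c is down during the next mutation:
        master.grant_lease("c42", [a, b])      # version 2 on cs-a and cs-b only
        print(master.check_report(c))          # ['c42']  -> cs-c's copy is stale

Because the version bump happens before any data is written, a replica that misses the mutation necessarily reports an old version when it rejoins.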
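A small sketch of the per-block checksum idea. The 64 KB block size is from the paper; the use of zlib.crc32 here is an illustrative choice, not GFS's actual checksum function.

    # Per-64KB-block checksums: verify only the blocks that a read overlaps.
    import zlib

    BLOCK = 64 * 1024   # 64 KB

    def checksum_blocks(data: bytes):
        """Return one CRC per 64 KB block of a chunk's data."""
        return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

    def verify_range(data: bytes, checksums, offset: int, length: int) -> bool:
        """Before serving a read, re-check the blocks that overlap the range."""
        first = offset // BLOCK
        last = (offset + length - 1) // BLOCK
        return all(zlib.crc32(data[i * BLOCK:(i + 1) * BLOCK]) == checksums[i]
                   for i in range(first, last + 1))

    if __name__ == "__main__":
        chunk = bytes(200 * 1024)                  # 200 KB of zeros -> 4 blocks
        sums = checksum_blocks(chunk)
        print(verify_range(chunk, sums, 70 * 1024, 10))               # True
        corrupted = chunk[:80 * 1024] + b"\x01" + chunk[80 * 1024 + 1:]
        print(verify_range(corrupted, sums, 70 * 1024, 10 * 1024))    # False -> read another replica

Verification is local to each chunkserver and only covers the blocks a read touches, so it adds little overhead to large sequential reads while still catching corruption before it reaches a client.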