Lecture 5: Distributed Programming: MapReduce and Pig

Outline:
  Why, designs, challenges
  MapReduce (Google)
  Pig (Yahoo)

Since writing a distributed application poses a number of challenges beyond sequential programming, it would be nice if there were ways to simplify it. Today we study a system for making parallel applications easier to write on distributed computer systems, MapReduce, and a programming framework with a high-level language, Pig, that builds on top of MapReduce.

This lecture is also a case study of:
  Use of distributed computer systems
  Distributed computing challenges: programming, fault tolerance, consistency, concurrency, etc.

One use of distributed computer systems is running large computations. Typically the application is partitioned into computations that run in parallel and sometimes communicate.

Applications:
  Scientific applications
  Large-data processing apps (indexing, search, ...)
  Etc.

Designs:
  Cluster computing: PCs connected by a high-speed network
  Grid computing: a high-speed network of supercomputers
  Volunteer computing: aggregates personal PCs on the Internet

Challenges:
  Parallelize the application --- how to handle shared state?
    The network is typically a bottleneck
    Embarrassingly parallel (run the same app for different inputs, users, ...)
    Coarse-grained (low communication relative to computation)
    Fine-grained (typically requires a parallel computer)
  How to write the application?
    Explicit messages (RPC, MPI)
    Shared memory
  Balance the application's computations across computers
    Statically (e.g., doable when the designer knows how much work there is)
    Dynamically
  Handle failures of nodes during the computation
    With 1,000 machines, is a failure likely in a 1-hour period?
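The failure question above has a back-of-the-envelope answer. Under the (assumed, purely illustrative) model that each machine fails independently with a mean time between failures of one year:

```python
# Back-of-the-envelope estimate, assuming each machine fails
# independently with an MTBF of one year (an illustrative number,
# not one from the lecture).
machines = 1000
mtbf_hours = 365 * 24              # 8760 hours
p_one = 1 / mtbf_hours             # per-machine failure probability in one hour

# P(at least one of the 1,000 machines fails during the hour)
p_any = 1 - (1 - p_one) ** machines
print(f"{p_any:.1%}")              # prints "10.8%"
```

So under this model roughly one run in ten loses a machine, which is why the framework must treat failures as routine rather than exceptional.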
    Often easier than with, say, banking applications (or the YFS lock server):
      many computations have no "harmful" side effects and have clear commit points
  Schedule several applications that want to share the infrastructure
    Time-sharing
    Strict partitioning

Brief examples of distributed applications:
  Data conversion
  Grep
  --> similar requirements for software infrastructure

MapReduce

Design:
  Partition the large input data set into M splits
    a split is typically a 64-MByte part of the input
  Run map on each split; each map produces R local partitions using a partition function
  Run reduce on each intermediate partition, producing the R output files

Programmer interface:
  map(string key, string value)
  reduce(string key, iterator values)

Example: word count
  split the file into big splits
  a map computation takes one split as input
    and produces a list of words as output
  the output is partitioned into R partitions
  a reduce computation takes a partition as input
    and outputs the number of occurrences of each word

Example: distinct set of words
  similar to word count, but instead of counting words in the reduce phase, output each word once

Implementation:
  the caller invokes the mapreduce library
  the library creates worker processes, which run the map or reduce computations
  the library creates one master process
    the master assigns map and reduce tasks to workers
    the master is the communication channel between map and reduce workers
    the master handles failures of workers
  map workers communicate the locations of their R partitions to the master
  a reduce worker asks the master for the locations of its partition
    sorts the input keys
    runs the reduce operation
  when all workers are finished, the master returns the result to the caller

Fault tolerance:
  when a worker fails, the master resets all of its map and reduce tasks to idle
    map tasks need to be re-executed because a map's output is stored locally and becomes unavailable
    when a map task is restarted, inform all reduce tasks so they read that input from the new worker

Semantics:
  if the user's map and reduce functions are deterministic, then the output is the same as that of a non-faulty sequential run of the program
  when a reduce completes, the worker renames its tmp output file
    atomically --> this rename is the reduce's commit point!

Related: Dryad (Microsoft)

Pig

Motivation:
  Applications for data processing and analysis often use a common set of basic operations, such as sorting, filtering, grouping, or joining. The MapReduce programming framework provides a low-level interface (i.e., Map and Reduce functions that may perform arbitrary operations) that forces programmers to implement these operations manually (or re-use existing code fragments) rather than focus on the high-level functionality. The Pig programming framework is tailored to data processing and analysis tasks and provides a high-level language for these operations.

Design:
  Pig Latin: a high-level language featuring common operations for data processing and analysis
    inspired by SQL
    allows embedding user-specified code
  Pig programs are compiled to multi-stage MapReduce jobs
    --> can be executed on an existing MapReduce cluster
    --> faults are handled by the MapReduce framework
  Example: filtering data and removing duplicates (distinct set)

Optimizations:
  Program rewriting --> leverages techniques from SQL query optimization
  MapReduce-specific optimizations
  Example: simplify the previous example

Related: DryadLINQ (Microsoft): a high-level language on top of Dryad
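The word-count and distinct-set examples from the MapReduce section can be sketched in a few lines. This is a toy single-process model of the library, not the real API: the names (`mapreduce`, `wc_map`, `wc_reduce`) are made up for illustration, and the shuffle is simulated by hashing each intermediate key into one of R partitions.

```python
from collections import defaultdict

def wc_map(key, value):
    # key: split name (unused); value: the text of the split
    for word in value.split():
        yield (word, 1)

def wc_reduce(key, values):
    # values: iterator over all counts emitted for this word
    yield (key, sum(values))

def mapreduce(splits, map_fn, reduce_fn, R=2):
    # Map phase: each split's output is hashed into R partitions.
    partitions = [defaultdict(list) for _ in range(R)]
    for name, text in splits.items():
        for k, v in map_fn(name, text):
            partitions[hash(k) % R][k].append(v)
    # Reduce phase: each partition is processed in sorted key order,
    # mirroring the sort the reduce worker performs on its input keys.
    output = {}
    for part in partitions:
        for k in sorted(part):
            for rk, rv in reduce_fn(k, iter(part[k])):
                output[rk] = rv
    return output

splits = {"split0": "a rose is a rose", "split1": "is a rose"}
counts = mapreduce(splits, wc_map, wc_reduce)
# counts maps word -> total: a: 3, rose: 3, is: 2
```

The distinct-set variant changes only the reduce function (emit each key once, ignoring the values), which is also essentially the job that a Pig DISTINCT compiles to, with any FILTER operations folded into the map phase.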