Lecture 5: Distributed Programming: MapReduce and Pig

Outline:
  Why, designs, challenges
  MapReduce (Google)
  Pig (Yahoo)

Since writing a distributed application poses a number of challenges beyond sequential programming, it would be nice if there were ways to simplify it. Today we study a system for making parallel applications easier to write on distributed computer systems, MapReduce, and a programming framework with a high-level language, Pig, that builds on top of MapReduce.

This lecture is also a case study of:
  Use of distributed computer systems
  Distributed computing challenges: programming, fault tolerance, consistency, concurrency, etc.

One use of distributed computer systems is running large computations. Typically the application is partitioned into computations that run in parallel and sometimes communicate.

Applications:
  Scientific applications
  Large-data processing apps (indexing, search, ...)
  Etc.

Designs:
  Cluster computing: PCs connected by a high-speed network
  Grid computing: a high-speed network of supercomputers
  Volunteer computing: aggregates personal PCs on the Internet

Challenges:
  Parallelize the application --- how to handle shared state?
    The network is typically a bottleneck
    Embarrassingly parallel (run the same app for different inputs, users, ...)
    Coarse-grained (low communication relative to computation)
    Fine-grained (typically requires a parallel computer)
  How to write the application?
    Explicit messages (RPC, MPI)
    Shared memory
  Balance the application's computations across computers
    Statically (e.g., doable when the designer knows how much work there is)
    Dynamically
  Handle failures of nodes during the computation
    With 1,000 machines, is a failure likely in a 1-hour period?
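The failure question above has a back-of-the-envelope answer. Under the (assumed, purely illustrative) model that each machine fails independently with a mean time between failures of one year:

```python
# Back-of-the-envelope estimate, assuming each machine fails
# independently with an MTBF of one year (an illustrative number,
# not one from the lecture).
machines = 1000
mtbf_hours = 365 * 24              # 8760 hours
p_one = 1 / mtbf_hours             # per-machine failure probability in one hour

# P(at least one of the 1,000 machines fails during the hour)
p_any = 1 - (1 - p_one) ** machines
print(f"{p_any:.1%}")              # prints "10.8%"
```

So under this model roughly one run in ten loses a machine, which is why the framework must treat failures as routine rather than exceptional.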
    Often easier than with, say, banking applications (or the YFS lock server):
      many computations have no "harmful" side effects and have clear commit points
  Schedule several applications that want to share the infrastructure
    Time-sharing
    Strict partitioning

Brief examples of distributed applications:
  Data conversion
  Grep
  --> similar requirements for software infrastructure

MapReduce

Design:
  Partition the large input data set into M splits
    a split is typically a 64-MByte part of the input
  Run map on each split; each map produces R local partitions using a partition function
  Run reduce on each intermediate partition, producing the R output files

Programmer interface:
  map(string key, string value)
  reduce(string key, iterator values)

Example: word count
  split the file into big splits
  a map computation takes one split as input
    and produces a list of words as output
  the output is partitioned into R partitions
  a reduce computation takes a partition as input
    and outputs the number of occurrences of each word

Example: distinct set of words
  similar to word count, but instead of counting words in the reduce phase, output each word once

Implementation:
  the caller invokes the mapreduce library
  the library creates worker processes, which run the map or reduce computations
  the library creates one master process
    the master assigns map and reduce tasks to workers
    the master is the communication channel between map and reduce workers
    the master handles failures of workers
  map workers communicate the locations of their R partitions to the master
  a reduce worker asks the master for the locations of its partition
    sorts the input keys
    runs the reduce operation
  when all workers are finished, the master returns the result to the caller

Fault tolerance:
  when a worker fails, the master resets all of its map and reduce tasks to idle
    map tasks need to be re-executed because a map's output is stored locally and becomes unavailable
    when a map task is restarted, inform all reduce tasks so they read that input from the new worker

Semantics:
  if the user's map and reduce functions are deterministic, then the output is the same as that of a non-faulty sequential run of the program
  when a reduce completes, the worker renames its tmp output file
    atomically --> this rename is the reduce's commit point!

Related: Dryad (Microsoft)

Pig

Motivation:
  Applications for data processing and analysis often use a common set of basic operations, such as sorting, filtering, grouping, or joining. The MapReduce programming framework provides a low-level interface (i.e., Map and Reduce functions that may perform arbitrary operations) that forces programmers to implement these operations manually (or re-use existing code fragments) rather than focus on the high-level functionality. The Pig programming framework is tailored to data processing and analysis tasks and provides a high-level language for these operations.

Design:
  Pig Latin: a high-level language featuring common operations for data processing and analysis
    inspired by SQL
    allows embedding user-specified code
  Pig programs are compiled to multi-stage MapReduce jobs
    --> can be executed on an existing MapReduce cluster
    --> faults are handled by the MapReduce framework
  Example: filtering data and removing duplicates (distinct set)

Optimizations:
  Program rewriting --> leverages techniques from SQL query optimization
  MapReduce-specific optimizations
  Example: simplify the previous example

Related: DryadLINQ (Microsoft): a high-level language on top of Dryad
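The word-count and distinct-set examples from the MapReduce section can be sketched in a few lines. This is a toy single-process model of the library, not the real API: the names (`mapreduce`, `wc_map`, `wc_reduce`) are made up for illustration, and the shuffle is simulated by hashing each intermediate key into one of R partitions.

```python
from collections import defaultdict

def wc_map(key, value):
    # key: split name (unused); value: the text of the split
    for word in value.split():
        yield (word, 1)

def wc_reduce(key, values):
    # values: iterator over all counts emitted for this word
    yield (key, sum(values))

def mapreduce(splits, map_fn, reduce_fn, R=2):
    # Map phase: each split's output is hashed into R partitions.
    partitions = [defaultdict(list) for _ in range(R)]
    for name, text in splits.items():
        for k, v in map_fn(name, text):
            partitions[hash(k) % R][k].append(v)
    # Reduce phase: each partition is processed in sorted key order,
    # mirroring the sort the reduce worker performs on its input keys.
    output = {}
    for part in partitions:
        for k in sorted(part):
            for rk, rv in reduce_fn(k, iter(part[k])):
                output[rk] = rv
    return output

splits = {"split0": "a rose is a rose", "split1": "is a rose"}
counts = mapreduce(splits, wc_map, wc_reduce)
# counts maps word -> total: a: 3, rose: 3, is: 2
```

The distinct-set variant changes only the reduce function (emit each key once, ignoring the values), which is also essentially the job that a Pig DISTINCT compiles to, with any FILTER operations folded into the map phase.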