MapReduce Algorithm Design

Jimmy Lin - jimmylin [at] umd.edu

Presenter

Jimmy Lin is an associate professor in the iSchool at the University of Maryland, with appointments in the Department of Computer Science and the Institute for Advanced Computer Studies. He graduated with a Ph.D. in computer science from MIT in 2004. Lin’s research lies at the intersection of information retrieval and natural language processing, and his current work focuses on massively-distributed data analytics in cluster-based environments.

Recently, Lin just completed an extended sabbatical at Twitter, where from 2010- 2012 he worked with the analytics infrastructure group responsible for managing the company’s petabyte-scale Hadoop-based data warehouse spanning several thousand nodes across multiple datacenters. He built infrastructure to support the insight generation activities of data scientists within the organization. Together with colleagues he built core machine-learning libraries used throughout Twitter today and revamped part of the logging pipeline that ingests over 100 terabytes per day.

In 2010, Lin published a textbook titled “Data-Intensive Text Processing with MapReduce”, co-authored with a former student. The book is freely available at http://mapreduce.me/ and has been adopted in dozens of MapReduce courses around the world. He is working on a revised second edition, from which materials for this proposed tutorial will draw.

Duration and Sessions

The proposed duration of the tutorial is three hours. Previous experience suggests that this is the right length: a shorter tutorial feels too rushed, while a longer tutorial is exhausting, for both the audience and the presenter.

Topic and Description

This tutorial will focus on designing efficient algorithms in MapReduce, building in part on the tutorial titled “Designing Good Algorithms for MapReduce and Beyond” by Afrati et al. at SoCC 2012.

The ability to horizontally scale MapReduce clusters to thousands of nodes makes it very easy to write terribly inefficient algorithms. There are also many applications where a brute-force approach is simply impractical.

A tentative list of topics that will be covered include:
• A very brief refresher of the MapReduce programming model.

  • General design patterns for controlling execution flow in MapReduce, illustrated with common tasks such as efficiently estimating very large frequency distributions.
  • Case studies of domain-specific algorithms (e.g., inverted indexing, relational join algorithms, EM algorithms)
  • Understanding various aspects of the execution framework that contribute to efficiency: data shuffling, intermediate aggregation, skewed partitions, iteration costs, etc.
  • Strategies for overcoming the above issues, illustrated with case studies.

Slides: http://www.umiacs.umd.edu/~jimmylin/publications/WWW2013-MapReduce-tutorial-slides.pdf