MapReduce + SQL Using Open Source Software and Tools

Christophe Bisciglia (Cloudera, Inc.), Aaron Kimball (Cloudera, Inc.)

Speakers:
Christophe Bisciglia, Founder & Chief Strategy Officer, Cloudera (previously led Google’s Academic Cloud Computing Initiative)
Jeff Hammerbacher, Founder & Chief Scientist, Cloudera (previously led Facebook’s Data Team)
Tom White, Founding Engineer, Cloudera (Hadoop Committer)
Aaron Kimball, Founding Engineer, Cloudera

Abstract:
Recently, there has been a lot of buzz around MapReduce and the Apache Hadoop project. Even more recently, we have seen proprietary SQL database systems add support for MapReduce. This is good news for data analysis, but it poses a challenge if you prefer open source solutions.

In this tutorial, we will provide both the background knowledge and the practical experience necessary to combine these models and get more out of your data. We will use MySQL and Hadoop preloaded with interesting data, and make the complete system available to participants in the cloud. We will focus on how to conduct analysis that leverages both models and go over, in detail, the glue necessary to make this work. A common use case will be extracting data from MySQL, using MapReduce to conduct analysis you can’t do with SQL, and reloading the results into a MySQL database.

Objectives:
After this tutorial, participants should:

  • Understand the MapReduce programming model
  • Understand which data analysis problems are most appropriate for SQL and which for MapReduce
  • Understand how to decompose problems such that each piece can be solved using the optimal model
  • Understand how to connect open source MapReduce and SQL systems
  • Have hands-on experience using Hadoop and MySQL together
  • Have the skills and experience necessary to build a moderately complex data processing pipeline using both SQL and MapReduce

Description:
This tutorial will run three hours, with time split roughly evenly between instructional and practical components. The format will be wide open, so participants are free to interrupt, ask questions, and suggest going into more depth on areas of specific interest. We will assume basic-to-intermediate knowledge of MySQL, and will most heavily target participants who are having trouble scaling with their data.

The instructional component will include:

  • Introduction to MapReduce
    • MapReduce Basics
    • Why MapReduce?
    • Common Algorithms in MapReduce
    • Rethinking in MapReduce
  • Introduction to Hadoop
    • Hadoop Basics
    • Hadoop MapReduce API
  • Working with MySQL and Hadoop
    • Hadoop and JDBC interfaces
    • Extracting data with queries
    • Extracting data with dumps
    • “Mapping” over result sets
    • Leveraging intermediate data
    • “Reducing” back into the database
  • Tips and Tricks for Data Processing Pipelines
    • Scripting database dumps to Hadoop’s Distributed File System (HDFS)
    • Chaining MapReduce jobs
    • Verifying Results, Reliability, etc.
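
As a rough illustration of the "MapReduce Basics" and "Chaining MapReduce jobs" items above, here is a minimal in-memory sketch in Python. It is not Hadoop code and does not use the Hadoop MapReduce API; the helper `run_mapreduce` and the sample input are hypothetical names chosen for illustration. It only shows the shape of the model: a map function emits key/value pairs, values are grouped by key, and a reduce function processes each group. The second job consumes the output of the first.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one in-memory pass: map, group by key (the "shuffle"), reduce.

    Hypothetical helper for illustration; a real cluster distributes
    each of these phases across many machines.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):  # sorted only to make the output deterministic
        results.extend(reducer(key, groups[key]))
    return results


# Job 1: classic word count.
def wc_map(line):
    for word in line.lower().split():
        yield word, 1


def wc_reduce(word, counts):
    yield word, sum(counts)


# Job 2: takes job 1's (word, count) pairs and groups words by count.
def invert_map(pair):
    word, count = pair
    yield count, word


def invert_reduce(count, words):
    yield count, sorted(words)


# Made-up sample input.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

word_counts = run_mapreduce(lines, wc_map, wc_reduce)
words_by_count = run_mapreduce(word_counts, invert_map, invert_reduce)

print(word_counts)
print(words_by_count)
```

Chaining here is just passing one job's output list to the next job; on a real cluster the intermediate results would typically be written to and read from the distributed file system instead.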

For the practical component, we will provide access to a large Hadoop (for MapReduce) installation in the cloud, as well as MySQL instances preloaded with interesting data. Users will get to write code that extracts data from MySQL (using both queries and dumps), uses MapReduce to analyze that data in greater depth, and dumps the results back into MySQL so they are available to existing systems. We will walk the users through a general data processing pipeline, step by step, but will focus on supporting them in conducting their own analysis.
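
The extract, analyze, and reload pipeline described above can be sketched end to end in a few lines. This is a sketch under loud assumptions: Python's built-in `sqlite3` module stands in for MySQL so the example runs anywhere, the map and reduce steps run in-process rather than on Hadoop, and the tables `page_views` and `url_stats` and their contents are invented for illustration. The aggregate is deliberately trivial (plain SQL could compute it); the point is only the plumbing between the two models.

```python
import sqlite3
from collections import defaultdict

# Assumption: an in-memory SQLite database stands in for a MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [(1, "/home"), (1, "/docs"), (2, "/home"),
     (3, "/home"), (3, "/docs"), (3, "/blog")],
)

# 1. Extract data with a query.
rows = conn.execute("SELECT user_id, url FROM page_views").fetchall()


# 2. "Map" over the result set: emit (url, user_id) for each row.
def map_row(row):
    user_id, url = row
    yield url, user_id


# Group mapped values by key (the "shuffle" step).
grouped = defaultdict(list)
for row in rows:
    for key, value in map_row(row):
        grouped[key].append(value)


# 3. "Reduce": count distinct users per URL.
def reduce_url(url, user_ids):
    return url, len(set(user_ids))


results = [reduce_url(url, ids) for url, ids in sorted(grouped.items())]

# 4. Load the results back into the database for existing systems to read.
conn.execute(
    "CREATE TABLE url_stats (url TEXT PRIMARY KEY, distinct_users INTEGER)"
)
conn.executemany("INSERT INTO url_stats VALUES (?, ?)", results)
conn.commit()

loaded = conn.execute(
    "SELECT url, distinct_users FROM url_stats ORDER BY url"
).fetchall()
print(loaded)
```

With a real MySQL and Hadoop setup, step 1 would be a query or a dump copied to HDFS, steps 2 and 3 would be a MapReduce job, and step 4 would write the job output back through a database connection.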

Suggested Tracks:
In decreasing order of preference.

  1. Cloud Computing
  2. Data Warehousing and Business Intelligence
  3. Architecture and Technology
  4. Business and Case Studies

Christophe Bisciglia

Cloudera, Inc.

Christophe Bisciglia joins Cloudera from Google, where he created and managed their Academic Cloud Computing Initiative. In 2007, he began working with the University of Washington to teach students about Google’s core data management and processing technologies – MapReduce and GFS. This quickly brought Hadoop into the curriculum, and has since resulted in an extensive partnership with the National Science Foundation (NSF), which makes Google-hosted Hadoop clusters available for research and education worldwide. Beyond his work with Hadoop, he holds patents related to search quality and personalization, and spent a year working in Shanghai. Christophe earned his degree, and remains a visiting scientist, at the University of Washington.

Aaron Kimball

Cloudera, Inc.

Aaron Kimball has been working with Hadoop since early 2007. He has worked with the NSF and several universities, nationally and internationally, to advance education in the field of large-scale data-intensive computing. He helped create and deliver academic course materials first used at the University of Washington, which were later adopted by many other academic institutions, as well as Hadoop training materials used by several industry partners. Aaron has also worked as an independent consultant focusing on Hadoop and Amazon EC2-based systems. Aaron holds a B.S. in Computer Science from Cornell University, and an M.S. in Computer Science and Engineering from the University of Washington.

Co-presented By:

O'Reilly Media and MySQL/Sun Microsystems