Sept. 22, 2010 SDF Hadoop Scaling

On September 22, 2010, at LinkedIn in Mountain View, the SDForum SAM SIG hosted Ken Krugler of Bixo Labs, who presented “Thinking at Scale with Hadoop.” In the past, Krugler has worked with Nutch, Lucene and Solr. Now he uses Hadoop, Cascading and Bixo, an open source project to speed up large-scale web mining and analytics.

When working with Hadoop, traditional SQL developers should think in the map-reduce paradigm of fields/operations/pipes. The logical architecture is a computing cluster providing storage and execution. Execution is divide and conquer, in three steps: map, shuffle and reduce. Data moves through these steps as key-value pairs, where a value is a set of fields. The pairs are exchanged between the user-defined map and reduce algorithms. The map translates the input to keys and values. The system groups each unique key with all its values. The reducer translates the values of each unique key to new keys and values. Because map and reduce functions only care about the current key and value, any number of mappers or reducers can run on any number of nodes.
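To make the three steps concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, run_job) and the word-count example are assumptions chosen for illustration, and a real cluster would spread the map and reduce calls across many nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user-defined mapper to each input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group each unique key with all of its values (the framework's job, not the user's)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user-defined reducer to each key and its grouped values."""
    for key in sorted(groups):
        yield from reducer(key, groups[key])


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in order, in one process."""
    return list(reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer))


# Hypothetical user-defined functions: count how often each word appears.
def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    yield word, sum(counts)


lines = ["the quick fox", "the lazy dog", "the fox"]
result = run_job(lines, word_mapper, count_reducer)
print(result)
```

Note that word_mapper sees only one line and count_reducer sees only one key with its values. That isolation is what lets the framework run many copies of each in parallel.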

Complex workflows require multiple map-reduce jobs. Errors can occur when connecting and synchronizing data between them. Because the intent of the workflow is buried in low-level data handling, MR optimization also becomes harder. There are ways to deal with this complexity. Krugler recommends modeling the problem as a workflow with operations on records. Cascading has tuples, pipes and operations; FlumeJava has classes and deferred execution. Pig and Hive treat the data like a big database and are good for query-centric problems. Datameer (DAS) and BigSheets treat it like a big spreadsheet and are good for easier analytics.
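The idea of a workflow with operations on records can be sketched in a few lines of Python. This is a toy model only: the Pipe class, its each, group_by and run methods, and the sample page records are all hypothetical, and none of it is the Cascading or FlumeJava API. It only shows the shape of the approach: records with named fields flow through a chain of operations, and nothing executes until the whole flow is run.

```python
from collections import defaultdict


class Pipe:
    """A hypothetical chain of operations over records (dicts of named fields)."""

    def __init__(self, operations=()):
        self.operations = tuple(operations)

    def each(self, fn):
        """Add a per-record operation; fn returns a new record, or None to drop it."""
        def op(records):
            for record in records:
                out = fn(record)
                if out is not None:
                    yield out
        return Pipe(self.operations + (op,))

    def group_by(self, field, aggregate):
        """Add a grouping operation; aggregate(key, records) returns one record per group."""
        def op(records):
            groups = defaultdict(list)
            for record in records:
                groups[record[field]].append(record)
            for key in sorted(groups):
                yield aggregate(key, groups[key])
        return Pipe(self.operations + (op,))

    def run(self, records):
        """Execute the assembled operations in order; nothing runs before this call."""
        for op in self.operations:
            records = op(records)
        return list(records)


# Made-up sample data: fetched pages with a domain and an HTTP status.
pages = [
    {"domain": "a.com", "status": 200},
    {"domain": "b.com", "status": 404},
    {"domain": "a.com", "status": 200},
    {"domain": "b.com", "status": 200},
]

# Describe the workflow first: keep successful fetches, then count pages per domain.
flow = (
    Pipe()
    .each(lambda r: r if r["status"] == 200 else None)
    .group_by("domain", lambda domain, rows: {"domain": domain, "pages": len(rows)})
)

summary = flow.run(pages)
print(summary)
```

Because the flow is described before it runs, a real framework in this style can inspect the whole chain and decide how to split it into map-reduce jobs, which is the planning work that hand-written jobs leave to the developer.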

There are experts in Hadoop and experts in SQL database development. It is nearly impossible to find someone who is an expert at both. Instead, find an expert SQL database developer who wants to add some new tools to their toolbox. To get those tools, Krugler recommended the following Hadoop references:

Mailing list:

common-user@hadoop.apache.org

Book:

Tom White’s “Hadoop: The Definitive Guide”

Training:

Scale Unlimited: http://scaleunlimited.com/courses

Cloudera: http://www.cloudera.com/hadoop-training

Cascading: http://www.cascading.org

DAS: http://www.datameer.com

Copyright 2010 DJ Cline All rights reserved.