Tag Archives: Cascading

Sept. 22, 2010 SDF Hadoop Scaling

On September 22, 2010 in Mountain View at LinkedIn, the SDForum SAM SIG hosted Ken Krugler of Bixo Labs presentation “Thinking at Scale with Hadoop.” In the past Krugler has worked with Nutch, Lucene and Solr. Now he uses Bixo, Hadoop, and Cascading, an open source project to speed up large scale web mining and analytics.

When working with Hadoop, traditional SQL developers should think in the map-reduce paradigm of fields/operations/pipes. The logical architecture is a computing cluster providing storage and execution. The execution is divide and conquer. The process steps are map, shuffle and reduce. A value is a set of fields. The key value pairs are exchanged between to the user-defined map and reduce algorithms. The map translates the input to keys and values. The system groups each unique key with all its values. The reducer translates the values of each unique key to new keys and values. Because map and reduce functions only care about current key and value, any number of mappers or reducers can invoke on an arbitrary number on any number of nodes.

Complex workflows require multiple map reduce jobs. Errors can occur when connecting and synchronizing data between them. Not seeing intent from all the low level data can make MR optimization harder. There are ways to deal with this complexity. He thinks you should model the problem as a workflow with operations on records. Cascading has tuples, pipes and operations, FlumeJava has classes and deferred execution. Pig or Hive treat it like a big database and are good for query-centric problems. Datameer (DAS) and BigSheets treat it like a big spreadsheet and are good for easier analytics.

There are experts in Hadoop and experts in SQL database development. It is nearly impossible to find someone who is an expert at both. Find an expert SQL database developer who wants to add some new tools to their toolbox. To get those tools, Krugler recommended the following Hadoop references:

Mailing list:

common-user@hadoop.apache.org

Book:

Tom White’s “Hadoop: The Definitive Guide”

Training:

Scale Unlimited: http://scaleunlimited.com/courses

Cloudera: http://www.cloudera.com/hadoop-training

Cascading: http://www.cascading.org

DAS: http://www.datameer.com

Copyright 2010 DJ Cline All rights reserved.

Aug. 18, 2009 SDF Business Intelligence in the Cloud

SDF logo2009 copyGali Lenin copyGuanlao Arnel copy

On August 18, 2009 in Palo Alto at SAP, SDForum presented “Cutting Edge Business Intelligence in the Cloud” with Lenin Gali of ShareThis. ShareThis has a widget that allows people to share what they find on the web with others on their social network. It doesn’t matter if it is FaceBook, Twitter, MySpace, or LinkedIn. Their clients include Fox Media, UsMagazine, Wired, ESPN, and movies.com. They built their IT on Amazon EC2, Cascading, Hadoop, Hive and MicroStrategy. They use Aster Data for their Data Warehouse. Text from DJCline.com

If you come from a traditional database IT background, I guarantee that you have never seen an operation like this. Cascading is the processing API for Hadoop Clusters. There are pipes, flows, branches and groups. You get event notification, can write scripts and control it at the tuple level. Hive is the data warehouse built on top of Hadoop. It supports non-complex SQL using HQL. You can build a custom map/reduce jobs for complex analytics. You can still make adhoc queries for large data sets. The Aster Data DW in the cloud is scalable commodity hardware with an Massively Parallel Processing (MPP) Architecture. It uses SQL, Map/Reduce, JDBC, ODBC, and is compatible with Extract Transfer and Load (ETL) tools. Aster Data architecture uses PostgreSQL and has a beehive heirarchy. Queens control the cluster and hold metadata while workers process and store it. If the queen fails it is replaced immediately. Text from DJCline.com

They think that all of this is easier to use and lowers their costs. They keep their headcount down and their revenue up. It works for them. The question is whether it will work elsewhere. Text from DJCline.com

08-18-09 SAP copy08-18-09 crowd pan1 copy08-18-09 sharethisslide copy

Copyright 2009 DJ Cline All rights reserved.